Unverified commit a2e7ccac, authored by WilliamZhang06, committed by GitHub

Merge branch 'PaddlePaddle:develop' into develop

repos:
  - repo: https://github.com/pre-commit/mirrors-yapf.git
    rev: v0.16.0
    hooks:
      - id: yapf
        files: \.py$
        exclude: (?=third_party).*(\.py)$
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: a11d9314b22d8f8c7556443875b731ef05965464
    hooks:
      - id: check-merge-conflict
      - id: check-symlinks
@@ -31,7 +32,7 @@
        - --jobs=1
        exclude: (?=third_party).*(\.py)$
  - repo: https://github.com/Lucas-C/pre-commit-hooks
    rev: v1.0.1
    hooks:
      - id: forbid-crlf
        files: \.md$
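These hooks are managed by [pre-commit](https://pre-commit.com). A minimal sketch of running them locally, assuming `pre-commit` is installed from PyPI:
```bash
pip install pre-commit
# register the hooks from .pre-commit-config.yaml as a git pre-commit hook
pre-commit install
# optionally run every configured hook against the whole repository once
pre-commit run --all-files
```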
......
@@ -539,6 +539,7 @@ You are warmly welcome to submit questions in [discussions](https://github.com/P
- Many thanks to [mymagicpower](https://github.com/mymagicpower) for the Java implementation of ASR upon [short](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_sdk) and [long](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_long_audio_sdk) audio files.
- Many thanks to [JiehangXie](https://github.com/JiehangXie)/[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo) for developing Virtual Uploader(VUP)/Virtual YouTuber(VTuber) with PaddleSpeech TTS function.
- Many thanks to [745165806](https://github.com/745165806)/[PaddleSpeechTask](https://github.com/745165806/PaddleSpeechTask) for contributing the Punctuation Restoration model.
- Many thanks to [kslz](https://github.com/kslz) for supplementary Chinese documentation.
Besides, PaddleSpeech depends on a lot of open source repositories. See [references](./docs/source/reference.md) for more information.
......
@@ -548,6 +548,7 @@ year={2021}
- Many thanks to [mymagicpower](https://github.com/mymagicpower) for the Java implementation of ASR for [short](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_sdk) and [long](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_long_audio_sdk) audio files with PaddleSpeech.
- Many thanks to [JiehangXie](https://github.com/JiehangXie)/[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo) for developing Virtual Uploader(VUP)/Virtual YouTuber(VTuber) with the PaddleSpeech TTS function.
- Many thanks to [745165806](https://github.com/745165806)/[PaddleSpeechTask](https://github.com/745165806/PaddleSpeechTask) for contributing the Punctuation Restoration model.
- Many thanks to [kslz](https://github.com/kslz) for supplementary Chinese documentation.
In addition, PaddleSpeech depends on many open-source repositories. See [references](./docs/source/reference.md) for more information.
......
# [VoxCeleb](http://www.robots.ox.ac.uk/~vgg/data/voxceleb/)
VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube.
VoxCeleb contains speech from speakers spanning a wide range of different ethnicities, accents, professions and ages.
All speaking face-tracks are captured "in the wild", with background chatter, laughter, overlapping speech, pose variation and different lighting conditions.
VoxCeleb consists of both audio and video. Each segment is at least 3 seconds long.
The dataset consists of two versions, VoxCeleb1 and VoxCeleb2. Each version has its own train/test split. For each we provide YouTube URLs, face detections and tracks, audio files, cropped face videos, and speaker meta-data. There is no overlap between the two versions.
For more details, refer to http://www.robots.ox.ac.uk/~vgg/data/voxceleb/
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Prepare VoxCeleb1 dataset
create manifest files.
Manifest file is a json-format file with each line containing the
meta data (i.e. audio filepath, transcript and audio duration)
of each audio file in the data set.
Researchers should download the VoxCeleb1 dataset themselves through the Google form
to get the username & password and unpack the data.
"""
import argparse
import codecs
import glob
import json
import os
import subprocess
from pathlib import Path
import soundfile
from utils.utility import check_md5sum
from utils.utility import download
from utils.utility import unzip
# by default, all the data will be downloaded into the current data/voxceleb directory
DATA_HOME = os.path.expanduser('.')
# if you use the http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/ as the download base url
# you need to get the username & password via the google form
# if you use the https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a as the download base url,
# you need to use --no-check-certificate to connect to the target download url
BASE_URL = "https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a"
# dev data
DEV_LIST = {
"vox1_dev_wav_partaa": "e395d020928bc15670b570a21695ed96",
"vox1_dev_wav_partab": "bbfaaccefab65d82b21903e81a8a8020",
"vox1_dev_wav_partac": "017d579a2a96a077f40042ec33e51512",
"vox1_dev_wav_partad": "7bb1e9f70fddc7a678fa998ea8b3ba19",
}
DEV_TARGET_DATA = "vox1_dev_wav_parta* vox1_dev_wav.zip ae63e55b951748cc486645f532ba230b"
# test data
TEST_LIST = {"vox1_test_wav.zip": "185fdc63c3c739954633d50379a3d102"}
TEST_TARGET_DATA = "vox1_test_wav.zip vox1_test_wav.zip 185fdc63c3c739954633d50379a3d102"
# kaldi trial
# this trial file is organized by kaldi according to the official file,
# which is a little different from the official trial veri_test2.txt
KALDI_BASE_URL = "http://www.openslr.org/resources/49/"
TRIAL_LIST = {"voxceleb1_test_v2.txt": "29fc7cc1c5d59f0816dc15d6e8be60f7"}
TRIAL_TARGET_DATA = "voxceleb1_test_v2.txt voxceleb1_test_v2.txt 29fc7cc1c5d59f0816dc15d6e8be60f7"
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--target_dir",
default=DATA_HOME + "/voxceleb1/",
type=str,
help="Directory to save the voxceleb1 dataset. (default: %(default)s)")
parser.add_argument(
"--manifest_prefix",
default="manifest",
type=str,
help="Filepath prefix for output manifests. (default: %(default)s)")
args = parser.parse_args()
def create_manifest(data_dir, manifest_path_prefix):
print("Creating manifest %s ..." % manifest_path_prefix)
json_lines = []
data_path = os.path.join(data_dir, "wav", "**", "*.wav")
total_sec = 0.0
total_text = 0.0
total_num = 0
speakers = set()
for audio_path in glob.glob(data_path, recursive=True):
audio_id = "-".join(audio_path.split("/")[-3:])
utt2spk = audio_path.split("/")[-3]
duration = soundfile.info(audio_path).duration
text = ""
json_lines.append(
json.dumps(
{
"utt": audio_id,
"utt2spk": str(utt2spk),
"feat": audio_path,
"feat_shape": (duration, ),
"text": text # compatible with asr data format
},
ensure_ascii=False))
total_sec += duration
total_text += len(text)
total_num += 1
speakers.add(utt2spk)
    # data_dir_name refers to dev or test
    # voxceleb1 is given explicitly in the path
data_dir_name = Path(data_dir).name
manifest_path_prefix = manifest_path_prefix + "." + data_dir_name
with codecs.open(manifest_path_prefix, 'w', encoding='utf-8') as f:
for line in json_lines:
f.write(line + "\n")
manifest_dir = os.path.dirname(manifest_path_prefix)
meta_path = os.path.join(manifest_dir, "voxceleb1." +
data_dir_name) + ".meta"
with codecs.open(meta_path, 'w', encoding='utf-8') as f:
print(f"{total_num} utts", file=f)
print(f"{len(speakers)} speakers", file=f)
print(f"{total_sec / (60 * 60)} h", file=f)
print(f"{total_text} text", file=f)
print(f"{total_text / total_sec} text/sec", file=f)
print(f"{total_sec / total_num} sec/utt", file=f)
def prepare_dataset(base_url, data_list, target_dir, manifest_path,
target_data):
if not os.path.exists(target_dir):
os.mkdir(target_dir)
    # if the wav directory already exists, there is nothing to do
if not os.path.exists(os.path.join(target_dir, "wav")):
# download all dataset part
for zip_part in data_list.keys():
download_url = " --no-check-certificate " + base_url + "/" + zip_part
download(
url=download_url,
md5sum=data_list[zip_part],
target_dir=target_dir)
        # concatenate all the parts into the target zip file
all_target_part, target_name, target_md5sum = target_data.split()
target_name = os.path.join(target_dir, target_name)
if not os.path.exists(target_name):
pack_part_cmd = "cat {}/{} > {}".format(target_dir, all_target_part,
target_name)
subprocess.call(pack_part_cmd, shell=True)
# check the target zip file md5sum
if not check_md5sum(target_name, target_md5sum):
            raise RuntimeError("{} MD5 checksum failed".format(target_name))
else:
print("Check {} md5sum successfully".format(target_name))
        # unzip the target zip file
if target_name.endswith(".zip"):
unzip(target_name, target_dir)
# create the manifest file
create_manifest(data_dir=target_dir, manifest_path_prefix=manifest_path)
def main():
if args.target_dir.startswith('~'):
args.target_dir = os.path.expanduser(args.target_dir)
prepare_dataset(
base_url=BASE_URL,
data_list=DEV_LIST,
target_dir=os.path.join(args.target_dir, "dev"),
manifest_path=args.manifest_prefix,
target_data=DEV_TARGET_DATA)
prepare_dataset(
base_url=BASE_URL,
data_list=TEST_LIST,
target_dir=os.path.join(args.target_dir, "test"),
manifest_path=args.manifest_prefix,
target_data=TEST_TARGET_DATA)
print("Manifest prepare done!")
if __name__ == '__main__':
main()
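# Example invocation (the script filename voxceleb1.py is an assumption; use the actual
# filename in your checkout). Both flags are defined by the argparse section above:
#
#   python3 voxceleb1.py \
#       --target_dir=./data/voxceleb1/ \
#       --manifest_prefix=./data/manifest.voxceleb1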
@@ -9,9 +9,10 @@ Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER |
[Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz)| Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers| 0.064 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0)
[Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 284 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.056 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1)
[Transformer Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_transformer_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 128 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0523 || 151 h | [Transformer Aishell ASR1](../../examples/aishell/asr1)
[Ds2 Offline Librispeech ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr0/asr0_deepspeech2_librispeech_ckpt_0.1.1.model.tar.gz)| Librispeech Dataset | Char-based | 518 MB | 2 Conv + 3 bidirectional LSTM layers| - |0.0725| 960 h | [Ds2 Offline Librispeech ASR0](../../examples/librispeech/asr0)
[Conformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_conformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 191 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0337 | 960 h | [Conformer Librispeech ASR1](../../examples/librispeech/asr1)
[Transformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_transformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0381 | 960 h | [Transformer Librispeech ASR1](../../examples/librispeech/asr1)
[Transformer Librispeech ASR2 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr2/asr2_transformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: JoinCTC w/ LM |-| 0.0240 | 960 h | [Transformer Librispeech ASR2](../../examples/librispeech/asr2)
### Language Model based on NGram
Language Model | Training Data | Token-based | Size | Descriptions
@@ -65,7 +66,7 @@ GE2E + FastSpeech2 | AISHELL-3 |[ge2e-fastspeech2-aishell3](https://github.com/
Model Type | Dataset| Example Link | Pretrained Models
:-------------:| :------------:| :-----: | :-----:
PANN | Audioset| [audioset_tagging_cnn](https://github.com/qiuqiangkong/audioset_tagging_cnn) | [panns_cnn6.pdparams](https://bj.bcebos.com/paddleaudio/models/panns_cnn6.pdparams), [panns_cnn10.pdparams](https://bj.bcebos.com/paddleaudio/models/panns_cnn10.pdparams), [panns_cnn14.pdparams](https://bj.bcebos.com/paddleaudio/models/panns_cnn14.pdparams)
PANN | ESC-50 |[pann-esc50](../../examples/esc50/cls0)|[esc50_cnn6.tar.gz](https://paddlespeech.bj.bcebos.com/cls/esc50/esc50_cnn6.tar.gz), [esc50_cnn10.tar.gz](https://paddlespeech.bj.bcebos.com/cls/esc50/esc50_cnn10.tar.gz), [esc50_cnn14.tar.gz](https://paddlespeech.bj.bcebos.com/cls/esc50/esc50_cnn14.tar.gz)
## Punctuation Restoration Models
Model Type | Dataset| Example Link | Pretrained Models
......
@@ -71,7 +71,3 @@ Check our [website](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html)
#### GE2E
1. [ge2e_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/ge2e_ckpt_0.3.zip)
## License
Parakeet is provided under the [Apache-2.0 license](LICENSE).
([简体中文](./quick_start_cn.md)|English)
# Quick Start of Text-to-Speech
The examples in PaddleSpeech are mainly classified by datasets; the TTS datasets we mainly use are:
* CSMSC (Mandarin single speaker)
......
(简体中文|[English](./quick_start.md))
# Quick Start of Text-to-Speech
The examples in PaddleSpeech are mainly classified by datasets; the TTS datasets we mainly use are:
* CSMSC (Mandarin, single speaker)
* AISHELL3 (Mandarin, multiple speakers)
* LJSpeech (English, single speaker)
* VCTK (English, multiple speakers)
The TTS models of PaddleSpeech have the following mapping:
* tts0 - Tacotron2
* tts1 - TransformerTTS
* tts2 - SpeedySpeech
* tts3 - FastSpeech2
* voc0 - WaveFlow
* voc1 - Parallel WaveGAN
* voc2 - MelGAN
* voc3 - MultiBand MelGAN
* voc4 - Style MelGAN
* voc5 - HiFiGAN
* vc0 - Tacotron2 Voice Clone with GE2E
* vc1 - FastSpeech2 Voice Clone with GE2E
## Quick Start
Let's take FastSpeech2 + Parallel WaveGAN with the CSMSC dataset as an example: [examples/csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc)
### Train Parallel WaveGAN with the CSMSC Dataset
- Go to the directory
```bash
cd examples/csmsc/voc1
```
- Set the environment variables
```bash
source path.sh
```
**You must do this before anything else.**
Set `MAIN_ROOT` to the project directory and use the `parallelwave_gan` model as `MODEL`.
- Run
```bash
bash run.sh
```
This is only a demo; please make sure the source data is ready and that each `step` works before running the next `step`.
### Train FastSpeech2 with the CSMSC Dataset
- Go to the directory
```bash
cd examples/csmsc/tts3
```
- Set the environment variables
```bash
source path.sh
```
**You must do this before anything else.**
Set `MAIN_ROOT` to the project directory and use the `fastspeech2` model as `MODEL`.
- Run
```bash
bash run.sh
```
This is only a demo; please make sure the source data is ready and that each `step` works before running the next `step`.
`run.sh` mainly includes the following steps:
- Set the paths.
- Preprocess the dataset.
- Train the model.
- Synthesize waveforms from `metadata.jsonl`.
- Synthesize waveforms from a text file (with the acoustic model).
- Inference using a static model (optional).
For more details, see the `README.md` in the corresponding example directory.
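If you only want to re-run part of the pipeline, `run.sh` accepts `--stage` and `--stop-stage`, as in the other examples in this repo; a couple of typical invocations (stage numbers follow the step order above):
```bash
# run only the data preprocessing stage
./run.sh --stage 0 --stop-stage 0
# skip preprocessing and run from training onwards
./run.sh --stage 1
```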
## TTS Pipeline
This section shows how to use the pretrained models provided by TTS and run inference with them.
Pretrained models in TTS are provided in archives. Extract them to get folders like the following:
**Acoustic Models:**
```text
checkpoint_name
├── default.yaml
├── snapshot_iter_*.pdz
├── speech_stats.npy
├── phone_id_map.txt
├── spk_id_map.txt (optional)
└── tone_id_map.txt (optional)
```
**Vocoders:**
```text
checkpoint_name
├── default.yaml
├── snapshot_iter_*.pdz
└── stats.npy
```
- `default.yaml` stores the config used to train the model.
- `snapshot_iter_*.pdz` is the checkpoint file, where `*` is the number of steps it has been trained for.
- `*_stats.npy` is the stats file of a feature, if that feature was normalized before training.
- `phone_id_map.txt` maps phones to phone IDs.
- `tone_id_map.txt` maps tones to tone IDs, used when tones and pinyin are separated before training the acoustic model (e.g. in the csmsc/speedyspeech example).
- `spk_id_map.txt` maps speakers to spk_ids in a multi-speaker acoustic model.
The example code below shows how to use the models for prediction.
### Acoustic Models (text to spectrogram)
The code below shows how to use a `FastSpeech2` model. After loading the pretrained model, use it and the normalizer object to construct a prediction object, then use `fastspeech2_inference(phone_ids)` to generate spectrograms, which can further be used to synthesize raw audio with a vocoder.
```python
from pathlib import Path
import numpy as np
import paddle
import yaml
from yacs.config import CfgNode
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference
from paddlespeech.t2s.modules.normalizer import ZScore
# examples/fastspeech2/baker/frontend.py
from frontend import Frontend
# load the pretrained acoustic model
checkpoint_dir = Path("fastspeech2_nosil_baker_ckpt_0.4")
with open(checkpoint_dir / "phone_id_map.txt", "r") as f:
phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id)
with open(checkpoint_dir / "default.yaml") as f:
fastspeech2_config = CfgNode(yaml.safe_load(f))
odim = fastspeech2_config.n_mels
model = FastSpeech2(
idim=vocab_size, odim=odim, **fastspeech2_config["model"])
model.set_state_dict(
paddle.load(args.fastspeech2_checkpoint)["main_params"])
model.eval()
# load the feature statistics file
stat = np.load(checkpoint_dir / "speech_stats.npy")
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
fastspeech2_normalizer = ZScore(mu, std)
# construct the inference object
fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)
# load Chinese Frontend
frontend = Frontend(checkpoint_dir / "phone_id_map.txt")
# get the phone ids of a Chinese sentence with the frontend
sentence = "你好吗?"
input_ids = frontend.get_input_ids(sentence, merge_sentences=True)
phone_ids = input_ids["phone_ids"]
flags = 0
# the frontend output is segmented; run inference per segment and concatenate the mel spectrograms
for part_phone_ids in phone_ids:
with paddle.no_grad():
temp_mel = fastspeech2_inference(part_phone_ids)
if flags == 0:
mel = temp_mel
flags = 1
else:
mel = paddle.concat([mel, temp_mel])
```
### Vocoder (spectrogram to waveform)
The code below shows how to use a `Parallel WaveGAN` model. Like the example above, after loading the pretrained model, use it and the normalizer object to construct a prediction object, then use `pwg_inference(mel)` to generate raw audio (wav format).
```python
from pathlib import Path
import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode
from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import PWGInference
from paddlespeech.t2s.modules.normalizer import ZScore
# load the pretrained vocoder
checkpoint_dir = Path("parallel_wavegan_baker_ckpt_0.4")
with open(checkpoint_dir / "pwg_default.yaml") as f:
pwg_config = CfgNode(yaml.safe_load(f))
vocoder = PWGGenerator(**pwg_config["generator_params"])
vocoder.set_state_dict(paddle.load(args.pwg_params))
vocoder.remove_weight_norm()
vocoder.eval()
# load the feature statistics file
stat = np.load(checkpoint_dir / "pwg_stats.npy")
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
pwg_normalizer = ZScore(mu, std)
# construct the prediction object from the pretrained model and the normalizer
pwg_inference = PWGInference(pwg_normalizer, vocoder)
# spectrogram to waveform
wav = pwg_inference(mel)
sf.write(
audio_path,
wav.numpy(),
samplerate=fastspeech2_config.fs)
```
\ No newline at end of file
# Tacotron2 + AISHELL-3 Voice Cloning
This example contains code used to train a [Tacotron2](https://arxiv.org/abs/1712.05884) model with [AISHELL-3](http://www.aishelltech.com/aishell_3). The trained model can be used in the Voice Cloning task. We refer to the model structure of [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf). The general steps are as follows:
1. Speaker Encoder: We use Speaker Verification to train a speaker encoder. Datasets used in this task are different from those used in `Tacotron2` because the transcriptions are not needed; we use more datasets, refer to [ge2e](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/ge2e).
2. Synthesizer: We use the trained speaker encoder to generate a speaker embedding for each sentence in AISHELL-3. This embedding is an extra input of `Tacotron2` which will be concatenated with the encoder outputs.
3. Vocoder: We use [Parallel Wave GAN](http://arxiv.org/abs/1910.11480) as the neural Vocoder, refer to [voc1](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1).
## Dataset
### Download and Extract
Download AISHELL-3.
```bash
wget https://www.openslr.org/resources/93/data_aishell3.tgz
```
Extract AISHELL-3.
```bash
mkdir data_aishell3
tar zxvf data_aishell3.tgz -C data_aishell3
```
### Get MFA Result and Extract
We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2.
You can download it from here: [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (use MFA1.x now) in our repo.
## Pretrained GE2E Model
We use a pretrained GE2E model to generate a speaker embedding for each sentence.
Download the pretrained GE2E model from here: [ge2e_ckpt_0.3.zip](https://bj.bcebos.com/paddlespeech/Parakeet/released_models/ge2e/ge2e_ckpt_0.3.zip), and `unzip` it.
## Get Started
Assume the path to the dataset is `~/datasets/data_aishell3`.
Assume the path to the MFA result of AISHELL-3 is `./aishell3_alignment_tone`.
Assume the path to the pretrained ge2e model is `./ge2e_ckpt_0.3`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize waveform from `metadata.jsonl`.
5. start a voice cloning inference.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage; for example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/preprocess.sh ${conf_path} ${ge2e_ckpt_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the `dump` folder is listed below.
```text
dump
├── dev
│   ├── norm
│   └── raw
├── embed
│   ├── SSB0005
│   ├── SSB0009
│   ├── ...
│   └── ...
├── phone_id_map.txt
├── speaker_id_map.txt
├── test
│   ├── norm
│   └── raw
└── train
    ├── norm
    ├── raw
    └── speech_stats.npy
```
The `embed` folder contains the generated speaker embeddings for each sentence in AISHELL-3, which have the same file structure as the wav files and are in `.npy` format.
The computing time of utterance embedding can be x hours.
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The raw folder contains the speech features of each utterance, while the norm folder contains the normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, text_lengths, speech_lengths, durations, the path of speech features, speaker, and the id of each utterance.
The preprocessing step is very similar to that one of [tts0](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts0), but there is one more `ge2e/inference` step here.
### Model Training
`./local/train.sh` calls `${BIN_DIR}/train.py`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
The training step is very similar to that one of [tts0](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts0), but we should set `--voice-cloning=True` when calling `${BIN_DIR}/train.py`.
### Synthesizing
We use [parallel wavegan](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1) as the neural vocoder.
Download the pretrained parallel wavegan model from [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip) and unzip it.
```bash
unzip pwg_aishell3_ckpt_0.5.zip
```
The Parallel WaveGAN checkpoint contains the files listed below.
```text
pwg_aishell3_ckpt_0.5
├── default.yaml               # default config used to train parallel wavegan
├── feats_stats.npy            # statistics used to normalize spectrogram when training parallel wavegan
└── snapshot_iter_1000000.pdz  # generator parameters of parallel wavegan
```
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
The synthesizing step is very similar to that one of [tts0](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts0), but we should set `--voice-cloning=True` when calling `${BIN_DIR}/../synthesize.py`.
### Voice Cloning
Assume there are some reference audios in `./ref_audio`
```text
ref_audio
├── 001238.wav
├── LJ015-0254.wav
└── audio_self_test.mp3
```
`./local/voice_cloning.sh` calls `${BIN_DIR}/../voice_cloning.py`
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_output_path} ${ckpt_name} ${ge2e_params_path} ${ref_audio_dir}
```
## Pretrained Model
[tacotron2_aishell3_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_aishell3_ckpt_0.3.zip).
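To try the pretrained model directly, a typical way to fetch and unpack it (assuming `wget` and `unzip` are available) is:
```bash
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_aishell3_ckpt_0.3.zip
unzip tacotron2_aishell3_ckpt_0.3.zip
```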
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # sr
n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
# Only used for feats_type != raw
fmin: 80 # Minimum frequency of Mel basis.
fmax: 7600 # Maximum frequency of Mel basis.
n_mels: 80 # The number of mel basis.
###########################################################
# DATA SETTING #
###########################################################
batch_size: 64
num_workers: 2
###########################################################
# MODEL SETTING #
###########################################################
model: # keyword arguments for the selected model
embed_dim: 512 # char or phn embedding dimension
elayers: 1 # number of blstm layers in encoder
eunits: 512 # number of blstm units
econv_layers: 3 # number of convolutional layers in encoder
econv_chans: 512 # number of channels in convolutional layer
econv_filts: 5 # filter size of convolutional layer
atype: location # attention function type
adim: 512 # attention dimension
aconv_chans: 32 # number of channels in convolutional layer of attention
aconv_filts: 15 # filter size of convolutional layer of attention
cumulate_att_w: True # whether to cumulate attention weight
dlayers: 2 # number of lstm layers in decoder
dunits: 1024 # number of lstm units in decoder
prenet_layers: 2 # number of layers in prenet
prenet_units: 256 # number of units in prenet
postnet_layers: 5 # number of layers in postnet
postnet_chans: 512 # number of channels in postnet
postnet_filts: 5 # filter size of postnet layer
output_activation: null # activation function for the final output
use_batch_norm: True # whether to use batch normalization in encoder
use_concate: True # whether to concatenate encoder embedding with decoder outputs
use_residual: False # whether to use residual connection in encoder
dropout_rate: 0.5 # dropout rate
zoneout_rate: 0.1 # zoneout rate
reduction_factor: 1 # reduction factor
spk_embed_dim: 256 # speaker embedding dimension
spk_embed_integration_type: concat # how to integrate speaker embedding
###########################################################
# UPDATER SETTING #
###########################################################
updater:
use_masking: True # whether to apply masking for padded part in loss calculation
bce_pos_weight: 5.0 # weight of positive sample in binary cross entropy calculation
use_guided_attn_loss: True # whether to use guided attention loss
guided_attn_loss_sigma: 0.4 # sigma of guided attention loss
guided_attn_loss_lambda: 1.0 # strength of guided attention loss
##########################################################
# OPTIMIZER SETTING #
##########################################################
optimizer:
optim: adam # optimizer type
learning_rate: 1.0e-03 # learning rate
epsilon: 1.0e-06 # epsilon
weight_decay: 0.0 # weight decay coefficient
###########################################################
# TRAINING SETTING #
###########################################################
max_epoch: 200
num_snapshots: 5
###########################################################
# OTHER SETTING #
###########################################################
seed: 42
\ No newline at end of file
#!/bin/bash

stage=3
stop_stage=100

config_path=$1
ge2e_ckpt_path=$2

# gen speaker embedding
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    python3 ${MAIN_ROOT}/paddlespeech/vector/exps/ge2e/inference.py \
        --input=~/datasets/data_aishell3/train/wav/ \
        --output=dump/embed \
        --checkpoint_path=${ge2e_ckpt_path}
fi

# copy from tts3/preprocess
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    # get durations from MFA's result
    echo "Generate durations.txt from MFA results ..."
    python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
        --inputdir=./aishell3_alignment_tone \
        --output durations.txt \
        --config=${config_path}
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    # extract features
    echo "Extract features ..."
    python3 ${BIN_DIR}/preprocess.py \
        --dataset=aishell3 \
        --rootdir=~/datasets/data_aishell3/ \
        --dumpdir=dump \
        --dur-file=durations.txt \
        --config=${config_path} \
        --num-cpu=20 \
        --cut-sil=True \
        --spk_emb_dir=dump/embed
fi

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # get features' stats (mean and std)
    echo "Get features' stats ..."
    python3 ${MAIN_ROOT}/utils/compute_statistics.py \
        --metadata=dump/train/raw/metadata.jsonl \
        --field-name="speech"
fi

if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
    # normalize and convert phone to id; dev and test should use train's stats
    echo "Normalize ..."
    python3 ${BIN_DIR}/normalize.py \
        --metadata=dump/train/raw/metadata.jsonl \
        --dumpdir=dump/train/norm \
        --speech-stats=dump/train/speech_stats.npy \
        --phones-dict=dump/phone_id_map.txt \
        --speaker-dict=dump/speaker_id_map.txt

    python3 ${BIN_DIR}/normalize.py \
        --metadata=dump/dev/raw/metadata.jsonl \
        --dumpdir=dump/dev/norm \
        --speech-stats=dump/train/speech_stats.npy \
        --phones-dict=dump/phone_id_map.txt \
        --speaker-dict=dump/speaker_id_map.txt

    python3 ${BIN_DIR}/normalize.py \
        --metadata=dump/test/raw/metadata.jsonl \
        --dumpdir=dump/test/norm \
        --speech-stats=dump/train/speech_stats.npy \
        --phones-dict=dump/phone_id_map.txt \
        --speaker-dict=dump/speaker_id_map.txt
fi
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=tacotron2_aishell3 \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_aishell3 \
--voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
--voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
--voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt \
--voice-cloning=True
#!/bin/bash

config_path=$1
train_output_path=$2

python3 ${BIN_DIR}/train.py \
    --train-metadata=dump/train/norm/metadata.jsonl \
    --dev-metadata=dump/dev/norm/metadata.jsonl \
    --config=${config_path} \
    --output-dir=${train_output_path} \
    --ngpu=2 \
    --phones-dict=dump/phone_id_map.txt \
    --voice-cloning=True
#!/bin/bash

config_path=$1
train_output_path=$2
ckpt_name=$3
ge2e_params_path=$4
ref_audio_dir=$5

FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../voice_cloning.py \
    --am=tacotron2_aishell3 \
    --am_config=${config_path} \
    --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
    --am_stat=dump/train/speech_stats.npy \
    --voc=pwgan_aishell3 \
    --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
    --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
    --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
    --ge2e_params_path=${ge2e_params_path} \
    --text="凯莫瑞安联合体的经济崩溃迫在眉睫。" \
    --input-dir=${ref_audio_dir} \
    --output-dir=${train_output_path}/vc_syn \
    --phones-dict=dump/phone_id_map.txt
@@ -9,5 +9,5 @@ export PYTHONDONTWRITEBYTECODE=1
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=new_tacotron2
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}
@@ -3,25 +3,20 @@
set -e
source path.sh

gpus=0,1
stage=0
stop_stage=100

conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_482.pdz
ref_audio_dir=ref_audio

# not include ".pdparams" here
ge2e_ckpt_path=./ge2e_ckpt_0.3/step-3000000

# include ".pdparams" here
ge2e_params_path=${ge2e_ckpt_path}.pdparams

# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
@@ -30,15 +25,20 @@ source ${MAIN_ROOT}/utils/parse_options.sh || exit 1

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    # prepare data
    CUDA_VISIBLE_DEVICES=${gpus} ./local/preprocess.sh ${conf_path} ${ge2e_ckpt_path} || exit -1
fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    # train model, all `ckpt` under `train_output_path/checkpoints/` dir
    CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    # synthesize, vocoder is pwgan
    CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # synthesize, vocoder is pwgan
    CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_output_path} ${ckpt_name} ${ge2e_params_path} ${ref_audio_dir} || exit -1
fi
@@ -114,7 +114,7 @@ ref_audio
├── LJ015-0254.wav
└── audio_self_test.mp3
```
`./local/voice_cloning.sh` calls `${BIN_DIR}/../voice_cloning.py`
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_output_path} ${ckpt_name} ${ge2e_params_path} ${ref_audio_dir}
......
@@ -8,13 +8,15 @@ ref_audio_dir=$5
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../voice_cloning.py \
    --am=fastspeech2_aishell3 \
    --am_config=${config_path} \
    --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
    --am_stat=dump/train/speech_stats.npy \
    --voc=pwgan_aishell3 \
    --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
    --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
    --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
    --ge2e_params_path=${ge2e_params_path} \
    --text="凯莫瑞安联合体的经济崩溃迫在眉睫。" \
    --input-dir=${ref_audio_dir} \
......
@@ -44,15 +44,13 @@ dump
│   ├── norm
│   └── raw
└── train
    ├── norm
    ├── raw
    └── speech_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The raw folder contains speech features of each utterance, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, text_lengths, speech_lengths, durations, the path of speech features, speaker, and the id of each utterance.
### Model Training
```bash
......
#!/bin/bash
train_output_path=$1
stage=0
stop_stage=0
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=tacotron2_csmsc \
--voc=pwgan_csmsc \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/pd_infer_out \
--phones_dict=dump/phone_id_map.txt
fi
# for more GAN Vocoders
# multi band melgan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=tacotron2_csmsc \
--voc=mb_melgan_csmsc \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/pd_infer_out \
--phones_dict=dump/phone_id_map.txt
fi
# style melgan
# style melgan's Dygraph to Static Graph is not ready now
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=tacotron2_csmsc \
--voc=style_melgan_csmsc \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/pd_infer_out \
--phones_dict=dump/phone_id_map.txt
fi
# hifigan
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=tacotron2_csmsc \
--voc=hifigan_csmsc \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/pd_infer_out \
--phones_dict=dump/phone_id_map.txt
fi
\ No newline at end of file
@@ -22,8 +22,9 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
        --lang=zh \
        --text=${BIN_DIR}/../sentences.txt \
        --output_dir=${train_output_path}/test_e2e \
        --phones_dict=dump/phone_id_map.txt \
        --inference_dir=${train_output_path}/inference
fi

# for more GAN Vocoders
......
([简体中文](./README_cn.md)|English)
# FastSpeech2 with CSMSC
This example contains code used to train a [Fastspeech2](https://arxiv.org/abs/2006.04558) model with the [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html).
......
(简体中文|[English](./README.md))
# Train FastSpeech2 with the CSMSC Dataset
This example contains code used to train a [Fastspeech2](https://arxiv.org/abs/2006.04558) model with the [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html).
## Dataset
### Download and Extract
Download the dataset from the [official website](https://test.data-baker.com/data/index/source).
### Get MFA Results and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get the phone durations for fastspeech2.
You can download [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz) from here, or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa).
## Get Started
Assume the path to the dataset is `~/datasets/BZNSYP`.
Assume the path to the MFA result of CSMSC is `./baker_alignment_tone`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize waveforms
    - synthesize waveforms from `metadata.jsonl`.
    - synthesize waveforms from a text file.
5. inference using a static model.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage; for example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the `dump` folder is listed below.
```text
dump
├── dev
│ ├── norm
│ └── raw
├── phone_id_map.txt
├── speaker_id_map.txt
├── test
│ ├── norm
│ └── raw
└── train
├── energy_stats.npy
├── norm
├── pitch_stats.npy
├── raw
└── speech_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the speech, pitch, and energy features of each utterance, while the `norm` folder contains the normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, text_lengths, speech_lengths, durations, the path of speech features, the path of pitch features, the path of energy features, speaker, and the id of each utterance.
### Model Training
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
`./local/train.sh` calls `${BIN_DIR}/train.py`.
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--ngpu NGPU] [--phones-dict PHONES_DICT]
[--speaker-dict SPEAKER_DICT] [--voice-cloning VOICE_CLONING]
Train a FastSpeech2 model.
optional arguments:
-h, --help show this help message and exit
--config CONFIG fastspeech2 config file.
--train-metadata TRAIN_METADATA
training data.
--dev-metadata DEV_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu=0, use cpu.
--phones-dict PHONES_DICT
phone vocabulary file.
--speaker-dict SPEAKER_DICT
speaker id map file for multiple speaker model.
--voice-cloning VOICE_CLONING
whether training voice cloning model.
```
1. `--config` is a config file in yaml format to overwrite the default config, which is located at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the normalized metadata files under `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results; checkpoints are saved in the `checkpoints/` subdirectory of this directory.
4. `--ngpu` is the number of GPUs to use; if ngpu == 0, the CPU is used.
5. `--phones-dict` is the path of the phone vocabulary file.
### Synthesizing
We use [parallel wavegan](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc1) as the neural vocoder.
Download the pretrained parallel wavegan model from [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip) and unzip it.
```bash
unzip pwg_baker_ckpt_0.4.zip
```
The Parallel WaveGAN checkpoint contains the files listed below.
```text
pwg_baker_ckpt_0.4
├── pwg_default.yaml              # default config used to train parallel wavegan
├── pwg_snapshot_iter_400000.pdz  # model parameters of parallel wavegan
└── pwg_stats.npy                 # statistics used to normalize spectrogram when training parallel wavegan
```
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveforms from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h]
[--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT]
[--voice-cloning VOICE_CLONING]
[--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--ngpu NGPU]
[--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]
Synthesize with acoustic model & vocoder
optional arguments:
-h, --help show this help message and exit
--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}
Choose acoustic model type of tts task.
--am_config AM_CONFIG
Config of acoustic model. Use deault config when it is
None.
--am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model.
--phones_dict PHONES_DICT
phone vocabulary file.
--tones_dict TONES_DICT
tone vocabulary file.
--speaker_dict SPEAKER_DICT
speaker id map file.
--voice-cloning VOICE_CLONING
whether training voice cloning model.
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}
Choose vocoder type of tts task.
--voc_config VOC_CONFIG
Config of voc. Use deault config when it is None.
--voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc.
--ngpu NGPU if ngpu == 0, use cpu.
--test_metadata TEST_METADATA
test metadata.
--output_dir OUTPUT_DIR
output dir.
```
`./local/synthesize_e2e.sh` 调用 `${BIN_DIR}/../synthesize_e2e.py`,即可从文本文件中合成波形。
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize_e2e.py [-h]
[--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--tones_dict TONES_DICT]
[--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
[--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--lang LANG]
[--inference_dir INFERENCE_DIR] [--ngpu NGPU]
[--text TEXT] [--output_dir OUTPUT_DIR]
Synthesize with acoustic model & vocoder
optional arguments:
-h, --help show this help message and exit
--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}
Choose acoustic model type of tts task.
--am_config AM_CONFIG
Config of acoustic model. Use deault config when it is
None.
--am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model.
--phones_dict PHONES_DICT
phone vocabulary file.
--tones_dict TONES_DICT
tone vocabulary file.
--speaker_dict SPEAKER_DICT
speaker id map file.
--spk_id SPK_ID spk id for multi speaker acoustic model
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}
Choose vocoder type of tts task.
--voc_config VOC_CONFIG
Config of voc. Use deault config when it is None.
--voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc.
--lang LANG Choose model language. zh or en
--inference_dir INFERENCE_DIR
dir to save inference models
--ngpu NGPU if ngpu == 0, use cpu.
--text TEXT text to synthesize, a 'utt_id sentence' pair per line.
--output_dir OUTPUT_DIR
output dir.
```
1. `--am` 是声学模型类型,格式应符合 {model_name}_{dataset}。
2. `--am_config`、`--am_ckpt`、`--am_stat` 和 `--phones_dict` 是声学模型的参数,对应于 fastspeech2 预训练模型中的 4 个文件。
3. `--voc` 是声码器(vocoder)类型,格式应符合 {model_name}_{dataset}。
4. `--voc_config`、`--voc_ckpt` 和 `--voc_stat` 是声码器的参数,对应于 parallel wavegan 预训练模型中的 3 个文件。
5. `--lang` 是模型的语言,可以是 `zh` 或 `en`。
6. `--test_metadata` 应为 `dump` 文件夹中 `test` 下的规范化元数据文件。
7. `--text` 是文本文件,其中包含要合成的句子(文件格式见下方示例)。
8. `--output_dir` 是保存合成音频文件的目录。
9. `--ngpu` 是要使用的 GPU 数,如果 ngpu == 0,则使用 cpu。
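`--text` 指向的文本文件中,每行是一个 `utt_id 句子` 对(见上面的帮助信息)。下面是一个假设的示例文件,句子内容仅作演示:
```bash
# 生成一个示例文本文件,之后可通过 --text=my_sentences.txt 传给 synthesize_e2e.py
cat > my_sentences.txt <<EOF
001 欢迎使用飞桨语音合成系统。
002 今天天气真不错。
EOF
```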
### 推理
在合成之后,我们将在 `${train_output_path}/inference` 中得到 fastspeech2 和 pwgan 的静态模型。
`./local/inference.sh` 调用 `${BIN_DIR}/inference.py`,为 fastspeech2 + pwgan 的合成提供了一个 paddle 静态模型推理示例。
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path}
```
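也可以不经过 `inference.sh`,直接调用推理脚本。下面是一个示意性的调用(参数参考了本仓库推理脚本中的用法,输出目录为假设值):
```bash
python3 ${BIN_DIR}/inference.py \
    --inference_dir=${train_output_path}/inference \
    --am=fastspeech2_csmsc \
    --voc=pwgan_csmsc \
    --text=${BIN_DIR}/../sentences.txt \
    --output_dir=${train_output_path}/pd_infer_out \
    --phones_dict=dump/phone_id_map.txt
```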
## 预训练模型
预训练的 FastSpeech2 模型(训练音频首尾不含静音):
- [fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip)
- [fastspeech2_conformer_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_baker_ckpt_0.5.zip)
静态模型可以在这里下载 [fastspeech2_nosil_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_static_0.4.zip).
Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss
:-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
default| 2(gpu) x 76000|1.0991|0.59132|0.035815|0.31915|0.15287|
conformer| 2(gpu) x 76000|1.0675|0.56103|0.035869|0.31553|0.15509|
FastSpeech2 检查点包含下列文件。
```text
fastspeech2_nosil_baker_ckpt_0.4
├── default.yaml # 用于训练 fastspeech2 的默认配置
├── phone_id_map.txt # 训练 fastspeech2 时的音素词汇文件
├── snapshot_iter_76000.pdz # 模型参数和优化器状态
└── speech_stats.npy # 训练 fastspeech2 时用于规范化频谱图的统计数据
```
您可以使用以下脚本,通过预训练的 fastspeech2 和 parallel wavegan 模型为 `${BIN_DIR}/../sentences.txt` 中的句子合成语音。
```bash
source path.sh
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_csmsc \
--am_config=fastspeech2_nosil_baker_ckpt_0.4/default.yaml \
--am_ckpt=fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz \
--am_stat=fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy \
--voc=pwgan_csmsc \
--voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
--voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
--voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=exp/default/test_e2e \
--inference_dir=exp/default/inference \
--phones_dict=fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt
```
...@@ -49,3 +49,14 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then ...@@ -49,3 +49,14 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
--output_dir=${train_output_path}/pd_infer_out \ --output_dir=${train_output_path}/pd_infer_out \
--phones_dict=dump/phone_id_map.txt --phones_dict=dump/phone_id_map.txt
fi fi
# wavernn
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=fastspeech2_csmsc \
--voc=wavernn_csmsc \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/pd_infer_out \
--phones_dict=dump/phone_id_map.txt
fi
\ No newline at end of file
...@@ -89,3 +89,25 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then ...@@ -89,3 +89,25 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
--inference_dir=${train_output_path}/inference \ --inference_dir=${train_output_path}/inference \
--phones_dict=dump/phone_id_map.txt --phones_dict=dump/phone_id_map.txt
fi fi
# wavernn
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
echo "in wavernn syn_e2e"
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=wavernn_csmsc \
--voc_config=wavernn_test/default.yaml \
--voc_ckpt=wavernn_test/snapshot_iter_5000.pdz \
--voc_stat=wavernn_test/feats_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--inference_dir=${train_output_path}/inference
fi
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # Sampling rate.
n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
n_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation. (Hz)
fmax: 7600 # Maximum frequency in mel basis calculation. (Hz)
mu_law: True # Recommended to suppress noise if using raw bits.
###########################################################
# MODEL SETTING #
###########################################################
model:
rnn_dims: 512 # Hidden dims of RNN Layers.
fc_dims: 512
bits: 9 # Bit depth of signal
aux_context_window: 2 # Context window size for auxiliary feature.
# If set to 2, previous 2 and future 2 frames will be considered.
aux_channels: 80 # Number of channels for auxiliary feature conv.
# Must be the same as num_mels.
upsample_scales: [4, 5, 3, 5] # Upsampling scales. Product of these must be the same as hop size (same as pwgan here).
compute_dims: 128 # Dims of Conv1D in MelResNet.
res_out_dims: 128 # Dims of output in MelResNet.
res_blocks: 10 # Number of residual blocks.
mode: RAW # either 'RAW' (softmax on raw bits) or 'MOL' (sample from mixture of logistics)
inference:
gen_batched: True # whether to generate samples in batch mode
target: 12000 # target number of samples to be generated in each batch entry
overlap: 600 # number of samples for crossfading between batches
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 64 # Batch size.
batch_max_steps: 4500 # Length of each audio in batch. Make sure dividable by hop_size.
num_workers: 2 # Number of workers in DataLoader.
###########################################################
# OPTIMIZER SETTING #
###########################################################
grad_clip: 4.0
learning_rate: 1.0e-4
###########################################################
# INTERVAL SETTING #
###########################################################
train_max_steps: 400000 # Number of training steps.
save_interval_steps: 5000 # Interval steps to save checkpoint.
eval_interval_steps: 1000 # Interval steps to evaluate the network.
gen_eval_samples_interval_steps: 5000 # the iteration interval of generating valid samples
generate_num: 5 # number of samples to generate at each checkpoint
###########################################################
# OTHER SETTING #
###########################################################
num_snapshots: 10 # max number of snapshots to keep while training
seed: 42 # random seed for paddle, random, and np.random
#!/bin/bash
stage=0
stop_stage=100
config_path=$1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# get durations from MFA's result
echo "Generate durations.txt from MFA results ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./baker_alignment_tone \
--output=durations.txt \
--config=${config_path}
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/../gan_vocoder/preprocess.py \
--rootdir=~/datasets/BZNSYP/ \
--dataset=baker \
--dumpdir=dump \
--dur-file=durations.txt \
--config=${config_path} \
--cut-sil=True \
--num-cpu=20
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# get features' stats(mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="feats"
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize, dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/../gan_vocoder/normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../gan_vocoder/normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../gan_vocoder/normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--stats=dump/train/feats_stats.npy
fi
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize.py \
--config=${config_path} \
--checkpoint=${train_output_path}/checkpoints/${ckpt_name} \
--test-metadata=dump/test/norm/metadata.jsonl \
--output-dir=${train_output_path}/test
#!/bin/bash
config_path=$1
train_output_path=$2
FLAGS_cudnn_exhaustive_search=true \
FLAGS_conv_workspace_size_limit=4000 \
python ${BIN_DIR}/train.py \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=${config_path} \
--output-dir=${train_output_path} \
--ngpu=1
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=wavernn
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}
\ No newline at end of file
#!/bin/bash
set -e
source path.sh
gpus=0,1
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
test_input=dump/dump_gta_test
ckpt_name=snapshot_iter_100000.pdz
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# prepare data
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# synthesize
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
# Tacotron2 with LJSpeech
PaddlePaddle dynamic graph implementation of Tacotron2, a neural network architecture for speech synthesis directly from text. The implementation is based on [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884).
## Dataset
We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```
## Get Started
Assume the path to the dataset is `~/datasets/LJSpeech-1.1`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize mels.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
### Model Training
`./local/train.sh` calls `${BIN_DIR}/train.py`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
Here's the complete help message.
```text
usage: train.py [-h] [--config FILE] [--data DATA_DIR] [--output OUTPUT_DIR]
[--checkpoint_path CHECKPOINT_PATH] [--ngpu NGPU] [--opts ...]
optional arguments:
-h, --help show this help message and exit
--config FILE path of the config file to overwrite to default config
with.
--data DATA_DIR path to the dataset.
--output OUTPUT_DIR path to save checkpoint and logs.
--checkpoint_path CHECKPOINT_PATH
path of the checkpoint to load
--ngpu NGPU if ngpu == 0, use cpu.
--opts ... options to overwrite --config file and the default
config, passing in KEY VALUE pairs
```
If you want to train on CPU, just set `--ngpu=0`.
If you want to train on multiple GPUs, just set `--ngpu` to the number of GPUs.
By default, training resumes from the latest checkpoint in `--output`. If you want to start a new training run, use a new `${OUTPUTPATH}` that contains no checkpoint.
If you want to resume from another existing model, set `--checkpoint_path` to the checkpoint path you want to load.
**Note: The checkpoint path cannot contain the file extension.**
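For example, a hypothetical resume command could look like the sketch below (the checkpoint name and directories are placeholders; note that the file extension is omitted):
```bash
python3 ${BIN_DIR}/train.py \
    --data=preprocessed_ljspeech \
    --output=output \
    --checkpoint_path=output/checkpoints/step-20000 \
    --ngpu=1
```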
### Synthesizing
`./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which synthesizes **mels** from a text file here.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h] [--config FILE] [--checkpoint_path CHECKPOINT_PATH]
[--input INPUT] [--output OUTPUT] [--ngpu NGPU]
[--opts ...] [-v]
generate mel spectrogram with TransformerTTS.
optional arguments:
-h, --help show this help message and exit
--config FILE extra config to overwrite the default config
--checkpoint_path CHECKPOINT_PATH
path of the checkpoint to load.
--input INPUT path of the text sentences
--output OUTPUT path to save outputs
--ngpu NGPU if ngpu == 0, use cpu.
--opts ... options to overwrite --config file and the default
config, passing in KEY VALUE pairs
-v, --verbose print msg
```
**P.S.** You can use [waveflow](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc0) as the neural vocoder to synthesize mels into wavs. (Please refer to `synthesize.sh` in our LJSpeech waveflow example.)
## Pretrained Models
Pretrained Models can be downloaded from the links below. We provide 2 models with different configurations.
1. This model uses a binary classifier to predict the stop token. [tacotron2_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.3.zip)
2. This model does not have a stop token predictor. It uses the attention peak position to decide whether all the contents have been uttered. Also, guided attention loss is used to speed up training. This model is trained with `configs/alternative.yaml`. [tacotron2_ljspeech_ckpt_0.3_alternative.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.3_alternative.zip)
# This configuration is for Paddle to train Tacotron 2. Compared to the
# original paper, this configuration additionally uses the guided attention
# loss to accelerate the learning of the diagonal attention. It requires
# only a single GPU with 12 GB memory and takes about 1 day to finish
# training on a Titan V.
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 22050 # Sampling rate.
n_fft: 1024 # FFT size (samples).
n_shift: 256 # Hop size (samples). 11.6ms
win_length: null # Window length (samples).
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
n_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation. (Hz)
fmax: 7600 # Maximum frequency in mel basis calculation. (Hz)
###########################################################
# DATA SETTING #
###########################################################
batch_size: 64
num_workers: 2
###########################################################
# MODEL SETTING #
###########################################################
model: # keyword arguments for the selected model
embed_dim: 512 # char or phn embedding dimension
elayers: 1 # number of blstm layers in encoder
eunits: 512 # number of blstm units
econv_layers: 3 # number of convolutional layers in encoder
econv_chans: 512 # number of channels in convolutional layer
econv_filts: 5 # filter size of convolutional layer
atype: location # attention function type
adim: 512 # attention dimension
aconv_chans: 32 # number of channels in convolutional layer of attention
aconv_filts: 15 # filter size of convolutional layer of attention
cumulate_att_w: True # whether to cumulate attention weight
dlayers: 2 # number of lstm layers in decoder
dunits: 1024 # number of lstm units in decoder
prenet_layers: 2 # number of layers in prenet
prenet_units: 256 # number of units in prenet
postnet_layers: 5 # number of layers in postnet
postnet_chans: 512 # number of channels in postnet
postnet_filts: 5 # filter size of postnet layer
output_activation: null # activation function for the final output
use_batch_norm: True # whether to use batch normalization in encoder
use_concate: True # whether to concatenate encoder embedding with decoder outputs
use_residual: False # whether to use residual connection in encoder
dropout_rate: 0.5 # dropout rate
zoneout_rate: 0.1 # zoneout rate
reduction_factor: 1 # reduction factor
spk_embed_dim: null # speaker embedding dimension
###########################################################
# UPDATER SETTING #
###########################################################
updater:
use_masking: True # whether to apply masking for padded part in loss calculation
bce_pos_weight: 5.0 # weight of positive sample in binary cross entropy calculation
use_guided_attn_loss: True # whether to use guided attention loss
guided_attn_loss_sigma: 0.4 # sigma of guided attention loss
guided_attn_loss_lambda: 1.0 # strength of guided attention loss
##########################################################
# OPTIMIZER SETTING #
##########################################################
optimizer:
optim: adam # optimizer type
learning_rate: 1.0e-03 # learning rate
epsilon: 1.0e-06 # epsilon
weight_decay: 0.0 # weight decay coefficient
###########################################################
# TRAINING SETTING #
###########################################################
max_epoch: 300
num_snapshots: 5
###########################################################
# OTHER SETTING #
###########################################################
seed: 42
#!/bin/bash #!/bin/bash
preprocess_path=$1 stage=0
stop_stage=100
python3 ${BIN_DIR}/preprocess.py \ config_path=$1
--input=~/datasets/LJSpeech-1.1 \
--output=${preprocess_path} \ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
-v \ # get durations from MFA's result
\ No newline at end of file echo "Generate durations.txt from MFA results ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./ljspeech_alignment \
--output=durations.txt \
--config=${config_path}
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/preprocess.py \
--dataset=ljspeech \
--rootdir=~/datasets/LJSpeech-1.1/ \
--dumpdir=dump \
--dur-file=durations.txt \
--config=${config_path} \
--num-cpu=20 \
--cut-sil=True
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# get features' stats(mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="speech"
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize and covert phone to id, dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
fi
#!/bin/bash #!/bin/bash
train_output_path=$1 config_path=$1
ckpt_name=$2 train_output_path=$2
ckpt_name=$3
python3 ${BIN_DIR}/synthesize.py \ FLAGS_allocator_strategy=naive_best_fit \
--config=${train_output_path}/config.yaml \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
--checkpoint_path=${train_output_path}/checkpoints/${ckpt_name} \ python3 ${BIN_DIR}/../synthesize.py \
--input=${BIN_DIR}/../sentences_en.txt \ --am=tacotron2_ljspeech \
--output=${train_output_path}/test \ --am_config=${config_path} \
--ngpu=1 --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_ljspeech \
--voc_config=pwg_ljspeech_ckpt_0.5/pwg_default.yaml \
--voc_ckpt=pwg_ljspeech_ckpt_0.5/pwg_snapshot_iter_400000.pdz \
--voc_stat=pwg_ljspeech_ckpt_0.5/pwg_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
#!/bin/bash #!/bin/bash
preprocess_path=$1 config_path=$1
train_output_path=$2 train_output_path=$2
python3 ${BIN_DIR}/train.py \ python3 ${BIN_DIR}/train.py \
--data=${preprocess_path} \ --train-metadata=dump/train/norm/metadata.jsonl \
--output=${train_output_path} \ --dev-metadata=dump/dev/norm/metadata.jsonl \
--config=${config_path} \
--output-dir=${train_output_path} \
--ngpu=1 \ --ngpu=1 \
--phones-dict=dump/phone_id_map.txt
\ No newline at end of file
...@@ -9,5 +9,5 @@ export PYTHONDONTWRITEBYTECODE=1 ...@@ -9,5 +9,5 @@ export PYTHONDONTWRITEBYTECODE=1
export PYTHONIOENCODING=UTF-8 export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH} export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=tacotron2 MODEL=new_tacotron2
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL} export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}
...@@ -3,13 +3,13 @@ ...@@ -3,13 +3,13 @@
set -e set -e
source path.sh source path.sh
gpus=0 gpus=0,1
stage=0 stage=0
stop_stage=100 stop_stage=100
preprocess_path=preprocessed_ljspeech conf_path=conf/default.yaml
train_output_path=output train_output_path=exp/default
ckpt_name=step-35000 ckpt_name=snapshot_iter_201.pdz
# with the following command, you can choose the stage range you want to run # with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0` # such as `./run.sh --stage 0 --stop-stage 0`
...@@ -18,16 +18,20 @@ source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 ...@@ -18,16 +18,20 @@ source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data # prepare data
./local/preprocess.sh ${preprocess_path} || exit -1 ./local/preprocess.sh ${conf_path} || exit -1
fi fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `train_output_path/checkpoints/` dir # train model, all `ckpt` under `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${preprocess_path} ${train_output_path} || exit -1 CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# train model, all `ckpt` under `train_output_path/checkpoints/` dir # synthesize, vocoder is pwgan
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${train_output_path} ${ckpt_name} || exit -1 CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# synthesize_e2e, vocoder is pwgan
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
...@@ -10,7 +10,7 @@ stop_stage=100 ...@@ -10,7 +10,7 @@ stop_stage=100
preprocess_path=preprocessed_ljspeech preprocess_path=preprocessed_ljspeech
train_output_path=output train_output_path=output
# mel generated by Tacotron2 # mel generated by Tacotron2
input_mel_path=../tts0/output/test input_mel_path=${preprocess_path}/mel_test
ckpt_name=step-10000 ckpt_name=step-10000
# with the following command, you can choose the stage range you want to run # with the following command, you can choose the stage range you want to run
...@@ -28,5 +28,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then ...@@ -28,5 +28,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
fi fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
mkdir -p ${preprocess_path}/mel_test
cp ${preprocess_path}/mel/LJ050-001*.npy ${preprocess_path}/mel_test/
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${input_mel_path} ${train_output_path} ${ckpt_name} || exit -1 CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${input_mel_path} ${train_output_path} ${ckpt_name} || exit -1
fi fi
...@@ -27,7 +27,7 @@ cd a0 ...@@ -27,7 +27,7 @@ cd a0
应用程序会自动下载 THCHS-30数据集,处理成 MFA 所需的文件格式并开始训练,您可以修改 `run.sh` 中的参数 `LEXICON_NAME` 来决定您需要强制对齐的级别(word、syllable 和 phone) 应用程序会自动下载 THCHS-30数据集,处理成 MFA 所需的文件格式并开始训练,您可以修改 `run.sh` 中的参数 `LEXICON_NAME` 来决定您需要强制对齐的级别(word、syllable 和 phone)
## MFA 所使用的字典 ## MFA 所使用的字典
--- ---
MFA 字典的格式请参考: [MFA 官方文档 Dictionary format ](https://montreal-forced-aligner.readthedocs.io/en/latest/dictionary.html) MFA 字典的格式请参考: [MFA 官方文档](https://montreal-forced-aligner.readthedocs.io/en/latest/)
phone.lexicon 直接使用的是 `THCHS-30/data_thchs30/lm_phone/lexicon.txt` phone.lexicon 直接使用的是 `THCHS-30/data_thchs30/lm_phone/lexicon.txt`
word.lexicon 考虑到了中文的多音字,使用**带概率的字典**, 生成规则请参考 `local/gen_word2phone.py` word.lexicon 考虑到了中文的多音字,使用**带概率的字典**, 生成规则请参考 `local/gen_word2phone.py`
`syllable.lexicon` 获取自 [DNSun/thchs30-pinyin2tone](https://github.com/DNSun/thchs30-pinyin2tone) `syllable.lexicon` 获取自 [DNSun/thchs30-pinyin2tone](https://github.com/DNSun/thchs30-pinyin2tone)
...@@ -39,4 +39,4 @@ word.lexicon 考虑到了中文的多音字,使用**带概率的字典**, 生 ...@@ -39,4 +39,4 @@ word.lexicon 考虑到了中文的多音字,使用**带概率的字典**, 生
**syllabel 级别:** [syllable.lexicon](https://paddlespeech.bj.bcebos.com/MFA/THCHS30/syllable/syllable.lexicon)[对齐结果](https://paddlespeech.bj.bcebos.com/MFA/THCHS30/syllable/thchs30_alignment.tar.gz)[模型](https://paddlespeech.bj.bcebos.com/MFA/THCHS30/syllable/thchs30_model.zip) **syllabel 级别:** [syllable.lexicon](https://paddlespeech.bj.bcebos.com/MFA/THCHS30/syllable/syllable.lexicon)[对齐结果](https://paddlespeech.bj.bcebos.com/MFA/THCHS30/syllable/thchs30_alignment.tar.gz)[模型](https://paddlespeech.bj.bcebos.com/MFA/THCHS30/syllable/thchs30_model.zip)
**word 级别:** [word.lexicon](https://paddlespeech.bj.bcebos.com/MFA/THCHS30/word/word.lexicon)[对齐结果](https://paddlespeech.bj.bcebos.com/MFA/THCHS30/word/thchs30_alignment.tar.gz)[模型](https://paddlespeech.bj.bcebos.com/MFA/THCHS30/word/thchs30_model.zip) **word 级别:** [word.lexicon](https://paddlespeech.bj.bcebos.com/MFA/THCHS30/word/word.lexicon)[对齐结果](https://paddlespeech.bj.bcebos.com/MFA/THCHS30/word/thchs30_alignment.tar.gz)[模型](https://paddlespeech.bj.bcebos.com/MFA/THCHS30/word/thchs30_model.zip)
随后,您可以参考 [MFA 官方文档 Align using pretrained models](https://montreal-forced-aligner.readthedocs.io/en/stable/aligning.html#align-using-pretrained-models) 使用我们给您提供好的模型直接对自己的数据集进行强制对齐,注意,您需要使用和模型对应的 lexicon 文件,当文本是汉字时,您需要用空格把不同的**汉字**(而不是词语)分开 随后,您可以参考 [MFA 官方文档](https://montreal-forced-aligner.readthedocs.io/en/latest/) 使用我们给您提供好的模型直接对自己的数据集进行强制对齐,注意,您需要使用和模型对应的 lexicon 文件,当文本是汉字时,您需要用空格把不同的**汉字**(而不是词语)分开
...@@ -91,6 +91,20 @@ pretrained_models = { ...@@ -91,6 +91,20 @@ pretrained_models = {
'lm_md5': 'lm_md5':
'29e02312deb2e59b3c8686c7966d4fe3' '29e02312deb2e59b3c8686c7966d4fe3'
}, },
"deepspeech2offline_librispeech-en-16k": {
'url':
'https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr0/asr0_deepspeech2_librispeech_ckpt_0.1.1.model.tar.gz',
'md5':
'f5666c81ad015c8de03aac2bc92e5762',
'cfg_path':
'model.yaml',
'ckpt_path':
'exp/deepspeech2/checkpoints/avg_1',
'lm_url':
'https://deepspeech.bj.bcebos.com/en_lm/common_crawl_00.prune01111.trie.klm',
'lm_md5':
'099a601759d467cd0a8523ff939819c5'
},
} }
model_alias = { model_alias = {
...@@ -171,8 +185,9 @@ class ASRExecutor(BaseExecutor): ...@@ -171,8 +185,9 @@ class ASRExecutor(BaseExecutor):
""" """
Download and returns pretrained resources path of current task. Download and returns pretrained resources path of current task.
""" """
assert tag in pretrained_models, 'Can not find pretrained resources of {}.'.format( support_models = list(pretrained_models.keys())
tag) assert tag in pretrained_models, 'The model "{}" you want to use has not been supported, please choose other models.\nThe support models includes:\n\t\t{}\n'.format(
tag, '\n\t\t'.join(support_models))
res_path = os.path.join(MODEL_HOME, tag) res_path = os.path.join(MODEL_HOME, tag)
decompressed_path = download_and_decompress(pretrained_models[tag], decompressed_path = download_and_decompress(pretrained_models[tag],
...@@ -328,18 +343,15 @@ class ASRExecutor(BaseExecutor): ...@@ -328,18 +343,15 @@ class ASRExecutor(BaseExecutor):
audio = self._inputs["audio"] audio = self._inputs["audio"]
audio_len = self._inputs["audio_len"] audio_len = self._inputs["audio_len"]
if "deepspeech2online" in model_type or "deepspeech2offline" in model_type: if "deepspeech2online" in model_type or "deepspeech2offline" in model_type:
result_transcripts = self.model.decode( decode_batch_size = audio.shape[0]
audio, self.model.decoder.init_decoder(
audio_len, decode_batch_size, self.text_feature.vocab_list,
self.text_feature.vocab_list, cfg.decoding_method, cfg.lang_model_path, cfg.alpha, cfg.beta,
decoding_method=cfg.decoding_method, cfg.beam_size, cfg.cutoff_prob, cfg.cutoff_top_n,
lang_model_path=cfg.lang_model_path, cfg.num_proc_bsearch)
beam_alpha=cfg.alpha,
beam_beta=cfg.beta, result_transcripts = self.model.decode(audio, audio_len)
beam_size=cfg.beam_size, self.model.decoder.del_decoder()
cutoff_prob=cfg.cutoff_prob,
cutoff_top_n=cfg.cutoff_top_n,
num_processes=cfg.num_proc_bsearch)
self._outputs["result"] = result_transcripts[0] self._outputs["result"] = result_transcripts[0]
elif "conformer" in model_type or "transformer" in model_type: elif "conformer" in model_type or "transformer" in model_type:
......
...@@ -34,7 +34,7 @@ from .entry import commands ...@@ -34,7 +34,7 @@ from .entry import commands
try: try:
from .. import __version__ from .. import __version__
except ImportError: except ImportError:
__version__ = 0.0.0 # for develop branch __version__ = "0.0.0" # for develop branch
requests.adapters.DEFAULT_RETRIES = 3 requests.adapters.DEFAULT_RETRIES = 3
......
...@@ -51,7 +51,7 @@ def _batch_shuffle(indices, batch_size, epoch, clipped=False): ...@@ -51,7 +51,7 @@ def _batch_shuffle(indices, batch_size, epoch, clipped=False):
""" """
rng = np.random.RandomState(epoch) rng = np.random.RandomState(epoch)
shift_len = rng.randint(0, batch_size - 1) shift_len = rng.randint(0, batch_size - 1)
batch_indices = list(zip(* [iter(indices[shift_len:])] * batch_size)) batch_indices = list(zip(*[iter(indices[shift_len:])] * batch_size))
rng.shuffle(batch_indices) rng.shuffle(batch_indices)
batch_indices = [item for batch in batch_indices for item in batch] batch_indices = [item for batch in batch_indices for item in batch]
assert clipped is False assert clipped is False
......
...@@ -12,5 +12,6 @@ ...@@ -12,5 +12,6 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
from .audio import AudioProcessor from .audio import AudioProcessor
from .codec import *
from .spec_normalizer import LogMagnitude from .spec_normalizer import LogMagnitude
from .spec_normalizer import NormalizerBase from .spec_normalizer import NormalizerBase
...@@ -11,3 +11,41 @@ ...@@ -11,3 +11,41 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import math
import numpy as np
import paddle
# x: [0: 2**bit-1], return: [-1, 1]
def label_2_float(x, bits):
return 2 * x / (2**bits - 1.) - 1.
#x: [-1, 1], return: [0, 2**bits-1]
def float_2_label(x, bits):
assert abs(x).max() <= 1.0
x = (x + 1.) * (2**bits - 1) / 2
return x.clip(0, 2**bits - 1)
# y: [-1, 1], mu: 2**bits, return: [0, 2**bits-1]
# see https://en.wikipedia.org/wiki/%CE%9C-law_algorithm
# be careful the input `mu` here, which is +1 than that of the link above
def encode_mu_law(x, mu):
mu = mu - 1
fx = np.sign(x) * np.log(1 + mu * np.abs(x)) / np.log(1 + mu)
return np.floor((fx + 1) / 2 * mu + 0.5)
# from_labels = True:
# y: [0: 2**bit-1], mu: 2**bits, return: [-1,1]
# from_labels = False:
# y: [-1, 1], return: [-1, 1]
def decode_mu_law(y, mu, from_labels=True):
# TODO: get rid of log2 - makes no sense
if from_labels:
y = label_2_float(y, math.log2(mu))
mu = mu - 1
x = paddle.sign(y) / mu * ((1 + mu)**paddle.abs(y) - 1)
return x
...@@ -46,6 +46,47 @@ def tacotron2_single_spk_batch_fn(examples): ...@@ -46,6 +46,47 @@ def tacotron2_single_spk_batch_fn(examples):
return batch return batch
def tacotron2_multi_spk_batch_fn(examples):
# fields = ["text", "text_lengths", "speech", "speech_lengths"]
text = [np.array(item["text"], dtype=np.int64) for item in examples]
speech = [np.array(item["speech"], dtype=np.float32) for item in examples]
text_lengths = [
np.array(item["text_lengths"], dtype=np.int64) for item in examples
]
speech_lengths = [
np.array(item["speech_lengths"], dtype=np.int64) for item in examples
]
text = batch_sequences(text)
speech = batch_sequences(speech)
# convert each batch to paddle.Tensor
text = paddle.to_tensor(text)
speech = paddle.to_tensor(speech)
text_lengths = paddle.to_tensor(text_lengths)
speech_lengths = paddle.to_tensor(speech_lengths)
batch = {
"text": text,
"text_lengths": text_lengths,
"speech": speech,
"speech_lengths": speech_lengths,
}
# spk_emb has a higher priority than spk_id
if "spk_emb" in examples[0]:
spk_emb = [
np.array(item["spk_emb"], dtype=np.float32) for item in examples
]
spk_emb = batch_sequences(spk_emb)
spk_emb = paddle.to_tensor(spk_emb)
batch["spk_emb"] = spk_emb
elif "spk_id" in examples[0]:
spk_id = [np.array(item["spk_id"], dtype=np.int64) for item in examples]
spk_id = paddle.to_tensor(spk_id)
batch["spk_id"] = spk_id
return batch
def speedyspeech_single_spk_batch_fn(examples): def speedyspeech_single_spk_batch_fn(examples):
# fields = ["phones", "tones", "num_phones", "num_frames", "feats", "durations"] # fields = ["phones", "tones", "num_phones", "num_frames", "feats", "durations"]
phones = [np.array(item["phones"], dtype=np.int64) for item in examples] phones = [np.array(item["phones"], dtype=np.int64) for item in examples]
......
...@@ -14,6 +14,10 @@ ...@@ -14,6 +14,10 @@
import numpy as np import numpy as np
import paddle import paddle
from paddlespeech.t2s.audio.codec import encode_mu_law
from paddlespeech.t2s.audio.codec import float_2_label
from paddlespeech.t2s.audio.codec import label_2_float
class Clip(object): class Clip(object):
"""Collate functor for training vocoders. """Collate functor for training vocoders.
...@@ -49,7 +53,7 @@ class Clip(object): ...@@ -49,7 +53,7 @@ class Clip(object):
self.end_offset = -(self.batch_max_frames + aux_context_window) self.end_offset = -(self.batch_max_frames + aux_context_window)
self.mel_threshold = self.batch_max_frames + 2 * aux_context_window self.mel_threshold = self.batch_max_frames + 2 * aux_context_window
def __call__(self, examples): def __call__(self, batch):
"""Convert into batch tensors. """Convert into batch tensors.
Parameters Parameters
...@@ -67,11 +71,11 @@ class Clip(object): ...@@ -67,11 +71,11 @@ class Clip(object):
""" """
# check length # check length
examples = [ batch = [
self._adjust_length(b['wave'], b['feats']) for b in examples self._adjust_length(b['wave'], b['feats']) for b in batch
if b['feats'].shape[0] > self.mel_threshold if b['feats'].shape[0] > self.mel_threshold
] ]
xs, cs = [b[0] for b in examples], [b[1] for b in examples] xs, cs = [b[0] for b in batch], [b[1] for b in batch]
# make batch with random cut # make batch with random cut
c_lengths = [c.shape[0] for c in cs] c_lengths = [c.shape[0] for c in cs]
...@@ -89,7 +93,7 @@ class Clip(object): ...@@ -89,7 +93,7 @@ class Clip(object):
c_batch = np.stack( c_batch = np.stack(
[c[start:end] for c, start, end in zip(cs, c_starts, c_ends)]) [c[start:end] for c, start, end in zip(cs, c_starts, c_ends)])
# convert each batch to tensor, asuume that each item in batch has the same length # convert each batch to tensor, assume that each item in batch has the same length
y_batch = paddle.to_tensor( y_batch = paddle.to_tensor(
y_batch, dtype=paddle.float32).unsqueeze(1) # (B, 1, T) y_batch, dtype=paddle.float32).unsqueeze(1) # (B, 1, T)
c_batch = paddle.to_tensor( c_batch = paddle.to_tensor(
...@@ -120,3 +124,113 @@ class Clip(object): ...@@ -120,3 +124,113 @@ class Clip(object):
0] * self.hop_size, f"wave length: ({len(x)}), mel length: ({c.shape[0]})" 0] * self.hop_size, f"wave length: ({len(x)}), mel length: ({c.shape[0]})"
return x, c return x, c
class WaveRNNClip(Clip):
def __init__(self,
mode: str='RAW',
batch_max_steps: int=4500,
hop_size: int=300,
aux_context_window: int=2,
bits: int=9,
mu_law: bool=True):
self.mode = mode
self.mel_win = batch_max_steps // hop_size + 2 * aux_context_window
self.batch_max_steps = batch_max_steps
self.hop_size = hop_size
self.aux_context_window = aux_context_window
self.mu_law = mu_law
self.batch_max_frames = batch_max_steps // hop_size
self.mel_threshold = self.batch_max_frames + 2 * aux_context_window
if self.mode == 'MOL':
self.bits = 16
else:
self.bits = bits
def to_quant(self, wav):
if self.mode == 'RAW':
if self.mu_law:
quant = encode_mu_law(wav, mu=2**self.bits)
else:
quant = float_2_label(wav, bits=self.bits)
elif self.mode == 'MOL':
quant = float_2_label(wav, bits=16)
quant = quant.astype(np.int64)
return quant
def __call__(self, batch):
# voc_pad = 2 this will pad the input so that the resnet can 'see' wider than input length
# max_offsets = n_frames - 2 - (mel_win + 2 * hp.voc_pad) = n_frames - 15
"""Convert into batch tensors.
Parameters
----------
batch : list
list of tuple of the pair of audio and features.
Audio shape (T, ), features shape(T', C).
Returns
----------
Tensor
Input signal batch (B, 1, T).
Tensor
Target signal batch (B, 1, T).
Tensor
Auxiliary feature batch (B, C, T'), where
T = (T' - 2 * aux_context_window) * hop_size.
"""
# check length
batch = [
self._adjust_length(b['wave'], b['feats']) for b in batch
if b['feats'].shape[0] > self.mel_threshold
]
wav, mel = [b[0] for b in batch], [b[1] for b in batch]
        # transpose mel here
mel = [x.T for x in mel]
max_offsets = [
x.shape[-1] - 2 - (self.mel_win + 2 * self.aux_context_window)
for x in mel
]
# the slice point of mel selecting randomly
mel_offsets = [np.random.randint(0, offset) for offset in max_offsets]
# the slice point of wav selecting randomly, which is behind 2(=pad) frames
sig_offsets = [(offset + self.aux_context_window) * self.hop_size
for offset in mel_offsets]
# mels.shape[1] = voc_seq_len // hop_length + 2 * voc_pad
mels = [
x[:, mel_offsets[i]:mel_offsets[i] + self.mel_win]
for i, x in enumerate(mel)
]
# label.shape[1] = voc_seq_len + 1
wav = [self.to_quant(x) for x in wav]
labels = [
x[sig_offsets[i]:sig_offsets[i] + self.batch_max_steps + 1]
for i, x in enumerate(wav)
]
mels = np.stack(mels).astype(np.float32)
labels = np.stack(labels).astype(np.int64)
mels = paddle.to_tensor(mels)
labels = paddle.to_tensor(labels, dtype='int64')
# x is input, y is label
x = labels[:, :self.batch_max_steps]
y = labels[:, 1:]
'''
mode = RAW:
mu_law = True:
quant: bits = 9 0, 1, 2, ..., 509, 510, 511 int
mu_law = False
quant bits = 9 [0, 511] float
mode = MOL:
quant: bits = 16 [0. 65536] float
'''
        # x should be normalized to [-1, 1] in RAW mode
x = label_2_float(paddle.cast(x, dtype='float32'), self.bits)
        # y should be normalized to [-1, 1] in MOL mode
if self.mode == 'MOL':
y = label_2_float(paddle.cast(y, dtype='float32'), self.bits)
return x, y, mels
...@@ -29,6 +29,7 @@ from paddlespeech.t2s.datasets.preprocess_utils import merge_silence ...@@ -29,6 +29,7 @@ from paddlespeech.t2s.datasets.preprocess_utils import merge_silence
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2 from paddlespeech.t2s.models.fastspeech2 import FastSpeech2
from paddlespeech.t2s.models.fastspeech2 import StyleFastSpeech2Inference from paddlespeech.t2s.models.fastspeech2 import StyleFastSpeech2Inference
from paddlespeech.t2s.modules.normalizer import ZScore from paddlespeech.t2s.modules.normalizer import ZScore
from paddlespeech.t2s.utils import str2bool
def evaluate(args, fastspeech2_config): def evaluate(args, fastspeech2_config):
...@@ -196,9 +197,6 @@ def main(): ...@@ -196,9 +197,6 @@ def main():
parser.add_argument( parser.add_argument(
"--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.") "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
def str2bool(str):
return True if str.lower() == 'true' else False
parser.add_argument( parser.add_argument(
"--cut-sil", "--cut-sil",
type=str2bool, type=str2bool,
......
...@@ -35,6 +35,7 @@ from paddlespeech.t2s.datasets.preprocess_utils import get_input_token ...@@ -35,6 +35,7 @@ from paddlespeech.t2s.datasets.preprocess_utils import get_input_token
from paddlespeech.t2s.datasets.preprocess_utils import get_phn_dur from paddlespeech.t2s.datasets.preprocess_utils import get_phn_dur
from paddlespeech.t2s.datasets.preprocess_utils import get_spk_id_map from paddlespeech.t2s.datasets.preprocess_utils import get_spk_id_map
from paddlespeech.t2s.datasets.preprocess_utils import merge_silence from paddlespeech.t2s.datasets.preprocess_utils import merge_silence
from paddlespeech.t2s.utils import str2bool
def process_sentence(config: Dict[str, Any], def process_sentence(config: Dict[str, Any],
...@@ -203,9 +204,6 @@ def main(): ...@@ -203,9 +204,6 @@ def main():
parser.add_argument( parser.add_argument(
"--num-cpu", type=int, default=1, help="number of process.") "--num-cpu", type=int, default=1, help="number of process.")
def str2bool(str):
return True if str.lower() == 'true' else False
parser.add_argument( parser.add_argument(
"--cut-sil", "--cut-sil",
type=str2bool, type=str2bool,
......
...@@ -38,6 +38,7 @@ from paddlespeech.t2s.training.extensions.visualizer import VisualDL ...@@ -38,6 +38,7 @@ from paddlespeech.t2s.training.extensions.visualizer import VisualDL
from paddlespeech.t2s.training.optimizer import build_optimizers from paddlespeech.t2s.training.optimizer import build_optimizers
from paddlespeech.t2s.training.seeding import seed_everything from paddlespeech.t2s.training.seeding import seed_everything
from paddlespeech.t2s.training.trainer import Trainer from paddlespeech.t2s.training.trainer import Trainer
from paddlespeech.t2s.utils import str2bool
def train_sp(args, config): def train_sp(args, config):
...@@ -182,9 +183,6 @@ def main(): ...@@ -182,9 +183,6 @@ def main():
default=None, default=None,
help="speaker id map file for multiple speaker model.") help="speaker id map file for multiple speaker model.")
def str2bool(str):
return True if str.lower() == 'true' else False
parser.add_argument( parser.add_argument(
"--voice-cloning", "--voice-cloning",
type=str2bool, type=str2bool,
......
...@@ -41,6 +41,7 @@ from paddlespeech.t2s.training.extensions.snapshot import Snapshot ...@@ -41,6 +41,7 @@ from paddlespeech.t2s.training.extensions.snapshot import Snapshot
from paddlespeech.t2s.training.extensions.visualizer import VisualDL from paddlespeech.t2s.training.extensions.visualizer import VisualDL
from paddlespeech.t2s.training.seeding import seed_everything from paddlespeech.t2s.training.seeding import seed_everything
from paddlespeech.t2s.training.trainer import Trainer from paddlespeech.t2s.training.trainer import Trainer
from paddlespeech.t2s.utils import str2bool
def train_sp(args, config): def train_sp(args, config):
...@@ -204,8 +205,6 @@ def train_sp(args, config): ...@@ -204,8 +205,6 @@ def train_sp(args, config):
def main(): def main():
# parse args and config and redirect to train_sp # parse args and config and redirect to train_sp
def str2bool(str):
return True if str.lower() == 'true' else False
parser = argparse.ArgumentParser( parser = argparse.ArgumentParser(
description="Train a ParallelWaveGAN model.") description="Train a ParallelWaveGAN model.")
......
...@@ -30,6 +30,7 @@ from yacs.config import CfgNode ...@@ -30,6 +30,7 @@ from yacs.config import CfgNode
from paddlespeech.t2s.data.get_feats import LogMelFBank from paddlespeech.t2s.data.get_feats import LogMelFBank
from paddlespeech.t2s.datasets.preprocess_utils import get_phn_dur from paddlespeech.t2s.datasets.preprocess_utils import get_phn_dur
from paddlespeech.t2s.datasets.preprocess_utils import merge_silence from paddlespeech.t2s.datasets.preprocess_utils import merge_silence
from paddlespeech.t2s.utils import str2bool
def process_sentence(config: Dict[str, Any], def process_sentence(config: Dict[str, Any],
...@@ -165,9 +166,6 @@ def main(): ...@@ -165,9 +166,6 @@ def main():
parser.add_argument( parser.add_argument(
"--dur-file", default=None, type=str, help="path to durations.txt.") "--dur-file", default=None, type=str, help="path to durations.txt.")
def str2bool(str):
return True if str.lower() == 'true' else False
parser.add_argument( parser.add_argument(
"--cut-sil", "--cut-sil",
type=str2bool, type=str2bool,
......
...@@ -33,7 +33,7 @@ def main(): ...@@ -33,7 +33,7 @@ def main():
default='fastspeech2_csmsc', default='fastspeech2_csmsc',
choices=[ choices=[
'speedyspeech_csmsc', 'fastspeech2_csmsc', 'fastspeech2_aishell3', 'speedyspeech_csmsc', 'fastspeech2_csmsc', 'fastspeech2_aishell3',
'fastspeech2_vctk' 'fastspeech2_vctk', 'tacotron2_csmsc'
], ],
help='Choose acoustic model type of tts task.') help='Choose acoustic model type of tts task.')
parser.add_argument( parser.add_argument(
...@@ -54,7 +54,7 @@ def main(): ...@@ -54,7 +54,7 @@ def main():
default='pwgan_csmsc', default='pwgan_csmsc',
choices=[ choices=[
'pwgan_csmsc', 'mb_melgan_csmsc', 'hifigan_csmsc', 'pwgan_aishell3', 'pwgan_csmsc', 'mb_melgan_csmsc', 'hifigan_csmsc', 'pwgan_aishell3',
'pwgan_vctk' 'pwgan_vctk', 'wavernn_csmsc'
], ],
help='Choose vocoder type of tts task.') help='Choose vocoder type of tts task.')
# other # other
......
...@@ -33,6 +33,7 @@ from paddlespeech.t2s.datasets.preprocess_utils import get_input_token ...@@ -33,6 +33,7 @@ from paddlespeech.t2s.datasets.preprocess_utils import get_input_token
from paddlespeech.t2s.datasets.preprocess_utils import get_phn_dur from paddlespeech.t2s.datasets.preprocess_utils import get_phn_dur
from paddlespeech.t2s.datasets.preprocess_utils import get_spk_id_map from paddlespeech.t2s.datasets.preprocess_utils import get_spk_id_map
from paddlespeech.t2s.datasets.preprocess_utils import merge_silence from paddlespeech.t2s.datasets.preprocess_utils import merge_silence
from paddlespeech.t2s.utils import str2bool
def process_sentence(config: Dict[str, Any], def process_sentence(config: Dict[str, Any],
...@@ -179,9 +180,6 @@ def main(): ...@@ -179,9 +180,6 @@ def main():
parser.add_argument( parser.add_argument(
"--num-cpu", type=int, default=1, help="number of process.") "--num-cpu", type=int, default=1, help="number of process.")
def str2bool(str):
return True if str.lower() == 'true' else False
parser.add_argument( parser.add_argument(
"--cut-sil", "--cut-sil",
type=str2bool, type=str2bool,
......
...@@ -27,6 +27,7 @@ from paddle.io import DataLoader ...@@ -27,6 +27,7 @@ from paddle.io import DataLoader
from paddle.io import DistributedBatchSampler from paddle.io import DistributedBatchSampler
from yacs.config import CfgNode from yacs.config import CfgNode
from paddlespeech.t2s.datasets.am_batch_fn import tacotron2_multi_spk_batch_fn
from paddlespeech.t2s.datasets.am_batch_fn import tacotron2_single_spk_batch_fn from paddlespeech.t2s.datasets.am_batch_fn import tacotron2_single_spk_batch_fn
from paddlespeech.t2s.datasets.data_table import DataTable from paddlespeech.t2s.datasets.data_table import DataTable
from paddlespeech.t2s.models.new_tacotron2 import Tacotron2 from paddlespeech.t2s.models.new_tacotron2 import Tacotron2
...@@ -37,6 +38,7 @@ from paddlespeech.t2s.training.extensions.visualizer import VisualDL ...@@ -37,6 +38,7 @@ from paddlespeech.t2s.training.extensions.visualizer import VisualDL
from paddlespeech.t2s.training.optimizer import build_optimizers from paddlespeech.t2s.training.optimizer import build_optimizers
from paddlespeech.t2s.training.seeding import seed_everything from paddlespeech.t2s.training.seeding import seed_everything
from paddlespeech.t2s.training.trainer import Trainer from paddlespeech.t2s.training.trainer import Trainer
from paddlespeech.t2s.utils import str2bool
def train_sp(args, config): def train_sp(args, config):
...@@ -60,33 +62,38 @@ def train_sp(args, config): ...@@ -60,33 +62,38 @@ def train_sp(args, config):
# dataloader has been too verbose # dataloader has been too verbose
logging.getLogger("DataLoader").disabled = True logging.getLogger("DataLoader").disabled = True
# construct dataset for training and validation fields = [
with jsonlines.open(args.train_metadata, 'r') as reader:
train_metadata = list(reader)
train_dataset = DataTable(
data=train_metadata,
fields=[
"text", "text",
"text_lengths", "text_lengths",
"speech", "speech",
"speech_lengths", "speech_lengths",
], ]
converters={
converters = {
"speech": np.load, "speech": np.load,
}, ) }
if args.voice_cloning:
print("Training voice cloning!")
collate_fn = tacotron2_multi_spk_batch_fn
fields += ["spk_emb"]
converters["spk_emb"] = np.load
else:
print("single speaker tacotron2!")
collate_fn = tacotron2_single_spk_batch_fn
# construct dataset for training and validation
with jsonlines.open(args.train_metadata, 'r') as reader:
train_metadata = list(reader)
train_dataset = DataTable(
data=train_metadata,
fields=fields,
converters=converters, )
with jsonlines.open(args.dev_metadata, 'r') as reader: with jsonlines.open(args.dev_metadata, 'r') as reader:
dev_metadata = list(reader) dev_metadata = list(reader)
dev_dataset = DataTable( dev_dataset = DataTable(
data=dev_metadata, data=dev_metadata,
fields=[ fields=fields,
"text", converters=converters, )
"text_lengths",
"speech",
"speech_lengths",
],
converters={
"speech": np.load,
}, )
# collate function and dataloader # collate function and dataloader
train_sampler = DistributedBatchSampler( train_sampler = DistributedBatchSampler(
...@@ -100,7 +107,7 @@ def train_sp(args, config): ...@@ -100,7 +107,7 @@ def train_sp(args, config):
train_dataloader = DataLoader( train_dataloader = DataLoader(
train_dataset, train_dataset,
batch_sampler=train_sampler, batch_sampler=train_sampler,
collate_fn=tacotron2_single_spk_batch_fn, collate_fn=collate_fn,
num_workers=config.num_workers) num_workers=config.num_workers)
dev_dataloader = DataLoader( dev_dataloader = DataLoader(
...@@ -108,7 +115,7 @@ def train_sp(args, config): ...@@ -108,7 +115,7 @@ def train_sp(args, config):
shuffle=False, shuffle=False,
drop_last=False, drop_last=False,
batch_size=config.batch_size, batch_size=config.batch_size,
collate_fn=tacotron2_single_spk_batch_fn, collate_fn=collate_fn,
num_workers=config.num_workers) num_workers=config.num_workers)
print("dataloaders done!") print("dataloaders done!")
...@@ -166,6 +173,12 @@ def main(): ...@@ -166,6 +173,12 @@ def main():
parser.add_argument( parser.add_argument(
"--phones-dict", type=str, default=None, help="phone vocabulary file.") "--phones-dict", type=str, default=None, help="phone vocabulary file.")
parser.add_argument(
"--voice-cloning",
type=str2bool,
default=False,
help="whether training voice cloning model.")
args = parser.parse_args() args = parser.parse_args()
with open(args.config) as f: with open(args.config) as f:
......
...@@ -30,6 +30,7 @@ from paddlespeech.t2s.frontend.zh_frontend import Frontend ...@@ -30,6 +30,7 @@ from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.models.speedyspeech import SpeedySpeech from paddlespeech.t2s.models.speedyspeech import SpeedySpeech
from paddlespeech.t2s.models.speedyspeech import SpeedySpeechInference from paddlespeech.t2s.models.speedyspeech import SpeedySpeechInference
from paddlespeech.t2s.modules.normalizer import ZScore from paddlespeech.t2s.modules.normalizer import ZScore
from paddlespeech.t2s.utils import str2bool
def evaluate(args, speedyspeech_config): def evaluate(args, speedyspeech_config):
...@@ -213,9 +214,6 @@ def main(): ...@@ -213,9 +214,6 @@ def main():
parser.add_argument( parser.add_argument(
"--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.") "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
def str2bool(str):
return True if str.lower() == 'true' else False
parser.add_argument( parser.add_argument(
"--cut-sil", "--cut-sil",
type=str2bool, type=str2bool,
......
@@ -23,6 +23,7 @@ from sklearn.preprocessing import StandardScaler
from tqdm import tqdm

from paddlespeech.t2s.datasets.data_table import DataTable
from paddlespeech.t2s.utils import str2bool

def main():
@@ -55,9 +56,6 @@ def main():
        default=1,
        help="logging level. higher is more logging. (default=1)")
    parser.add_argument(
        "--use-relative-path",
        type=str2bool,
...
@@ -33,6 +33,7 @@ from paddlespeech.t2s.datasets.preprocess_utils import get_phn_dur
from paddlespeech.t2s.datasets.preprocess_utils import get_phones_tones
from paddlespeech.t2s.datasets.preprocess_utils import get_spk_id_map
from paddlespeech.t2s.datasets.preprocess_utils import merge_silence
from paddlespeech.t2s.utils import str2bool

def process_sentence(config: Dict[str, Any],
@@ -190,9 +191,6 @@ def main():
    parser.add_argument(
        "--num-cpu", type=int, default=1, help="number of process.")
    parser.add_argument(
        "--cut-sil",
        type=str2bool,
...
@@ -38,6 +38,7 @@ from paddlespeech.t2s.training.extensions.visualizer import VisualDL
from paddlespeech.t2s.training.optimizer import build_optimizers
from paddlespeech.t2s.training.seeding import seed_everything
from paddlespeech.t2s.training.trainer import Trainer
from paddlespeech.t2s.utils import str2bool

def train_sp(args, config):
@@ -186,9 +187,6 @@ def main():
    parser.add_argument(
        "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
    parser.add_argument(
        "--use-relative-path",
        type=str2bool,
...
@@ -25,6 +25,7 @@ from yacs.config import CfgNode
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
from paddlespeech.t2s.datasets.data_table import DataTable
from paddlespeech.t2s.modules.normalizer import ZScore
from paddlespeech.t2s.utils import str2bool

model_alias = {
    # acoustic model
@@ -97,6 +98,9 @@ def evaluate(args):
        fields = ["utt_id", "phones", "tones"]
    elif am_name == 'tacotron2':
        fields = ["utt_id", "text"]
        if args.voice_cloning:
            print("voice cloning!")
            fields += ["spk_emb"]

    test_dataset = DataTable(data=test_metadata, fields=fields)
@@ -178,7 +182,11 @@ def evaluate(args):
                mel = am_inference(phone_ids, tone_ids)
            elif am_name == 'tacotron2':
                phone_ids = paddle.to_tensor(datum["text"])
                spk_emb = None
                # multi speaker
                if args.voice_cloning and "spk_emb" in datum:
                    spk_emb = paddle.to_tensor(np.load(datum["spk_emb"]))
                mel = am_inference(phone_ids, spk_emb=spk_emb)
            # vocoder
            wav = voc_inference(mel)
        sf.write(
@@ -199,7 +207,8 @@ def main():
        default='fastspeech2_csmsc',
        choices=[
            'speedyspeech_csmsc', 'fastspeech2_csmsc', 'fastspeech2_ljspeech',
            'fastspeech2_aishell3', 'fastspeech2_vctk', 'tacotron2_csmsc',
            'tacotron2_ljspeech', 'tacotron2_aishell3'
        ],
        help='Choose acoustic model type of tts task.')
    parser.add_argument(
@@ -225,9 +234,6 @@ def main():
    parser.add_argument(
        "--speaker_dict", type=str, default=None, help="speaker id map file.")
    parser.add_argument(
        "--voice-cloning",
        type=str2bool,
...
@@ -59,6 +59,10 @@ model_alias = {
    "paddlespeech.t2s.models.hifigan:HiFiGANGenerator",
    "hifigan_inference":
    "paddlespeech.t2s.models.hifigan:HiFiGANInference",
    "wavernn":
    "paddlespeech.t2s.models.wavernn:WaveRNN",
    "wavernn_inference":
    "paddlespeech.t2s.models.wavernn:WaveRNNInference",
}
@@ -151,10 +155,16 @@ def evaluate(args):
    voc_name = args.voc[:args.voc.rindex('_')]
    voc_class = dynamic_import(voc_name, model_alias)
    voc_inference_class = dynamic_import(voc_name + '_inference', model_alias)
    if voc_name != 'wavernn':
        voc = voc_class(**voc_config["generator_params"])
        voc.set_state_dict(paddle.load(args.voc_ckpt)["generator_params"])
        voc.remove_weight_norm()
        voc.eval()
    else:
        voc = voc_class(**voc_config["model"])
        voc.set_state_dict(paddle.load(args.voc_ckpt)["main_params"])
        voc.eval()
    voc_mu, voc_std = np.load(args.voc_stat)
    voc_mu = paddle.to_tensor(voc_mu)
    voc_std = paddle.to_tensor(voc_std)
@@ -178,10 +188,7 @@ def evaluate(args):
                am_inference = jit.to_static(
                    am_inference,
                    input_spec=[InputSpec([-1], dtype=paddle.int64)])
        elif am_name == 'speedyspeech':
            if am_dataset in {"aishell3", "vctk"} and args.speaker_dict:
                am_inference = jit.to_static(
@@ -200,8 +207,11 @@ def evaluate(args):
                        InputSpec([-1], dtype=paddle.int64)
                    ])
        elif am_name == 'tacotron2':
            am_inference = jit.to_static(
                am_inference, input_spec=[InputSpec([-1], dtype=paddle.int64)])

        paddle.jit.save(am_inference, os.path.join(args.inference_dir, args.am))
        am_inference = paddle.jit.load(
            os.path.join(args.inference_dir, args.am))
@@ -285,7 +295,7 @@ def main():
        choices=[
            'speedyspeech_csmsc', 'speedyspeech_aishell3', 'fastspeech2_csmsc',
            'fastspeech2_ljspeech', 'fastspeech2_aishell3', 'fastspeech2_vctk',
            'tacotron2_csmsc', 'tacotron2_ljspeech'
        ],
        help='Choose acoustic model type of tts task.')
    parser.add_argument(
@@ -322,7 +332,8 @@ def main():
        default='pwgan_csmsc',
        choices=[
            'pwgan_csmsc', 'pwgan_ljspeech', 'pwgan_aishell3', 'pwgan_vctk',
            'mb_melgan_csmsc', 'style_melgan_csmsc', 'hifigan_csmsc',
            'wavernn_csmsc'
        ],
        help='Choose vocoder type of tts task.')
...
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from yacs.config import CfgNode as CN
_C = CN()
_C.data = CN(
dict(
batch_size=32, # batch size
valid_size=64, # the first N examples are reserved for validation
sample_rate=22050, # Hz, sample rate
n_fft=1024, # fft frame size
win_length=1024, # window size
hop_length=256, # hop size between adjacent frames
fmax=8000, # Hz, max frequency when converting to mel
fmin=0, # Hz, min frequency when converting to mel
n_mels=80, # mel bands
padding_idx=0, # text embedding's padding index
))
_C.model = CN(
dict(
vocab_size=37, # set this according to the frontend's vocab size
n_tones=None,
reduction_factor=1, # reduction factor
d_encoder=512, # embedding & encoder's internal size
encoder_conv_layers=3, # number of conv layer in tacotron2 encoder
encoder_kernel_size=5, # kernel size of conv layers in tacotron2 encoder
d_prenet=256, # hidden size of decoder prenet
d_attention_rnn=1024, # hidden size of the first rnn layer in tacotron2 decoder
d_decoder_rnn=1024, # hidden size of the second rnn layer in tacotron2 decoder
d_attention=128, # hidden size of decoder location linear layer
attention_filters=32, # number of filter in decoder location conv layer
attention_kernel_size=31, # kernel size of decoder location conv layer
d_postnet=512, # hidden size of decoder postnet
postnet_kernel_size=5, # kernel size of conv layers in postnet
postnet_conv_layers=5, # number of conv layer in decoder postnet
p_encoder_dropout=0.5, # dropout probability in encoder
p_prenet_dropout=0.5, # dropout probability in decoder prenet
p_attention_dropout=0.1, # dropout probability of first rnn layer in decoder
p_decoder_dropout=0.1, # dropout probability of second rnn layer in decoder
p_postnet_dropout=0.5, # dropout probability in decoder postnet
d_global_condition=None,
use_stop_token=True, # whether to use a binary classifier to predict when to stop
use_guided_attention_loss=False, # whether to use guided attention loss
guided_attention_loss_sigma=0.2 # sigma in guided attention loss
))
_C.training = CN(
dict(
lr=1e-3, # learning rate
weight_decay=1e-6, # the coeff of weight decay
grad_clip_thresh=1.0, # the clip norm of grad clip.
plot_interval=1000, # plot attention and spectrogram
valid_interval=1000, # validation
save_interval=1000, # checkpoint
max_iteration=500000, # max iteration to train
))
def get_cfg_defaults():
"""Get a yacs CfgNode object with default values for my_project."""
# Return a clone so that the defaults will not be altered
# This is for the "local variable" use pattern
return _C.clone()
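The get_cfg_defaults() pattern above is how every script in this diff builds its configuration: clone the defaults, then overlay a YAML file and/or KEY VALUE pairs. A minimal sketch of that flow (the "my_tacotron2.yaml" file name is hypothetical; the yacs calls are the same ones the scripts below use):

# A minimal sketch, assuming a hypothetical my_tacotron2.yaml with overriding keys.
from paddlespeech.t2s.exps.tacotron2.config import get_cfg_defaults

config = get_cfg_defaults()                     # clone of the defaults above
config.merge_from_file("my_tacotron2.yaml")     # YAML overrides, e.g. data.batch_size: 16
config.merge_from_list(["training.lr", 5e-4])   # the same KEY VALUE form that --opts forwards
config.freeze()                                 # lock the config before training
print(config.data.batch_size, config.training.lr)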
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import pickle
from pathlib import Path
import numpy as np
from paddle.io import Dataset
from paddlespeech.t2s.data.batch import batch_spec
from paddlespeech.t2s.data.batch import batch_text_id
class LJSpeech(Dataset):
"""A simple dataset adaptor for the processed ljspeech dataset."""
def __init__(self, root):
self.root = Path(root).expanduser()
records = []
with open(self.root / "metadata.pkl", 'rb') as f:
metadata = pickle.load(f)
for mel_name, text, ids in metadata:
mel_name = self.root / "mel" / (mel_name + ".npy")
records.append((mel_name, text, ids))
self.records = records
def __getitem__(self, i):
mel_name, _, ids = self.records[i]
mel = np.load(mel_name)
return ids, mel
def __len__(self):
return len(self.records)
class LJSpeechCollector(object):
"""A simple callable to batch LJSpeech examples."""
def __init__(self, padding_idx=0, padding_value=0., padding_stop_token=1.0):
self.padding_idx = padding_idx
self.padding_value = padding_value
self.padding_stop_token = padding_stop_token
def __call__(self, examples):
texts = []
mels = []
text_lens = []
mel_lens = []
for data in examples:
text, mel = data
text = np.array(text, dtype=np.int64)
text_lens.append(len(text))
mels.append(mel)
texts.append(text)
mel_lens.append(mel.shape[1])
# Sort by text_len in descending order
texts = [
i for i, _ in sorted(
zip(texts, text_lens), key=lambda x: x[1], reverse=True)
]
mels = [
i for i, _ in sorted(
zip(mels, text_lens), key=lambda x: x[1], reverse=True)
]
mel_lens = [
i for i, _ in sorted(
zip(mel_lens, text_lens), key=lambda x: x[1], reverse=True)
]
mel_lens = np.array(mel_lens, dtype=np.int64)
text_lens = np.array(sorted(text_lens, reverse=True), dtype=np.int64)
# Pad sequence with largest len of the batch
texts, _ = batch_text_id(texts, pad_id=self.padding_idx)
mels, _ = batch_spec(mels, pad_value=self.padding_value)
mels = np.transpose(mels, axes=(0, 2, 1))
return texts, mels, text_lens, mel_lens
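As a usage note, the collector above is designed to be dropped into a paddle DataLoader as collate_fn, which is what setup_dataloader() in the training script further down does; a minimal sketch, assuming a hypothetical preprocessed-data path:

# Sketch only: the dataset path is hypothetical; shapes follow the collector above.
from paddle.io import DataLoader

ljspeech = LJSpeech("~/datasets/LJSpeech-1.1/processed")   # hypothetical preprocessed root
collate_fn = LJSpeechCollector(padding_idx=0)
loader = DataLoader(
    ljspeech, batch_size=4, shuffle=True, drop_last=True, collate_fn=collate_fn)
texts, mels, text_lens, mel_lens = next(iter(loader))
# texts: (B, T_text) int64, mels: (B, T_mel, n_mels) float32, sorted by text length (descending)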
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import pickle
from pathlib import Path
import numpy as np
import tqdm
from paddlespeech.t2s.audio import AudioProcessor
from paddlespeech.t2s.audio import LogMagnitude
from paddlespeech.t2s.datasets import LJSpeechMetaData
from paddlespeech.t2s.exps.tacotron2.config import get_cfg_defaults
from paddlespeech.t2s.frontend import EnglishCharacter
def create_dataset(config, source_path, target_path, verbose=False):
# create output dir
target_path = Path(target_path).expanduser()
mel_path = target_path / "mel"
os.makedirs(mel_path, exist_ok=True)
meta_data = LJSpeechMetaData(source_path)
frontend = EnglishCharacter()
processor = AudioProcessor(
sample_rate=config.data.sample_rate,
n_fft=config.data.n_fft,
n_mels=config.data.n_mels,
win_length=config.data.win_length,
hop_length=config.data.hop_length,
fmax=config.data.fmax,
fmin=config.data.fmin)
normalizer = LogMagnitude()
records = []
for (fname, text, _) in tqdm.tqdm(meta_data):
wav = processor.read_wav(fname)
mel = processor.mel_spectrogram(wav)
mel = normalizer.transform(mel)
ids = frontend(text)
mel_name = os.path.splitext(os.path.basename(fname))[0]
# save mel spectrogram
records.append((mel_name, text, ids))
np.save(mel_path / mel_name, mel)
if verbose:
print("save mel spectrograms into {}".format(mel_path))
# save meta data as pickle archive
with open(target_path / "metadata.pkl", 'wb') as f:
pickle.dump(records, f)
if verbose:
print("saved metadata into {}".format(target_path / "metadata.pkl"))
print("Done.")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="create dataset")
parser.add_argument(
"--config",
type=str,
metavar="FILE",
help="extra config to overwrite the default config")
parser.add_argument(
"--input", type=str, help="path of the ljspeech dataset")
parser.add_argument(
"--output", type=str, help="path to save output dataset")
parser.add_argument(
"--opts",
nargs=argparse.REMAINDER,
help="options to overwrite --config file and the default config, passing in KEY VALUE pairs"
)
parser.add_argument(
"-v", "--verbose", action="store_true", help="print msg")
config = get_cfg_defaults()
args = parser.parse_args()
if args.config:
config.merge_from_file(args.config)
if args.opts:
config.merge_from_list(args.opts)
config.freeze()
print(config.data)
create_dataset(config, args.input, args.output, args.verbose)
(Source diff omitted here: the file is too large to display; view the blob instead.)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from pathlib import Path
import numpy as np
import paddle
from matplotlib import pyplot as plt
from paddlespeech.t2s.exps.tacotron2.config import get_cfg_defaults
from paddlespeech.t2s.frontend import EnglishCharacter
from paddlespeech.t2s.models.tacotron2 import Tacotron2
from paddlespeech.t2s.utils import display
def main(config, args):
if args.ngpu == 0:
paddle.set_device("cpu")
elif args.ngpu > 0:
paddle.set_device("gpu")
else:
print("ngpu should >= 0 !")
# model
frontend = EnglishCharacter()
model = Tacotron2.from_pretrained(config, args.checkpoint_path)
model.eval()
# inputs
input_path = Path(args.input).expanduser()
sentences = []
with open(input_path, "rt") as f:
for line in f:
line_list = line.strip().split()
utt_id = line_list[0]
sentence = " ".join(line_list[1:])
sentences.append((utt_id, sentence))
if args.output is None:
output_dir = input_path.parent / "synthesis"
else:
output_dir = Path(args.output).expanduser()
output_dir.mkdir(exist_ok=True)
for i, sentence in enumerate(sentences):
sentence = paddle.to_tensor(frontend(sentence)).unsqueeze(0)
outputs = model.infer(sentence)
mel_output = outputs["mel_outputs_postnet"][0].numpy().T
alignment = outputs["alignments"][0].numpy().T
np.save(str(output_dir / f"sentence_{i}"), mel_output)
display.plot_alignment(alignment)
plt.savefig(str(output_dir / f"sentence_{i}.png"))
if args.verbose:
print("spectrogram saved at {}".format(output_dir /
f"sentence_{i}.npy"))
if __name__ == "__main__":
config = get_cfg_defaults()
parser = argparse.ArgumentParser(
description="generate mel spectrogram with TransformerTTS.")
parser.add_argument(
"--config",
type=str,
metavar="FILE",
help="extra config to overwrite the default config")
parser.add_argument(
"--checkpoint_path", type=str, help="path of the checkpoint to load.")
parser.add_argument("--input", type=str, help="path of the text sentences")
parser.add_argument("--output", type=str, help="path to save outputs")
parser.add_argument(
"--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
parser.add_argument(
"--opts",
nargs=argparse.REMAINDER,
help="options to overwrite --config file and the default config, passing in KEY VALUE pairs"
)
parser.add_argument(
"-v", "--verbose", action="store_true", help="print msg")
args = parser.parse_args()
if args.config:
config.merge_from_file(args.config)
if args.opts:
config.merge_from_list(args.opts)
config.freeze()
print(config)
print(args)
main(config, args)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import time
from collections import defaultdict
import numpy as np
import paddle
from paddle import distributed as dist
from paddle.io import DataLoader
from paddle.io import DistributedBatchSampler
from paddlespeech.t2s.data import dataset
from paddlespeech.t2s.exps.tacotron2.config import get_cfg_defaults
from paddlespeech.t2s.exps.tacotron2.ljspeech import LJSpeech
from paddlespeech.t2s.exps.tacotron2.ljspeech import LJSpeechCollector
from paddlespeech.t2s.models.tacotron2 import Tacotron2
from paddlespeech.t2s.models.tacotron2 import Tacotron2Loss
from paddlespeech.t2s.training.cli import default_argument_parser
from paddlespeech.t2s.training.experiment import ExperimentBase
from paddlespeech.t2s.utils import display
from paddlespeech.t2s.utils import mp_tools
class Experiment(ExperimentBase):
def compute_losses(self, inputs, outputs):
texts, mel_targets, plens, slens = inputs
mel_outputs = outputs["mel_output"]
mel_outputs_postnet = outputs["mel_outputs_postnet"]
attention_weight = outputs["alignments"]
if self.config.model.use_stop_token:
stop_logits = outputs["stop_logits"]
else:
stop_logits = None
losses = self.criterion(mel_outputs, mel_outputs_postnet, mel_targets,
attention_weight, slens, plens, stop_logits)
return losses
def train_batch(self):
start = time.time()
batch = self.read_batch()
data_loader_time = time.time() - start
self.optimizer.clear_grad()
self.model.train()
texts, mels, text_lens, output_lens = batch
outputs = self.model(texts, text_lens, mels, output_lens)
losses = self.compute_losses(batch, outputs)
loss = losses["loss"]
loss.backward()
self.optimizer.step()
iteration_time = time.time() - start
losses_np = {k: float(v) for k, v in losses.items()}
# logging
msg = "Rank: {}, ".format(dist.get_rank())
msg += "step: {}, ".format(self.iteration)
msg += "time: {:>.3f}s/{:>.3f}s, ".format(data_loader_time,
iteration_time)
msg += ', '.join('{}: {:>.6f}'.format(k, v)
for k, v in losses_np.items())
self.logger.info(msg)
if dist.get_rank() == 0:
for k, v in losses_np.items():
self.visualizer.add_scalar(f"train_loss/{k}", v, self.iteration)
@mp_tools.rank_zero_only
@paddle.no_grad()
def valid(self):
valid_losses = defaultdict(list)
for i, batch in enumerate(self.valid_loader):
texts, mels, text_lens, output_lens = batch
outputs = self.model(texts, text_lens, mels, output_lens)
losses = self.compute_losses(batch, outputs)
for k, v in losses.items():
valid_losses[k].append(float(v))
attention_weights = outputs["alignments"]
self.visualizer.add_figure(
f"valid_sentence_{i}_alignments",
display.plot_alignment(attention_weights[0].numpy().T),
self.iteration)
self.visualizer.add_figure(
f"valid_sentence_{i}_target_spectrogram",
display.plot_spectrogram(mels[0].numpy().T), self.iteration)
self.visualizer.add_figure(
f"valid_sentence_{i}_predicted_spectrogram",
display.plot_spectrogram(outputs['mel_outputs_postnet'][0]
.numpy().T), self.iteration)
# write visual log
valid_losses = {k: np.mean(v) for k, v in valid_losses.items()}
# logging
msg = "Valid: "
msg += "step: {}, ".format(self.iteration)
msg += ', '.join('{}: {:>.6f}'.format(k, v)
for k, v in valid_losses.items())
self.logger.info(msg)
for k, v in valid_losses.items():
self.visualizer.add_scalar(f"valid/{k}", v, self.iteration)
def setup_model(self):
config = self.config
model = Tacotron2(
vocab_size=config.model.vocab_size,
d_mels=config.data.n_mels,
d_encoder=config.model.d_encoder,
encoder_conv_layers=config.model.encoder_conv_layers,
encoder_kernel_size=config.model.encoder_kernel_size,
d_prenet=config.model.d_prenet,
d_attention_rnn=config.model.d_attention_rnn,
d_decoder_rnn=config.model.d_decoder_rnn,
attention_filters=config.model.attention_filters,
attention_kernel_size=config.model.attention_kernel_size,
d_attention=config.model.d_attention,
d_postnet=config.model.d_postnet,
postnet_kernel_size=config.model.postnet_kernel_size,
postnet_conv_layers=config.model.postnet_conv_layers,
reduction_factor=config.model.reduction_factor,
p_encoder_dropout=config.model.p_encoder_dropout,
p_prenet_dropout=config.model.p_prenet_dropout,
p_attention_dropout=config.model.p_attention_dropout,
p_decoder_dropout=config.model.p_decoder_dropout,
p_postnet_dropout=config.model.p_postnet_dropout,
use_stop_token=config.model.use_stop_token)
if self.parallel:
model = paddle.DataParallel(model)
grad_clip = paddle.nn.ClipGradByGlobalNorm(
config.training.grad_clip_thresh)
optimizer = paddle.optimizer.Adam(
learning_rate=config.training.lr,
parameters=model.parameters(),
weight_decay=paddle.regularizer.L2Decay(
config.training.weight_decay),
grad_clip=grad_clip)
criterion = Tacotron2Loss(
use_stop_token_loss=config.model.use_stop_token,
use_guided_attention_loss=config.model.use_guided_attention_loss,
sigma=config.model.guided_attention_loss_sigma)
self.model = model
self.optimizer = optimizer
self.criterion = criterion
def setup_dataloader(self):
args = self.args
config = self.config
ljspeech_dataset = LJSpeech(args.data)
valid_set, train_set = dataset.split(ljspeech_dataset,
config.data.valid_size)
batch_fn = LJSpeechCollector(padding_idx=config.data.padding_idx)
if not self.parallel:
self.train_loader = DataLoader(
train_set,
batch_size=config.data.batch_size,
shuffle=True,
drop_last=True,
collate_fn=batch_fn)
else:
sampler = DistributedBatchSampler(
train_set,
batch_size=config.data.batch_size,
shuffle=True,
drop_last=True)
self.train_loader = DataLoader(
train_set, batch_sampler=sampler, collate_fn=batch_fn)
self.valid_loader = DataLoader(
valid_set,
batch_size=config.data.batch_size,
shuffle=False,
drop_last=False,
collate_fn=batch_fn)
def main_sp(config, args):
exp = Experiment(config, args)
exp.setup()
exp.resume_or_load()
exp.run()
def main(config, args):
if args.ngpu > 1:
dist.spawn(main_sp, args=(config, args), nprocs=args.ngpu)
else:
main_sp(config, args)
if __name__ == "__main__":
config = get_cfg_defaults()
parser = default_argument_parser()
args = parser.parse_args()
if args.config:
config.merge_from_file(args.config)
if args.opts:
config.merge_from_list(args.opts)
config.freeze()
print(config)
print(args)
main(config, args)
@@ -130,6 +130,9 @@ def main():
            "speech_lengths": item['speech_lengths'],
            "speech": str(speech_path),
        }
        # add spk_emb for voice cloning
        if "spk_emb" in item:
            record["spk_emb"] = str(item["spk_emb"])
        output_metadata.append(record)
    output_metadata.sort(key=itemgetter('utt_id'))
    output_metadata_path = Path(args.dumpdir) / "metadata.jsonl"
...
@@ -21,17 +21,43 @@ import soundfile as sf
import yaml
from yacs.config import CfgNode

from paddlespeech.s2t.utils.dynamic_import import dynamic_import
from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.modules.normalizer import ZScore
from paddlespeech.vector.exps.ge2e.audio_processor import SpeakerVerificationPreprocessor
from paddlespeech.vector.models.lstm_speaker_encoder import LSTMSpeakerEncoder
model_alias = {
# acoustic model
"fastspeech2":
"paddlespeech.t2s.models.fastspeech2:FastSpeech2",
"fastspeech2_inference":
"paddlespeech.t2s.models.fastspeech2:FastSpeech2Inference",
"tacotron2":
"paddlespeech.t2s.models.new_tacotron2:Tacotron2",
"tacotron2_inference":
"paddlespeech.t2s.models.new_tacotron2:Tacotron2Inference",
# voc
"pwgan":
"paddlespeech.t2s.models.parallel_wavegan:PWGGenerator",
"pwgan_inference":
"paddlespeech.t2s.models.parallel_wavegan:PWGInference",
}
def voice_cloning(args):
# Init body.
with open(args.am_config) as f:
am_config = CfgNode(yaml.safe_load(f))
with open(args.voc_config) as f:
voc_config = CfgNode(yaml.safe_load(f))
print("========Args========")
print(yaml.safe_dump(vars(args)))
print("========Config========")
print(am_config)
print(voc_config)
    # speaker encoder
    p = SpeakerVerificationPreprocessor(
        sampling_rate=16000,
@@ -57,40 +83,52 @@ def voice_cloning(args, fastspeech2_config, pwg_config):
        phn_id = [line.strip().split() for line in f.readlines()]
    vocab_size = len(phn_id)
    print("vocab_size:", vocab_size)

    # acoustic model
    odim = am_config.n_mels
    # model: {model_name}_{dataset}
    am_name = args.am[:args.am.rindex('_')]
    am_dataset = args.am[args.am.rindex('_') + 1:]

    am_class = dynamic_import(am_name, model_alias)
    am_inference_class = dynamic_import(am_name + '_inference', model_alias)
if am_name == 'fastspeech2':
am = am_class(
idim=vocab_size, odim=odim, spk_num=None, **am_config["model"])
elif am_name == 'tacotron2':
am = am_class(idim=vocab_size, odim=odim, **am_config["model"])
am.set_state_dict(paddle.load(args.am_ckpt)["main_params"])
am.eval()
am_mu, am_std = np.load(args.am_stat)
am_mu = paddle.to_tensor(am_mu)
am_std = paddle.to_tensor(am_std)
am_normalizer = ZScore(am_mu, am_std)
am_inference = am_inference_class(am_normalizer, am)
am_inference.eval()
print("acoustic model done!")
# vocoder
# model: {model_name}_{dataset}
voc_name = args.voc[:args.voc.rindex('_')]
voc_class = dynamic_import(voc_name, model_alias)
voc_inference_class = dynamic_import(voc_name + '_inference', model_alias)
voc = voc_class(**voc_config["generator_params"])
voc.set_state_dict(paddle.load(args.voc_ckpt)["generator_params"])
voc.remove_weight_norm()
voc.eval()
voc_mu, voc_std = np.load(args.voc_stat)
voc_mu = paddle.to_tensor(voc_mu)
voc_std = paddle.to_tensor(voc_std)
voc_normalizer = ZScore(voc_mu, voc_std)
voc_inference = voc_inference_class(voc_normalizer, voc)
voc_inference.eval()
print("voc done!")
    frontend = Frontend(phone_vocab_path=args.phones_dict)
    print("frontend done!")
    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
@@ -112,24 +150,23 @@ def voice_cloning(args, fastspeech2_config, pwg_config):
        # print("spk_emb shape: ", spk_emb.shape)

        with paddle.no_grad():
            wav = voc_inference(am_inference(phone_ids, spk_emb=spk_emb))

        sf.write(
            str(output_dir / (utt_id + ".wav")),
            wav.numpy(),
            samplerate=am_config.fs)
        print(f"{utt_id} done!")

    # Randomly generate numbers of 0 ~ 0.2, 256 is the dim of spk_emb
    random_spk_emb = np.random.rand(256) * 0.2
    random_spk_emb = paddle.to_tensor(random_spk_emb)
    utt_id = "random_spk_emb"
    with paddle.no_grad():
        wav = voc_inference(am_inference(phone_ids, spk_emb=random_spk_emb))
    sf.write(
        str(output_dir / (utt_id + ".wav")),
        wav.numpy(),
        samplerate=am_config.fs)
    print(f"{utt_id} done!")
@@ -137,32 +174,53 @@ def main():
    # parse args and config and redirect to train_sp
    parser = argparse.ArgumentParser(description="")
    parser.add_argument(
        '--am',
        type=str,
        default='fastspeech2_csmsc',
        choices=['fastspeech2_aishell3', 'tacotron2_aishell3'],
        help='Choose acoustic model type of tts task.')
    parser.add_argument(
        '--am_config',
        type=str,
        default=None,
        help='Config of acoustic model. Use default config when it is None.')
    parser.add_argument(
        '--am_ckpt',
        type=str,
        default=None,
        help='Checkpoint file of acoustic model.')
    parser.add_argument(
        "--am_stat",
        type=str,
        default=None,
        help="mean and standard deviation used to normalize spectrogram when training acoustic model."
    )
    parser.add_argument(
        "--phones-dict",
        type=str,
        default="phone_id_map.txt",
        help="phone vocabulary file.")
    # vocoder
    parser.add_argument(
        '--voc',
        type=str,
        default='pwgan_csmsc',
        choices=['pwgan_aishell3'],
        help='Choose vocoder type of tts task.')
    parser.add_argument(
        '--voc_config',
        type=str,
        default=None,
        help='Config of voc. Use default config when it is None.')
    parser.add_argument(
        '--voc_ckpt', type=str, default=None, help='Checkpoint file of voc.')
    parser.add_argument(
        "--voc_stat",
        type=str,
        default=None,
        help="mean and standard deviation used to normalize spectrogram when training voc."
    )
    parser.add_argument(
        "--text",
        type=str,
@@ -190,18 +248,7 @@ def main():
    else:
        print("ngpu should >= 0 !")

    voice_cloning(args)


if __name__ == "__main__":
...
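The refactor above drops the hard-coded FastSpeech2/PWGAN classes in favour of the model_alias registry plus dynamic_import; a minimal sketch of how an --am value such as tacotron2_aishell3 (one of the allowed choices) is resolved:

# Sketch only: mirrors the "{model_name}_{dataset}" split performed in voice_cloning().
am = "tacotron2_aishell3"
am_name = am[:am.rindex('_')]            # "tacotron2"
am_dataset = am[am.rindex('_') + 1:]     # "aishell3"
am_class = dynamic_import(am_name, model_alias)                           # Tacotron2
am_inference_class = dynamic_import(am_name + '_inference', model_alias)  # Tacotron2Inference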
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import pickle
from pathlib import Path
import numpy as np
from paddle.io import Dataset
from paddlespeech.t2s.data import batch_spec
from paddlespeech.t2s.data import batch_text_id
from paddlespeech.t2s.exps.voice_cloning.tacotron2_ge2e.preprocess_transcription import _phones
from paddlespeech.t2s.exps.voice_cloning.tacotron2_ge2e.preprocess_transcription import _tones
from paddlespeech.t2s.frontend import Vocab
voc_phones = Vocab(sorted(list(_phones)))
print("vocab_phones:\n", voc_phones)
voc_tones = Vocab(sorted(list(_tones)))
print("vocab_tones:\n", voc_tones)
class AiShell3(Dataset):
"""Processed AiShell3 dataset."""
def __init__(self, root):
super().__init__()
self.root = Path(root).expanduser()
self.embed_dir = self.root / "embed"
self.mel_dir = self.root / "mel"
with open(self.root / "metadata.pickle", 'rb') as f:
self.records = pickle.load(f)
def __getitem__(self, index):
metadatum = self.records[index]
sentence_id = metadatum["sentence_id"]
speaker_id = sentence_id[:7]
phones = metadatum["phones"]
tones = metadatum["tones"]
phones = np.array(
[voc_phones.lookup(item) for item in phones], dtype=np.int64)
tones = np.array(
[voc_tones.lookup(item) for item in tones], dtype=np.int64)
mel = np.load(str(self.mel_dir / speaker_id / (sentence_id + ".npy")))
embed = np.load(
str(self.embed_dir / speaker_id / (sentence_id + ".npy")))
return phones, tones, mel, embed
def __len__(self):
return len(self.records)
def collate_aishell3_examples(examples):
phones, tones, mel, embed = list(zip(*examples))
text_lengths = np.array([item.shape[0] for item in phones], dtype=np.int64)
spec_lengths = np.array([item.shape[1] for item in mel], dtype=np.int64)
T_dec = np.max(spec_lengths)
stop_tokens = (
np.arange(T_dec) >= np.expand_dims(spec_lengths, -1)).astype(np.float32)
phones, _ = batch_text_id(phones)
tones, _ = batch_text_id(tones)
mel, _ = batch_spec(mel)
mel = np.transpose(mel, (0, 2, 1))
embed = np.stack(embed)
# 7 fields
# (B, T), (B, T), (B, T, C), (B, C), (B,), (B,), (B, T)
return phones, tones, mel, embed, text_lengths, spec_lengths, stop_tokens
if __name__ == "__main__":
dataset = AiShell3("~/datasets/aishell3/train")
example = dataset[0]
examples = [dataset[i] for i in range(10)]
batch = collate_aishell3_examples(examples)
for field in batch:
print(field.shape, field.dtype)
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import List
from typing import Tuple
from pypinyin import lazy_pinyin
from pypinyin import Style
from paddlespeech.t2s.exps.voice_cloning.tacotron2_ge2e.preprocess_transcription import split_syllable
def convert_to_pinyin(text: str) -> List[str]:
"""convert text into list of syllables, other characters that are not chinese, thus
cannot be converted to pinyin are splited.
"""
syllables = lazy_pinyin(
text, style=Style.TONE3, neutral_tone_with_five=True)
return syllables
def convert_sentence(text: str) -> List[Tuple[str]]:
"""convert a sentence into two list: phones and tones"""
syllables = convert_to_pinyin(text)
phones = []
tones = []
for syllable in syllables:
p, t = split_syllable(syllable)
phones.extend(p)
tones.extend(t)
return phones, tones
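For reference, a small example of the pypinyin output that convert_sentence() consumes before split_syllable() breaks each syllable into phones and a tone (TONE3 appends the tone digit, and neutral tones become 5):

# A small example of the lazy_pinyin call used in convert_to_pinyin() above.
from pypinyin import Style, lazy_pinyin

print(lazy_pinyin("你好", style=Style.TONE3, neutral_tone_with_five=True))
# ['ni3', 'hao3'] -> split_syllable() then yields the phones and the tone of each syllable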
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from yacs.config import CfgNode as CN
_C = CN()
_C.data = CN(
dict(
batch_size=32, # batch size
valid_size=64, # the first N examples are reserved for validation
sample_rate=22050, # Hz, sample rate
n_fft=1024, # fft frame size
win_length=1024, # window size
hop_length=256, # hop size between adjacent frames
fmax=8000, # Hz, max frequency when converting to mel
fmin=0, # Hz, min frequency when converting to mel
d_mels=80, # mel bands
padding_idx=0, # text embedding's padding index
))
_C.model = CN(
dict(
vocab_size=70,
n_tones=10,
reduction_factor=1, # reduction factor
d_encoder=512, # embedding & encoder's internal size
encoder_conv_layers=3, # number of conv layer in tacotron2 encoder
encoder_kernel_size=5, # kernel size of conv layers in tacotron2 encoder
d_prenet=256, # hidden size of decoder prenet
# hidden size of the first rnn layer in tacotron2 decoder
d_attention_rnn=1024,
# hidden size of the second rnn layer in tacotron2 decoder
d_decoder_rnn=1024,
d_attention=128, # hidden size of decoder location linear layer
attention_filters=32, # number of filter in decoder location conv layer
attention_kernel_size=31, # kernel size of decoder location conv layer
d_postnet=512, # hidden size of decoder postnet
postnet_kernel_size=5, # kernel size of conv layers in postnet
postnet_conv_layers=5, # number of conv layer in decoder postnet
p_encoder_dropout=0.5, # dropout probability in encoder
p_prenet_dropout=0.5, # dropout probability in decoder prenet
# dropout probability of first rnn layer in decoder
p_attention_dropout=0.1,
# dropout probability of second rnn layer in decoder
p_decoder_dropout=0.1,
p_postnet_dropout=0.5, # dropout probability in decoder postnet
guided_attention_loss_sigma=0.2,
d_global_condition=256,
# whether to use a classifier to predict stop probability
use_stop_token=False,
# whether to use guided attention loss in training
use_guided_attention_loss=True, ))
_C.training = CN(
dict(
lr=1e-3, # learning rate
weight_decay=1e-6, # the coeff of weight decay
grad_clip_thresh=1.0, # the clip norm of grad clip.
valid_interval=1000, # validation
save_interval=1000, # checkpoint
max_iteration=500000, # max iteration to train
))
def get_cfg_defaults():
"""Get a yacs CfgNode object with default values for my_project."""
# Return a clone so that the defaults will not be altered
# This is for the "local variable" use pattern
return _C.clone()
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import multiprocessing as mp
from functools import partial
from pathlib import Path
import numpy as np
import tqdm
from paddlespeech.t2s.audio import AudioProcessor
from paddlespeech.t2s.audio.spec_normalizer import LogMagnitude
from paddlespeech.t2s.audio.spec_normalizer import NormalizerBase
from paddlespeech.t2s.exps.voice_cloning.tacotron2_ge2e.config import get_cfg_defaults
def extract_mel(fname: Path,
input_dir: Path,
output_dir: Path,
p: AudioProcessor,
n: NormalizerBase):
relative_path = fname.relative_to(input_dir)
out_path = (output_dir / relative_path).with_suffix(".npy")
out_path.parent.mkdir(parents=True, exist_ok=True)
wav = p.read_wav(fname)
mel = p.mel_spectrogram(wav)
mel = n.transform(mel)
np.save(out_path, mel)
def extract_mel_multispeaker(config, input_dir, output_dir, extension=".wav"):
input_dir = Path(input_dir).expanduser()
fnames = list(input_dir.rglob(f"*{extension}"))
output_dir = Path(output_dir).expanduser()
output_dir.mkdir(parents=True, exist_ok=True)
p = AudioProcessor(config.sample_rate, config.n_fft, config.win_length,
config.hop_length, config.d_mels, config.fmin,
config.fmax)
n = LogMagnitude(1e-5)
func = partial(
extract_mel, input_dir=input_dir, output_dir=output_dir, p=p, n=n)
with mp.Pool(16) as pool:
list(
tqdm.tqdm(
pool.imap(func, fnames), total=len(fnames), unit="utterance"))
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Extract mel spectrogram from processed wav in AiShell3 training dataset."
)
parser.add_argument(
"--config",
type=str,
help="yaml config file to overwrite the default config")
parser.add_argument(
"--input",
type=str,
default="~/datasets/aishell3/train/normalized_wav",
help="path of the processed wav folder")
parser.add_argument(
"--output",
type=str,
default="~/datasets/aishell3/train/mel",
help="path of the folder to save mel spectrograms")
parser.add_argument(
"--opts",
nargs=argparse.REMAINDER,
help="options to overwrite --config file and the default config, passing in KEY VALUE pairs"
)
default_config = get_cfg_defaults()
args = parser.parse_args()
if args.config:
default_config.merge_from_file(args.config)
if args.opts:
default_config.merge_from_list(args.opts)
default_config.freeze()
audio_config = default_config.data
extract_mel_multispeaker(audio_config, args.input, args.output)
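A minimal sketch of calling the extractor directly from within this module (bypassing argparse), using the same default AiShell3 paths as the --input/--output arguments above:

# Sketch only: assumes it runs in this script's namespace, where
# extract_mel_multispeaker() is already defined.
audio_config = get_cfg_defaults().data
extract_mel_multispeaker(
    audio_config,
    "~/datasets/aishell3/train/normalized_wav",
    "~/datasets/aishell3/train/mel")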
zhi1 zh iii1
zhi2 zh iii2
zhi3 zh iii3
zhi4 zh iii4
zhi5 zh iii5
chi1 ch iii1
chi2 ch iii2
chi3 ch iii3
chi4 ch iii4
chi5 ch iii5
shi1 sh iii1
shi2 sh iii2
shi3 sh iii3
shi4 sh iii4
shi5 sh iii5
ri1 r iii1
ri2 r iii2
ri3 r iii3
ri4 r iii4
ri5 r iii5
zi1 z ii1
zi2 z ii2
zi3 z ii3
zi4 z ii4
zi5 z ii5
ci1 c ii1
ci2 c ii2
ci3 c ii3
ci4 c ii4
ci5 c ii5
si1 s ii1
si2 s ii2
si3 s ii3
si4 s ii4
si5 s ii5
a1 a1
a2 a2
a3 a3
a4 a4
a5 a5
ba1 b a1
ba2 b a2
ba3 b a3
ba4 b a4
ba5 b a5
pa1 p a1
pa2 p a2
pa3 p a3
pa4 p a4
pa5 p a5
ma1 m a1
ma2 m a2
ma3 m a3
ma4 m a4
ma5 m a5
fa1 f a1
fa2 f a2
fa3 f a3
fa4 f a4
fa5 f a5
da1 d a1
da2 d a2
da3 d a3
da4 d a4
da5 d a5
ta1 t a1
ta2 t a2
ta3 t a3
ta4 t a4
ta5 t a5
na1 n a1
na2 n a2
na3 n a3
na4 n a4
na5 n a5
la1 l a1
la2 l a2
la3 l a3
la4 l a4
la5 l a5
ga1 g a1
ga2 g a2
ga3 g a3
ga4 g a4
ga5 g a5
ka1 k a1
ka2 k a2
ka3 k a3
ka4 k a4
ka5 k a5
ha1 h a1
ha2 h a2
ha3 h a3
ha4 h a4
ha5 h a5
zha1 zh a1
zha2 zh a2
zha3 zh a3
zha4 zh a4
zha5 zh a5
cha1 ch a1
cha2 ch a2
cha3 ch a3
cha4 ch a4
cha5 ch a5
sha1 sh a1
sha2 sh a2
sha3 sh a3
sha4 sh a4
sha5 sh a5
za1 z a1
za2 z a2
za3 z a3
za4 z a4
za5 z a5
ca1 c a1
ca2 c a2
ca3 c a3
ca4 c a4
ca5 c a5
sa1 s a1
sa2 s a2
sa3 s a3
sa4 s a4
sa5 s a5
o1 o1
o2 o2
o3 o3
o4 o4
o5 o5
bo1 b uo1
bo2 b uo2
bo3 b uo3
bo4 b uo4
bo5 b uo5
po1 p uo1
po2 p uo2
po3 p uo3
po4 p uo4
po5 p uo5
mo1 m uo1
mo2 m uo2
mo3 m uo3
mo4 m uo4
mo5 m uo5
fo1 f uo1
fo2 f uo2
fo3 f uo3
fo4 f uo4
fo5 f uo5
lo1 l o1
lo2 l o2
lo3 l o3
lo4 l o4
lo5 l o5
e1 e1
e2 e2
e3 e3
e4 e4
e5 e5
me1 m e1
me2 m e2
me3 m e3
me4 m e4
me5 m e5
de1 d e1
de2 d e2
de3 d e3
de4 d e4
de5 d e5
te1 t e1
te2 t e2
te3 t e3
te4 t e4
te5 t e5
ne1 n e1
ne2 n e2
ne3 n e3
ne4 n e4
ne5 n e5
le1 l e1
le2 l e2
le3 l e3
le4 l e4
le5 l e5
ge1 g e1
ge2 g e2
ge3 g e3
ge4 g e4
ge5 g e5
ke1 k e1
ke2 k e2
ke3 k e3
ke4 k e4
ke5 k e5
he1 h e1
he2 h e2
he3 h e3
he4 h e4
he5 h e5
zhe1 zh e1
zhe2 zh e2
zhe3 zh e3
zhe4 zh e4
zhe5 zh e5
che1 ch e1
che2 ch e2
che3 ch e3
che4 ch e4
che5 ch e5
she1 sh e1
she2 sh e2
she3 sh e3
she4 sh e4
she5 sh e5
re1 r e1
re2 r e2
re3 r e3
re4 r e4
re5 r e5
ze1 z e1
ze2 z e2
ze3 z e3
ze4 z e4
ze5 z e5
ce1 c e1
ce2 c e2
ce3 c e3
ce4 c e4
ce5 c e5
se1 s e1
se2 s e2
se3 s e3
se4 s e4
se5 s e5
ea1 ea1
ea2 ea2
ea3 ea3
ea4 ea4
ea5 ea5
ai1 ai1
ai2 ai2
ai3 ai3
ai4 ai4
ai5 ai5
bai1 b ai1
bai2 b ai2
bai3 b ai3
bai4 b ai4
bai5 b ai5
pai1 p ai1
pai2 p ai2
pai3 p ai3
pai4 p ai4
pai5 p ai5
mai1 m ai1
mai2 m ai2
mai3 m ai3
mai4 m ai4
mai5 m ai5
dai1 d ai1
dai2 d ai2
dai3 d ai3
dai4 d ai4
dai5 d ai5
tai1 t ai1
tai2 t ai2
tai3 t ai3
tai4 t ai4
tai5 t ai5
nai1 n ai1
nai2 n ai2
nai3 n ai3
nai4 n ai4
nai5 n ai5
lai1 l ai1
lai2 l ai2
lai3 l ai3
lai4 l ai4
lai5 l ai5
gai1 g ai1
gai2 g ai2
gai3 g ai3
gai4 g ai4
gai5 g ai5
kai1 k ai1
kai2 k ai2
kai3 k ai3
kai4 k ai4
kai5 k ai5
hai1 h ai1
hai2 h ai2
hai3 h ai3
hai4 h ai4
hai5 h ai5
zhai1 zh ai1
zhai2 zh ai2
zhai3 zh ai3
zhai4 zh ai4
zhai5 zh ai5
chai1 ch ai1
chai2 ch ai2
chai3 ch ai3
chai4 ch ai4
chai5 ch ai5
shai1 sh ai1
shai2 sh ai2
shai3 sh ai3
shai4 sh ai4
shai5 sh ai5
zai1 z ai1
zai2 z ai2
zai3 z ai3
zai4 z ai4
zai5 z ai5
cai1 c ai1
cai2 c ai2
cai3 c ai3
cai4 c ai4
cai5 c ai5
sai1 s ai1
sai2 s ai2
sai3 s ai3
sai4 s ai4
sai5 s ai5
ei1 ei1
ei2 ei2
ei3 ei3
ei4 ei4
ei5 ei5
bei1 b ei1
bei2 b ei2
bei3 b ei3
bei4 b ei4
bei5 b ei5
pei1 p ei1
pei2 p ei2
pei3 p ei3
pei4 p ei4
pei5 p ei5
mei1 m ei1
mei2 m ei2
mei3 m ei3
mei4 m ei4
mei5 m ei5
fei1 f ei1
fei2 f ei2
fei3 f ei3
fei4 f ei4
fei5 f ei5
dei1 d ei1
dei2 d ei2
dei3 d ei3
dei4 d ei4
dei5 d ei5
tei1 t ei1
tei2 t ei2
tei3 t ei3
tei4 t ei4
tei5 t ei5
nei1 n ei1
nei2 n ei2
nei3 n ei3
nei4 n ei4
nei5 n ei5
lei1 l ei1
lei2 l ei2
lei3 l ei3
lei4 l ei4
lei5 l ei5
gei1 g ei1
gei2 g ei2
gei3 g ei3
gei4 g ei4
gei5 g ei5
kei1 k ei1
kei2 k ei2
kei3 k ei3
kei4 k ei4
kei5 k ei5
hei1 h ei1
hei2 h ei2
hei3 h ei3
hei4 h ei4
hei5 h ei5
zhei1 zh ei1
zhei2 zh ei2
zhei3 zh ei3
zhei4 zh ei4
zhei5 zh ei5
shei1 sh ei1
shei2 sh ei2
shei3 sh ei3
shei4 sh ei4
shei5 sh ei5
zei1 z ei1
zei2 z ei2
zei3 z ei3
zei4 z ei4
zei5 z ei5
ao1 au1
ao2 au2
ao3 au3
ao4 au4
ao5 au5
bao1 b au1
bao2 b au2
bao3 b au3
bao4 b au4
bao5 b au5
pao1 p au1
pao2 p au2
pao3 p au3
pao4 p au4
pao5 p au5
mao1 m au1
mao2 m au2
mao3 m au3
mao4 m au4
mao5 m au5
dao1 d au1
dao2 d au2
dao3 d au3
dao4 d au4
dao5 d au5
tao1 t au1
tao2 t au2
tao3 t au3
tao4 t au4
tao5 t au5
nao1 n au1
nao2 n au2
nao3 n au3
nao4 n au4
nao5 n au5
lao1 l au1
lao2 l au2
lao3 l au3
lao4 l au4
lao5 l au5
gao1 g au1
gao2 g au2
gao3 g au3
gao4 g au4
gao5 g au5
kao1 k au1
kao2 k au2
kao3 k au3
kao4 k au4
kao5 k au5
hao1 h au1
hao2 h au2
hao3 h au3
hao4 h au4
hao5 h au5
zhao1 zh au1
zhao2 zh au2
zhao3 zh au3
zhao4 zh au4
zhao5 zh au5
chao1 ch au1
chao2 ch au2
chao3 ch au3
chao4 ch au4
chao5 ch au5
shao1 sh au1
shao2 sh au2
shao3 sh au3
shao4 sh au4
shao5 sh au5
rao1 r au1
rao2 r au2
rao3 r au3
rao4 r au4
rao5 r au5
zao1 z au1
zao2 z au2
zao3 z au3
zao4 z au4
zao5 z au5
cao1 c au1
cao2 c au2
cao3 c au3
cao4 c au4
cao5 c au5
sao1 s au1
sao2 s au2
sao3 s au3
sao4 s au4
sao5 s au5
ou1 ou1
ou2 ou2
ou3 ou3
ou4 ou4
ou5 ou5
pou1 p ou1
pou2 p ou2
pou3 p ou3
pou4 p ou4
pou5 p ou5
mou1 m ou1
mou2 m ou2
mou3 m ou3
mou4 m ou4
mou5 m ou5
fou1 f ou1
fou2 f ou2
fou3 f ou3
fou4 f ou4
fou5 f ou5
dou1 d ou1
dou2 d ou2
dou3 d ou3
dou4 d ou4
dou5 d ou5
tou1 t ou1
tou2 t ou2
tou3 t ou3
tou4 t ou4
tou5 t ou5
nou1 n ou1
nou2 n ou2
nou3 n ou3
nou4 n ou4
nou5 n ou5
lou1 l ou1
lou2 l ou2
lou3 l ou3
lou4 l ou4
lou5 l ou5
gou1 g ou1
gou2 g ou2
gou3 g ou3
gou4 g ou4
gou5 g ou5
kou1 k ou1
kou2 k ou2
kou3 k ou3
kou4 k ou4
kou5 k ou5
hou1 h ou1
hou2 h ou2
hou3 h ou3
hou4 h ou4
hou5 h ou5
zhou1 zh ou1
zhou2 zh ou2
zhou3 zh ou3
zhou4 zh ou4
zhou5 zh ou5
chou1 ch ou1
chou2 ch ou2
chou3 ch ou3
chou4 ch ou4
chou5 ch ou5
shou1 sh ou1
shou2 sh ou2
shou3 sh ou3
shou4 sh ou4
shou5 sh ou5
rou1 r ou1
rou2 r ou2
rou3 r ou3
rou4 r ou4
rou5 r ou5
zou1 z ou1
zou2 z ou2
zou3 z ou3
zou4 z ou4
zou5 z ou5
cou1 c ou1
cou2 c ou2
cou3 c ou3
cou4 c ou4
cou5 c ou5
sou1 s ou1
sou2 s ou2
sou3 s ou3
sou4 s ou4
sou5 s ou5
an1 an1
an2 an2
an3 an3
an4 an4
an5 an5
ban1 b an1
ban2 b an2
ban3 b an3
ban4 b an4
ban5 b an5
pan1 p an1
pan2 p an2
pan3 p an3
pan4 p an4
pan5 p an5
man1 m an1
man2 m an2
man3 m an3
man4 m an4
man5 m an5
fan1 f an1
fan2 f an2
fan3 f an3
fan4 f an4
fan5 f an5
dan1 d an1
dan2 d an2
dan3 d an3
dan4 d an4
dan5 d an5
tan1 t an1
tan2 t an2
tan3 t an3
tan4 t an4
tan5 t an5
nan1 n an1
nan2 n an2
nan3 n an3
nan4 n an4
nan5 n an5
lan1 l an1
lan2 l an2
lan3 l an3
lan4 l an4
lan5 l an5
gan1 g an1
gan2 g an2
gan3 g an3
gan4 g an4
gan5 g an5
kan1 k an1
kan2 k an2
kan3 k an3
kan4 k an4
kan5 k an5
han1 h an1
han2 h an2
han3 h an3
han4 h an4
han5 h an5
zhan1 zh an1
zhan2 zh an2
zhan3 zh an3
zhan4 zh an4
zhan5 zh an5
chan1 ch an1
chan2 ch an2
chan3 ch an3
chan4 ch an4
chan5 ch an5
shan1 sh an1
shan2 sh an2
shan3 sh an3
shan4 sh an4
shan5 sh an5
ran1 r an1
ran2 r an2
ran3 r an3
ran4 r an4
ran5 r an5
zan1 z an1
zan2 z an2
zan3 z an3
zan4 z an4
zan5 z an5
can1 c an1
can2 c an2
can3 c an3
can4 c an4
can5 c an5
san1 s an1
san2 s an2
san3 s an3
san4 s an4
san5 s an5
en1 en1
en2 en2
en3 en3
en4 en4
en5 en5
ben1 b en1
ben2 b en2
ben3 b en3
ben4 b en4
ben5 b en5
pen1 p en1
pen2 p en2
pen3 p en3
pen4 p en4
pen5 p en5
men1 m en1
men2 m en2
men3 m en3
men4 m en4
men5 m en5
fen1 f en1
fen2 f en2
fen3 f en3
fen4 f en4
fen5 f en5
den1 d en1
den2 d en2
den3 d en3
den4 d en4
den5 d en5
nen1 n en1
nen2 n en2
nen3 n en3
nen4 n en4
nen5 n en5
gen1 g en1
gen2 g en2
gen3 g en3
gen4 g en4
gen5 g en5
ken1 k en1
ken2 k en2
ken3 k en3
ken4 k en4
ken5 k en5
hen1 h en1
hen2 h en2
hen3 h en3
hen4 h en4
hen5 h en5
zhen1 zh en1
zhen2 zh en2
zhen3 zh en3
zhen4 zh en4
zhen5 zh en5
chen1 ch en1
chen2 ch en2
chen3 ch en3
chen4 ch en4
chen5 ch en5
shen1 sh en1
shen2 sh en2
shen3 sh en3
shen4 sh en4
shen5 sh en5
ren1 r en1
ren2 r en2
ren3 r en3
ren4 r en4
ren5 r en5
zen1 z en1
zen2 z en2
zen3 z en3
zen4 z en4
zen5 z en5
cen1 c en1
cen2 c en2
cen3 c en3
cen4 c en4
cen5 c en5
sen1 s en1
sen2 s en2
sen3 s en3
sen4 s en4
sen5 s en5
ang1 ang1
ang2 ang2
ang3 ang3
ang4 ang4
ang5 ang5
bang1 b ang1
bang2 b ang2
bang3 b ang3
bang4 b ang4
bang5 b ang5
pang1 p ang1
pang2 p ang2
pang3 p ang3
pang4 p ang4
pang5 p ang5
mang1 m ang1
mang2 m ang2
mang3 m ang3
mang4 m ang4
mang5 m ang5
fang1 f ang1
fang2 f ang2
fang3 f ang3
fang4 f ang4
fang5 f ang5
dang1 d ang1
dang2 d ang2
dang3 d ang3
dang4 d ang4
dang5 d ang5
tang1 t ang1
tang2 t ang2
tang3 t ang3
tang4 t ang4
tang5 t ang5
nang1 n ang1
nang2 n ang2
nang3 n ang3
nang4 n ang4
nang5 n ang5
lang1 l ang1
lang2 l ang2
lang3 l ang3
lang4 l ang4
lang5 l ang5
gang1 g ang1
gang2 g ang2
gang3 g ang3
gang4 g ang4
gang5 g ang5
kang1 k ang1
kang2 k ang2
kang3 k ang3
kang4 k ang4
kang5 k ang5
hang1 h ang1
hang2 h ang2
hang3 h ang3
hang4 h ang4
hang5 h ang5
zhang1 zh ang1
zhang2 zh ang2
zhang3 zh ang3
zhang4 zh ang4
zhang5 zh ang5
chang1 ch ang1
chang2 ch ang2
chang3 ch ang3
chang4 ch ang4
chang5 ch ang5
shang1 sh ang1
shang2 sh ang2
shang3 sh ang3
shang4 sh ang4
shang5 sh ang5
rang1 r ang1
rang2 r ang2
rang3 r ang3
rang4 r ang4
rang5 r ang5
zang1 z ang1
zang2 z ang2
zang3 z ang3
zang4 z ang4
zang5 z ang5
cang1 c ang1
cang2 c ang2
cang3 c ang3
cang4 c ang4
cang5 c ang5
sang1 s ang1
sang2 s ang2
sang3 s ang3
sang4 s ang4
sang5 s ang5
eng1 eng1
eng2 eng2
eng3 eng3
eng4 eng4
eng5 eng5
beng1 b eng1
beng2 b eng2
beng3 b eng3
beng4 b eng4
beng5 b eng5
peng1 p eng1
peng2 p eng2
peng3 p eng3
peng4 p eng4
peng5 p eng5
meng1 m eng1
meng2 m eng2
meng3 m eng3
meng4 m eng4
meng5 m eng5
feng1 f eng1
feng2 f eng2
feng3 f eng3
feng4 f eng4
feng5 f eng5
deng1 d eng1
deng2 d eng2
deng3 d eng3
deng4 d eng4
deng5 d eng5
teng1 t eng1
teng2 t eng2
teng3 t eng3
teng4 t eng4
teng5 t eng5
neng1 n eng1
neng2 n eng2
neng3 n eng3
neng4 n eng4
neng5 n eng5
leng1 l eng1
leng2 l eng2
leng3 l eng3
leng4 l eng4
leng5 l eng5
geng1 g eng1
geng2 g eng2
geng3 g eng3
geng4 g eng4
geng5 g eng5
keng1 k eng1
keng2 k eng2
keng3 k eng3
keng4 k eng4
keng5 k eng5
heng1 h eng1
heng2 h eng2
heng3 h eng3
heng4 h eng4
heng5 h eng5
zheng1 zh eng1
zheng2 zh eng2
zheng3 zh eng3
zheng4 zh eng4
zheng5 zh eng5
cheng1 ch eng1
cheng2 ch eng2
cheng3 ch eng3
cheng4 ch eng4
cheng5 ch eng5
sheng1 sh eng1
sheng2 sh eng2
sheng3 sh eng3
sheng4 sh eng4
sheng5 sh eng5
reng1 r eng1
reng2 r eng2
reng3 r eng3
reng4 r eng4
reng5 r eng5
zeng1 z eng1
zeng2 z eng2
zeng3 z eng3
zeng4 z eng4
zeng5 z eng5
ceng1 c eng1
ceng2 c eng2
ceng3 c eng3
ceng4 c eng4
ceng5 c eng5
seng1 s eng1
seng2 s eng2
seng3 s eng3
seng4 s eng4
seng5 s eng5
er1 er1
er2 er2
er3 er3
er4 er4
er5 er5
yi1 y i1
yi2 y i2
yi3 y i3
yi4 y i4
yi5 y i5
bi1 b i1
bi2 b i2
bi3 b i3
bi4 b i4
bi5 b i5
pi1 p i1
pi2 p i2
pi3 p i3
pi4 p i4
pi5 p i5
mi1 m i1
mi2 m i2
mi3 m i3
mi4 m i4
mi5 m i5
di1 d i1
di2 d i2
di3 d i3
di4 d i4
di5 d i5
ti1 t i1
ti2 t i2
ti3 t i3
ti4 t i4
ti5 t i5
ni1 n i1
ni2 n i2
ni3 n i3
ni4 n i4
ni5 n i5
li1 l i1
li2 l i2
li3 l i3
li4 l i4
li5 l i5
ji1 j i1
ji2 j i2
ji3 j i3
ji4 j i4
ji5 j i5
qi1 q i1
qi2 q i2
qi3 q i3
qi4 q i4
qi5 q i5
xi1 x i1
xi2 x i2
xi3 x i3
xi4 x i4
xi5 x i5
ya1 y ia1
ya2 y ia2
ya3 y ia3
ya4 y ia4
ya5 y ia5
dia1 d ia1
dia2 d ia2
dia3 d ia3
dia4 d ia4
dia5 d ia5
lia1 l ia1
lia2 l ia2
lia3 l ia3
lia4 l ia4
lia5 l ia5
jia1 j ia1
jia2 j ia2
jia3 j ia3
jia4 j ia4
jia5 j ia5
qia1 q ia1
qia2 q ia2
qia3 q ia3
qia4 q ia4
qia5 q ia5
xia1 x ia1
xia2 x ia2
xia3 x ia3
xia4 x ia4
xia5 x ia5
yo1 y io1
yo2 y io2
yo3 y io3
yo4 y io4
yo5 y io5
ye1 y ie1
ye2 y ie2
ye3 y ie3
ye4 y ie4
ye5 y ie5
bie1 b ie1
bie2 b ie2
bie3 b ie3
bie4 b ie4
bie5 b ie5
pie1 p ie1
pie2 p ie2
pie3 p ie3
pie4 p ie4
pie5 p ie5
mie1 m ie1
mie2 m ie2
mie3 m ie3
mie4 m ie4
mie5 m ie5
die1 d ie1
die2 d ie2
die3 d ie3
die4 d ie4
die5 d ie5
tie1 t ie1
tie2 t ie2
tie3 t ie3
tie4 t ie4
tie5 t ie5
nie1 n ie1
nie2 n ie2
nie3 n ie3
nie4 n ie4
nie5 n ie5
lie1 l ie1
lie2 l ie2
lie3 l ie3
lie4 l ie4
lie5 l ie5
jie1 j ie1
jie2 j ie2
jie3 j ie3
jie4 j ie4
jie5 j ie5
qie1 q ie1
qie2 q ie2
qie3 q ie3
qie4 q ie4
qie5 q ie5
xie1 x ie1
xie2 x ie2
xie3 x ie3
xie4 x ie4
xie5 x ie5
yai1 y ai1
yai2 y ai2
yai3 y ai3
yai4 y ai4
yai5 y ai5
yao1 y au1
yao2 y au2
yao3 y au3
yao4 y au4
yao5 y au5
biao1 b iau1
biao2 b iau2
biao3 b iau3
biao4 b iau4
biao5 b iau5
piao1 p iau1
piao2 p iau2
piao3 p iau3
piao4 p iau4
piao5 p iau5
miao1 m iau1
miao2 m iau2
miao3 m iau3
miao4 m iau4
miao5 m iau5
fiao1 f iau1
fiao2 f iau2
fiao3 f iau3
fiao4 f iau4
fiao5 f iau5
diao1 d iau1
diao2 d iau2
diao3 d iau3
diao4 d iau4
diao5 d iau5
tiao1 t iau1
tiao2 t iau2
tiao3 t iau3
tiao4 t iau4
tiao5 t iau5
niao1 n iau1
niao2 n iau2
niao3 n iau3
niao4 n iau4
niao5 n iau5
liao1 l iau1
liao2 l iau2
liao3 l iau3
liao4 l iau4
liao5 l iau5
jiao1 j iau1
jiao2 j iau2
jiao3 j iau3
jiao4 j iau4
jiao5 j iau5
qiao1 q iau1
qiao2 q iau2
qiao3 q iau3
qiao4 q iau4
qiao5 q iau5
xiao1 x iau1
xiao2 x iau2
xiao3 x iau3
xiao4 x iau4
xiao5 x iau5
you1 y iou1
you2 y iou2
you3 y iou3
you4 y iou4
you5 y iou5
miu1 m iou1
miu2 m iou2
miu3 m iou3
miu4 m iou4
miu5 m iou5
diu1 d iou1
diu2 d iou2
diu3 d iou3
diu4 d iou4
diu5 d iou5
niu1 n iou1
niu2 n iou2
niu3 n iou3
niu4 n iou4
niu5 n iou5
liu1 l iou1
liu2 l iou2
liu3 l iou3
liu4 l iou4
liu5 l iou5
jiu1 j iou1
jiu2 j iou2
jiu3 j iou3
jiu4 j iou4
jiu5 j iou5
qiu1 q iou1
qiu2 q iou2
qiu3 q iou3
qiu4 q iou4
qiu5 q iou5
xiu1 x iou1
xiu2 x iou2
xiu3 x iou3
xiu4 x iou4
xiu5 x iou5
yan1 y ian1
yan2 y ian2
yan3 y ian3
yan4 y ian4
yan5 y ian5
bian1 b ian1
bian2 b ian2
bian3 b ian3
bian4 b ian4
bian5 b ian5
pian1 p ian1
pian2 p ian2
pian3 p ian3
pian4 p ian4
pian5 p ian5
mian1 m ian1
mian2 m ian2
mian3 m ian3
mian4 m ian4
mian5 m ian5
dian1 d ian1
dian2 d ian2
dian3 d ian3
dian4 d ian4
dian5 d ian5
tian1 t ian1
tian2 t ian2
tian3 t ian3
tian4 t ian4
tian5 t ian5
nian1 n ian1
nian2 n ian2
nian3 n ian3
nian4 n ian4
nian5 n ian5
lian1 l ian1
lian2 l ian2
lian3 l ian3
lian4 l ian4
lian5 l ian5
jian1 j ian1
jian2 j ian2
jian3 j ian3
jian4 j ian4
jian5 j ian5
qian1 q ian1
qian2 q ian2
qian3 q ian3
qian4 q ian4
qian5 q ian5
xian1 x ian1
xian2 x ian2
xian3 x ian3
xian4 x ian4
xian5 x ian5
yin1 y in1
yin2 y in2
yin3 y in3
yin4 y in4
yin5 y in5
bin1 b in1
bin2 b in2
bin3 b in3
bin4 b in4
bin5 b in5
pin1 p in1
pin2 p in2
pin3 p in3
pin4 p in4
pin5 p in5
min1 m in1
min2 m in2
min3 m in3
min4 m in4
min5 m in5
din1 d in1
din2 d in2
din3 d in3
din4 d in4
din5 d in5
nin1 n in1
nin2 n in2
nin3 n in3
nin4 n in4
nin5 n in5
lin1 l in1
lin2 l in2
lin3 l in3
lin4 l in4
lin5 l in5
jin1 j in1
jin2 j in2
jin3 j in3
jin4 j in4
jin5 j in5
qin1 q in1
qin2 q in2
qin3 q in3
qin4 q in4
qin5 q in5
xin1 x in1
xin2 x in2
xin3 x in3
xin4 x in4
xin5 x in5
yang1 y iang1
yang2 y iang2
yang3 y iang3
yang4 y iang4
yang5 y iang5
biang1 b iang1
biang2 b iang2
biang3 b iang3
biang4 b iang4
biang5 b iang5
niang1 n iang1
niang2 n iang2
niang3 n iang3
niang4 n iang4
niang5 n iang5
liang1 l iang1
liang2 l iang2
liang3 l iang3
liang4 l iang4
liang5 l iang5
jiang1 j iang1
jiang2 j iang2
jiang3 j iang3
jiang4 j iang4
jiang5 j iang5
qiang1 q iang1
qiang2 q iang2
qiang3 q iang3
qiang4 q iang4
qiang5 q iang5
xiang1 x iang1
xiang2 x iang2
xiang3 x iang3
xiang4 x iang4
xiang5 x iang5
ying1 y ing1
ying2 y ing2
ying3 y ing3
ying4 y ing4
ying5 y ing5
bing1 b ing1
bing2 b ing2
bing3 b ing3
bing4 b ing4
bing5 b ing5
ping1 p ing1
ping2 p ing2
ping3 p ing3
ping4 p ing4
ping5 p ing5
ming1 m ing1
ming2 m ing2
ming3 m ing3
ming4 m ing4
ming5 m ing5
ding1 d ing1
ding2 d ing2
ding3 d ing3
ding4 d ing4
ding5 d ing5
ting1 t ing1
ting2 t ing2
ting3 t ing3
ting4 t ing4
ting5 t ing5
ning1 n ing1
ning2 n ing2
ning3 n ing3
ning4 n ing4
ning5 n ing5
ling1 l ing1
ling2 l ing2
ling3 l ing3
ling4 l ing4
ling5 l ing5
jing1 j ing1
jing2 j ing2
jing3 j ing3
jing4 j ing4
jing5 j ing5
qing1 q ing1
qing2 q ing2
qing3 q ing3
qing4 q ing4
qing5 q ing5
xing1 x ing1
xing2 x ing2
xing3 x ing3
xing4 x ing4
xing5 x ing5
wu1 w u1
wu2 w u2
wu3 w u3
wu4 w u4
wu5 w u5
bu1 b u1
bu2 b u2
bu3 b u3
bu4 b u4
bu5 b u5
pu1 p u1
pu2 p u2
pu3 p u3
pu4 p u4
pu5 p u5
mu1 m u1
mu2 m u2
mu3 m u3
mu4 m u4
mu5 m u5
fu1 f u1
fu2 f u2
fu3 f u3
fu4 f u4
fu5 f u5
du1 d u1
du2 d u2
du3 d u3
du4 d u4
du5 d u5
tu1 t u1
tu2 t u2
tu3 t u3
tu4 t u4
tu5 t u5
nu1 n u1
nu2 n u2
nu3 n u3
nu4 n u4
nu5 n u5
lu1 l u1
lu2 l u2
lu3 l u3
lu4 l u4
lu5 l u5
gu1 g u1
gu2 g u2
gu3 g u3
gu4 g u4
gu5 g u5
ku1 k u1
ku2 k u2
ku3 k u3
ku4 k u4
ku5 k u5
hu1 h u1
hu2 h u2
hu3 h u3
hu4 h u4
hu5 h u5
zhu1 zh u1
zhu2 zh u2
zhu3 zh u3
zhu4 zh u4
zhu5 zh u5
chu1 ch u1
chu2 ch u2
chu3 ch u3
chu4 ch u4
chu5 ch u5
shu1 sh u1
shu2 sh u2
shu3 sh u3
shu4 sh u4
shu5 sh u5
ru1 r u1
ru2 r u2
ru3 r u3
ru4 r u4
ru5 r u5
zu1 z u1
zu2 z u2
zu3 z u3
zu4 z u4
zu5 z u5
cu1 c u1
cu2 c u2
cu3 c u3
cu4 c u4
cu5 c u5
su1 s u1
su2 s u2
su3 s u3
su4 s u4
su5 s u5
wa1 w ua1
wa2 w ua2
wa3 w ua3
wa4 w ua4
wa5 w ua5
gua1 g ua1
gua2 g ua2
gua3 g ua3
gua4 g ua4
gua5 g ua5
kua1 k ua1
kua2 k ua2
kua3 k ua3
kua4 k ua4
kua5 k ua5
hua1 h ua1
hua2 h ua2
hua3 h ua3
hua4 h ua4
hua5 h ua5
zhua1 zh ua1
zhua2 zh ua2
zhua3 zh ua3
zhua4 zh ua4
zhua5 zh ua5
chua1 ch ua1
chua2 ch ua2
chua3 ch ua3
chua4 ch ua4
chua5 ch ua5
shua1 sh ua1
shua2 sh ua2
shua3 sh ua3
shua4 sh ua4
shua5 sh ua5
wo1 w uo1
wo2 w uo2
wo3 w uo3
wo4 w uo4
wo5 w uo5
duo1 d uo1
duo2 d uo2
duo3 d uo3
duo4 d uo4
duo5 d uo5
tuo1 t uo1
tuo2 t uo2
tuo3 t uo3
tuo4 t uo4
tuo5 t uo5
nuo1 n uo1
nuo2 n uo2
nuo3 n uo3
nuo4 n uo4
nuo5 n uo5
luo1 l uo1
luo2 l uo2
luo3 l uo3
luo4 l uo4
luo5 l uo5
guo1 g uo1
guo2 g uo2
guo3 g uo3
guo4 g uo4
guo5 g uo5
kuo1 k uo1
kuo2 k uo2
kuo3 k uo3
kuo4 k uo4
kuo5 k uo5
huo1 h uo1
huo2 h uo2
huo3 h uo3
huo4 h uo4
huo5 h uo5
zhuo1 zh uo1
zhuo2 zh uo2
zhuo3 zh uo3
zhuo4 zh uo4
zhuo5 zh uo5
chuo1 ch uo1
chuo2 ch uo2
chuo3 ch uo3
chuo4 ch uo4
chuo5 ch uo5
shuo1 sh uo1
shuo2 sh uo2
shuo3 sh uo3
shuo4 sh uo4
shuo5 sh uo5
ruo1 r uo1
ruo2 r uo2
ruo3 r uo3
ruo4 r uo4
ruo5 r uo5
zuo1 z uo1
zuo2 z uo2
zuo3 z uo3
zuo4 z uo4
zuo5 z uo5
cuo1 c uo1
cuo2 c uo2
cuo3 c uo3
cuo4 c uo4
cuo5 c uo5
suo1 s uo1
suo2 s uo2
suo3 s uo3
suo4 s uo4
suo5 s uo5
wai1 w uai1
wai2 w uai2
wai3 w uai3
wai4 w uai4
wai5 w uai5
guai1 g uai1
guai2 g uai2
guai3 g uai3
guai4 g uai4
guai5 g uai5
kuai1 k uai1
kuai2 k uai2
kuai3 k uai3
kuai4 k uai4
kuai5 k uai5
huai1 h uai1
huai2 h uai2
huai3 h uai3
huai4 h uai4
huai5 h uai5
zhuai1 zh uai1
zhuai2 zh uai2
zhuai3 zh uai3
zhuai4 zh uai4
zhuai5 zh uai5
chuai1 ch uai1
chuai2 ch uai2
chuai3 ch uai3
chuai4 ch uai4
chuai5 ch uai5
shuai1 sh uai1
shuai2 sh uai2
shuai3 sh uai3
shuai4 sh uai4
shuai5 sh uai5
wei1 w uei1
wei2 w uei2
wei3 w uei3
wei4 w uei4
wei5 w uei5
dui1 d uei1
dui2 d uei2
dui3 d uei3
dui4 d uei4
dui5 d uei5
tui1 t uei1
tui2 t uei2
tui3 t uei3
tui4 t uei4
tui5 t uei5
gui1 g uei1
gui2 g uei2
gui3 g uei3
gui4 g uei4
gui5 g uei5
kui1 k uei1
kui2 k uei2
kui3 k uei3
kui4 k uei4
kui5 k uei5
hui1 h uei1
hui2 h uei2
hui3 h uei3
hui4 h uei4
hui5 h uei5
zhui1 zh uei1
zhui2 zh uei2
zhui3 zh uei3
zhui4 zh uei4
zhui5 zh uei5
chui1 ch uei1
chui2 ch uei2
chui3 ch uei3
chui4 ch uei4
chui5 ch uei5
shui1 sh uei1
shui2 sh uei2
shui3 sh uei3
shui4 sh uei4
shui5 sh uei5
rui1 r uei1
rui2 r uei2
rui3 r uei3
rui4 r uei4
rui5 r uei5
zui1 z uei1
zui2 z uei2
zui3 z uei3
zui4 z uei4
zui5 z uei5
cui1 c uei1
cui2 c uei2
cui3 c uei3
cui4 c uei4
cui5 c uei5
sui1 s uei1
sui2 s uei2
sui3 s uei3
sui4 s uei4
sui5 s uei5
wan1 w uan1
wan2 w uan2
wan3 w uan3
wan4 w uan4
wan5 w uan5
duan1 d uan1
duan2 d uan2
duan3 d uan3
duan4 d uan4
duan5 d uan5
tuan1 t uan1
tuan2 t uan2
tuan3 t uan3
tuan4 t uan4
tuan5 t uan5
nuan1 n uan1
nuan2 n uan2
nuan3 n uan3
nuan4 n uan4
nuan5 n uan5
luan1 l uan1
luan2 l uan2
luan3 l uan3
luan4 l uan4
luan5 l uan5
guan1 g uan1
guan2 g uan2
guan3 g uan3
guan4 g uan4
guan5 g uan5
kuan1 k uan1
kuan2 k uan2
kuan3 k uan3
kuan4 k uan4
kuan5 k uan5
huan1 h uan1
huan2 h uan2
huan3 h uan3
huan4 h uan4
huan5 h uan5
zhuan1 zh uan1
zhuan2 zh uan2
zhuan3 zh uan3
zhuan4 zh uan4
zhuan5 zh uan5
chuan1 ch uan1
chuan2 ch uan2
chuan3 ch uan3
chuan4 ch uan4
chuan5 ch uan5
shuan1 sh uan1
shuan2 sh uan2
shuan3 sh uan3
shuan4 sh uan4
shuan5 sh uan5
ruan1 r uan1
ruan2 r uan2
ruan3 r uan3
ruan4 r uan4
ruan5 r uan5
zuan1 z uan1
zuan2 z uan2
zuan3 z uan3
zuan4 z uan4
zuan5 z uan5
cuan1 c uan1
cuan2 c uan2
cuan3 c uan3
cuan4 c uan4
cuan5 c uan5
suan1 s uan1
suan2 s uan2
suan3 s uan3
suan4 s uan4
suan5 s uan5
wen1 w uen1
wen2 w uen2
wen3 w uen3
wen4 w uen4
wen5 w uen5
dun1 d uen1
dun2 d uen2
dun3 d uen3
dun4 d uen4
dun5 d uen5
tun1 t uen1
tun2 t uen2
tun3 t uen3
tun4 t uen4
tun5 t uen5
nun1 n uen1
nun2 n uen2
nun3 n uen3
nun4 n uen4
nun5 n uen5
lun1 l uen1
lun2 l uen2
lun3 l uen3
lun4 l uen4
lun5 l uen5
gun1 g uen1
gun2 g uen2
gun3 g uen3
gun4 g uen4
gun5 g uen5
kun1 k uen1
kun2 k uen2
kun3 k uen3
kun4 k uen4
kun5 k uen5
hun1 h uen1
hun2 h uen2
hun3 h uen3
hun4 h uen4
hun5 h uen5
zhun1 zh uen1
zhun2 zh uen2
zhun3 zh uen3
zhun4 zh uen4
zhun5 zh uen5
chun1 ch uen1
chun2 ch uen2
chun3 ch uen3
chun4 ch uen4
chun5 ch uen5
shun1 sh uen1
shun2 sh uen2
shun3 sh uen3
shun4 sh uen4
shun5 sh uen5
run1 r uen1
run2 r uen2
run3 r uen3
run4 r uen4
run5 r uen5
zun1 z uen1
zun2 z uen2
zun3 z uen3
zun4 z uen4
zun5 z uen5
cun1 c uen1
cun2 c uen2
cun3 c uen3
cun4 c uen4
cun5 c uen5
sun1 s uen1
sun2 s uen2
sun3 s uen3
sun4 s uen4
sun5 s uen5
wang1 w uang1
wang2 w uang2
wang3 w uang3
wang4 w uang4
wang5 w uang5
guang1 g uang1
guang2 g uang2
guang3 g uang3
guang4 g uang4
guang5 g uang5
kuang1 k uang1
kuang2 k uang2
kuang3 k uang3
kuang4 k uang4
kuang5 k uang5
huang1 h uang1
huang2 h uang2
huang3 h uang3
huang4 h uang4
huang5 h uang5
zhuang1 zh uang1
zhuang2 zh uang2
zhuang3 zh uang3
zhuang4 zh uang4
zhuang5 zh uang5
chuang1 ch uang1
chuang2 ch uang2
chuang3 ch uang3
chuang4 ch uang4
chuang5 ch uang5
shuang1 sh uang1
shuang2 sh uang2
shuang3 sh uang3
shuang4 sh uang4
shuang5 sh uang5
weng1 w ung1
weng2 w ung2
weng3 w ung3
weng4 w ung4
weng5 w ung5
dong1 d ung1
dong2 d ung2
dong3 d ung3
dong4 d ung4
dong5 d ung5
tong1 t ung1
tong2 t ung2
tong3 t ung3
tong4 t ung4
tong5 t ung5
nong1 n ung1
nong2 n ung2
nong3 n ung3
nong4 n ung4
nong5 n ung5
long1 l ung1
long2 l ung2
long3 l ung3
long4 l ung4
long5 l ung5
gong1 g ung1
gong2 g ung2
gong3 g ung3
gong4 g ung4
gong5 g ung5
kong1 k ung1
kong2 k ung2
kong3 k ung3
kong4 k ung4
kong5 k ung5
hong1 h ung1
hong2 h ung2
hong3 h ung3
hong4 h ung4
hong5 h ung5
zhong1 zh ung1
zhong2 zh ung2
zhong3 zh ung3
zhong4 zh ung4
zhong5 zh ung5
chong1 ch ung1
chong2 ch ung2
chong3 ch ung3
chong4 ch ung4
chong5 ch ung5
rong1 r ung1
rong2 r ung2
rong3 r ung3
rong4 r ung4
rong5 r ung5
zong1 z ung1
zong2 z ung2
zong3 z ung3
zong4 z ung4
zong5 z ung5
cong1 c ung1
cong2 c ung2
cong3 c ung3
cong4 c ung4
cong5 c ung5
song1 s ung1
song2 s ung2
song3 s ung3
song4 s ung4
song5 s ung5
yu1 y v1
yu2 y v2
yu3 y v3
yu4 y v4
yu5 y v5
nv1 n v1
nv2 n v2
nv3 n v3
nv4 n v4
nv5 n v5
lv1 l v1
lv2 l v2
lv3 l v3
lv4 l v4
lv5 l v5
ju1 j v1
ju2 j v2
ju3 j v3
ju4 j v4
ju5 j v5
qu1 q v1
qu2 q v2
qu3 q v3
qu4 q v4
qu5 q v5
xu1 x v1
xu2 x v2
xu3 x v3
xu4 x v4
xu5 x v5
yue1 y ve1
yue2 y ve2
yue3 y ve3
yue4 y ve4
yue5 y ve5
nue1 n ve1
nue2 n ve2
nue3 n ve3
nue4 n ve4
nue5 n ve5
nve1 n ve1
nve2 n ve2
nve3 n ve3
nve4 n ve4
nve5 n ve5
lue1 l ve1
lue2 l ve2
lue3 l ve3
lue4 l ve4
lue5 l ve5
lve1 l ve1
lve2 l ve2
lve3 l ve3
lve4 l ve4
lve5 l ve5
jue1 j ve1
jue2 j ve2
jue3 j ve3
jue4 j ve4
jue5 j ve5
que1 q ve1
que2 q ve2
que3 q ve3
que4 q ve4
que5 q ve5
xue1 x ve1
xue2 x ve2
xue3 x ve3
xue4 x ve4
xue5 x ve5
yuan1 y van1
yuan2 y van2
yuan3 y van3
yuan4 y van4
yuan5 y van5
juan1 j van1
juan2 j van2
juan3 j van3
juan4 j van4
juan5 j van5
quan1 q van1
quan2 q van2
quan3 q van3
quan4 q van4
quan5 q van5
xuan1 x van1
xuan2 x van2
xuan3 x van3
xuan4 x van4
xuan5 x van5
yun1 y vn1
yun2 y vn2
yun3 y vn3
yun4 y vn4
yun5 y vn5
jun1 j vn1
jun2 j vn2
jun3 j vn3
jun4 j vn4
jun5 j vn5
qun1 q vn1
qun2 q vn2
qun3 q vn3
qun4 q vn4
qun5 q vn5
xun1 x vn1
xun2 x vn2
xun3 x vn3
xun4 x vn4
xun5 x vn5
yong1 y vng1
yong2 y vng2
yong3 y vng3
yong4 y vng4
yong5 y vng5
jiong1 j vng1
jiong2 j vng2
jiong3 j vng3
jiong4 j vng4
jiong5 j vng5
qiong1 q vng1
qiong2 q vng2
qiong3 q vng3
qiong4 q vng4
qiong5 q vng5
xiong1 x vng1
xiong2 x vng2
xiong3 x vng3
xiong4 x vng4
xiong5 x vng5
zhir1 zh iii1 &r
zhir2 zh iii2 &r
zhir3 zh iii3 &r
zhir4 zh iii4 &r
zhir5 zh iii5 &r
chir1 ch iii1 &r
chir2 ch iii2 &r
chir3 ch iii3 &r
chir4 ch iii4 &r
chir5 ch iii5 &r
shir1 sh iii1 &r
shir2 sh iii2 &r
shir3 sh iii3 &r
shir4 sh iii4 &r
shir5 sh iii5 &r
rir1 r iii1 &r
rir2 r iii2 &r
rir3 r iii3 &r
rir4 r iii4 &r
rir5 r iii5 &r
zir1 z ii1 &r
zir2 z ii2 &r
zir3 z ii3 &r
zir4 z ii4 &r
zir5 z ii5 &r
cir1 c ii1 &r
cir2 c ii2 &r
cir3 c ii3 &r
cir4 c ii4 &r
cir5 c ii5 &r
sir1 s ii1 &r
sir2 s ii2 &r
sir3 s ii3 &r
sir4 s ii4 &r
sir5 s ii5 &r
ar1 a1 &r
ar2 a2 &r
ar3 a3 &r
ar4 a4 &r
ar5 a5 &r
bar1 b a1 &r
bar2 b a2 &r
bar3 b a3 &r
bar4 b a4 &r
bar5 b a5 &r
par1 p a1 &r
par2 p a2 &r
par3 p a3 &r
par4 p a4 &r
par5 p a5 &r
mar1 m a1 &r
mar2 m a2 &r
mar3 m a3 &r
mar4 m a4 &r
mar5 m a5 &r
far1 f a1 &r
far2 f a2 &r
far3 f a3 &r
far4 f a4 &r
far5 f a5 &r
dar1 d a1 &r
dar2 d a2 &r
dar3 d a3 &r
dar4 d a4 &r
dar5 d a5 &r
tar1 t a1 &r
tar2 t a2 &r
tar3 t a3 &r
tar4 t a4 &r
tar5 t a5 &r
nar1 n a1 &r
nar2 n a2 &r
nar3 n a3 &r
nar4 n a4 &r
nar5 n a5 &r
lar1 l a1 &r
lar2 l a2 &r
lar3 l a3 &r
lar4 l a4 &r
lar5 l a5 &r
gar1 g a1 &r
gar2 g a2 &r
gar3 g a3 &r
gar4 g a4 &r
gar5 g a5 &r
kar1 k a1 &r
kar2 k a2 &r
kar3 k a3 &r
kar4 k a4 &r
kar5 k a5 &r
har1 h a1 &r
har2 h a2 &r
har3 h a3 &r
har4 h a4 &r
har5 h a5 &r
zhar1 zh a1 &r
zhar2 zh a2 &r
zhar3 zh a3 &r
zhar4 zh a4 &r
zhar5 zh a5 &r
char1 ch a1 &r
char2 ch a2 &r
char3 ch a3 &r
char4 ch a4 &r
char5 ch a5 &r
shar1 sh a1 &r
shar2 sh a2 &r
shar3 sh a3 &r
shar4 sh a4 &r
shar5 sh a5 &r
zar1 z a1 &r
zar2 z a2 &r
zar3 z a3 &r
zar4 z a4 &r
zar5 z a5 &r
car1 c a1 &r
car2 c a2 &r
car3 c a3 &r
car4 c a4 &r
car5 c a5 &r
sar1 s a1 &r
sar2 s a2 &r
sar3 s a3 &r
sar4 s a4 &r
sar5 s a5 &r
or1 o1 &r
or2 o2 &r
or3 o3 &r
or4 o4 &r
or5 o5 &r
bor1 b uo1 &r
bor2 b uo2 &r
bor3 b uo3 &r
bor4 b uo4 &r
bor5 b uo5 &r
por1 p uo1 &r
por2 p uo2 &r
por3 p uo3 &r
por4 p uo4 &r
por5 p uo5 &r
mor1 m uo1 &r
mor2 m uo2 &r
mor3 m uo3 &r
mor4 m uo4 &r
mor5 m uo5 &r
for1 f uo1 &r
for2 f uo2 &r
for3 f uo3 &r
for4 f uo4 &r
for5 f uo5 &r
lor1 l o1 &r
lor2 l o2 &r
lor3 l o3 &r
lor4 l o4 &r
lor5 l o5 &r
mer1 m e1 &r
mer2 m e2 &r
mer3 m e3 &r
mer4 m e4 &r
mer5 m e5 &r
der1 d e1 &r
der2 d e2 &r
der3 d e3 &r
der4 d e4 &r
der5 d e5 &r
ter1 t e1 &r
ter2 t e2 &r
ter3 t e3 &r
ter4 t e4 &r
ter5 t e5 &r
ner1 n e1 &r
ner2 n e2 &r
ner3 n e3 &r
ner4 n e4 &r
ner5 n e5 &r
ler1 l e1 &r
ler2 l e2 &r
ler3 l e3 &r
ler4 l e4 &r
ler5 l e5 &r
ger1 g e1 &r
ger2 g e2 &r
ger3 g e3 &r
ger4 g e4 &r
ger5 g e5 &r
ker1 k e1 &r
ker2 k e2 &r
ker3 k e3 &r
ker4 k e4 &r
ker5 k e5 &r
her1 h e1 &r
her2 h e2 &r
her3 h e3 &r
her4 h e4 &r
her5 h e5 &r
zher1 zh e1 &r
zher2 zh e2 &r
zher3 zh e3 &r
zher4 zh e4 &r
zher5 zh e5 &r
cher1 ch e1 &r
cher2 ch e2 &r
cher3 ch e3 &r
cher4 ch e4 &r
cher5 ch e5 &r
sher1 sh e1 &r
sher2 sh e2 &r
sher3 sh e3 &r
sher4 sh e4 &r
sher5 sh e5 &r
rer1 r e1 &r
rer2 r e2 &r
rer3 r e3 &r
rer4 r e4 &r
rer5 r e5 &r
zer1 z e1 &r
zer2 z e2 &r
zer3 z e3 &r
zer4 z e4 &r
zer5 z e5 &r
cer1 c e1 &r
cer2 c e2 &r
cer3 c e3 &r
cer4 c e4 &r
cer5 c e5 &r
ser1 s e1 &r
ser2 s e2 &r
ser3 s e3 &r
ser4 s e4 &r
ser5 s e5 &r
air1 ai1 &r
air2 ai2 &r
air3 ai3 &r
air4 ai4 &r
air5 ai5 &r
bair1 b ai1 &r
bair2 b ai2 &r
bair3 b ai3 &r
bair4 b ai4 &r
bair5 b ai5 &r
pair1 p ai1 &r
pair2 p ai2 &r
pair3 p ai3 &r
pair4 p ai4 &r
pair5 p ai5 &r
mair1 m ai1 &r
mair2 m ai2 &r
mair3 m ai3 &r
mair4 m ai4 &r
mair5 m ai5 &r
dair1 d ai1 &r
dair2 d ai2 &r
dair3 d ai3 &r
dair4 d ai4 &r
dair5 d ai5 &r
tair1 t ai1 &r
tair2 t ai2 &r
tair3 t ai3 &r
tair4 t ai4 &r
tair5 t ai5 &r
nair1 n ai1 &r
nair2 n ai2 &r
nair3 n ai3 &r
nair4 n ai4 &r
nair5 n ai5 &r
lair1 l ai1 &r
lair2 l ai2 &r
lair3 l ai3 &r
lair4 l ai4 &r
lair5 l ai5 &r
gair1 g ai1 &r
gair2 g ai2 &r
gair3 g ai3 &r
gair4 g ai4 &r
gair5 g ai5 &r
kair1 k ai1 &r
kair2 k ai2 &r
kair3 k ai3 &r
kair4 k ai4 &r
kair5 k ai5 &r
hair1 h ai1 &r
hair2 h ai2 &r
hair3 h ai3 &r
hair4 h ai4 &r
hair5 h ai5 &r
zhair1 zh ai1 &r
zhair2 zh ai2 &r
zhair3 zh ai3 &r
zhair4 zh ai4 &r
zhair5 zh ai5 &r
chair1 ch ai1 &r
chair2 ch ai2 &r
chair3 ch ai3 &r
chair4 ch ai4 &r
chair5 ch ai5 &r
shair1 sh ai1 &r
shair2 sh ai2 &r
shair3 sh ai3 &r
shair4 sh ai4 &r
shair5 sh ai5 &r
zair1 z ai1 &r
zair2 z ai2 &r
zair3 z ai3 &r
zair4 z ai4 &r
zair5 z ai5 &r
cair1 c ai1 &r
cair2 c ai2 &r
cair3 c ai3 &r
cair4 c ai4 &r
cair5 c ai5 &r
sair1 s ai1 &r
sair2 s ai2 &r
sair3 s ai3 &r
sair4 s ai4 &r
sair5 s ai5 &r
beir1 b ei1 &r
beir2 b ei2 &r
beir3 b ei3 &r
beir4 b ei4 &r
beir5 b ei5 &r
peir1 p ei1 &r
peir2 p ei2 &r
peir3 p ei3 &r
peir4 p ei4 &r
peir5 p ei5 &r
meir1 m ei1 &r
meir2 m ei2 &r
meir3 m ei3 &r
meir4 m ei4 &r
meir5 m ei5 &r
feir1 f ei1 &r
feir2 f ei2 &r
feir3 f ei3 &r
feir4 f ei4 &r
feir5 f ei5 &r
deir1 d ei1 &r
deir2 d ei2 &r
deir3 d ei3 &r
deir4 d ei4 &r
deir5 d ei5 &r
teir1 t ei1 &r
teir2 t ei2 &r
teir3 t ei3 &r
teir4 t ei4 &r
teir5 t ei5 &r
neir1 n ei1 &r
neir2 n ei2 &r
neir3 n ei3 &r
neir4 n ei4 &r
neir5 n ei5 &r
leir1 l ei1 &r
leir2 l ei2 &r
leir3 l ei3 &r
leir4 l ei4 &r
leir5 l ei5 &r
geir1 g ei1 &r
geir2 g ei2 &r
geir3 g ei3 &r
geir4 g ei4 &r
geir5 g ei5 &r
keir1 k ei1 &r
keir2 k ei2 &r
keir3 k ei3 &r
keir4 k ei4 &r
keir5 k ei5 &r
heir1 h ei1 &r
heir2 h ei2 &r
heir3 h ei3 &r
heir4 h ei4 &r
heir5 h ei5 &r
zheir1 zh ei1 &r
zheir2 zh ei2 &r
zheir3 zh ei3 &r
zheir4 zh ei4 &r
zheir5 zh ei5 &r
sheir1 sh ei1 &r
sheir2 sh ei2 &r
sheir3 sh ei3 &r
sheir4 sh ei4 &r
sheir5 sh ei5 &r
zeir1 z ei1 &r
zeir2 z ei2 &r
zeir3 z ei3 &r
zeir4 z ei4 &r
zeir5 z ei5 &r
aor1 au1 &r
aor2 au2 &r
aor3 au3 &r
aor4 au4 &r
aor5 au5 &r
baor1 b au1 &r
baor2 b au2 &r
baor3 b au3 &r
baor4 b au4 &r
baor5 b au5 &r
paor1 p au1 &r
paor2 p au2 &r
paor3 p au3 &r
paor4 p au4 &r
paor5 p au5 &r
maor1 m au1 &r
maor2 m au2 &r
maor3 m au3 &r
maor4 m au4 &r
maor5 m au5 &r
daor1 d au1 &r
daor2 d au2 &r
daor3 d au3 &r
daor4 d au4 &r
daor5 d au5 &r
taor1 t au1 &r
taor2 t au2 &r
taor3 t au3 &r
taor4 t au4 &r
taor5 t au5 &r
naor1 n au1 &r
naor2 n au2 &r
naor3 n au3 &r
naor4 n au4 &r
naor5 n au5 &r
laor1 l au1 &r
laor2 l au2 &r
laor3 l au3 &r
laor4 l au4 &r
laor5 l au5 &r
gaor1 g au1 &r
gaor2 g au2 &r
gaor3 g au3 &r
gaor4 g au4 &r
gaor5 g au5 &r
kaor1 k au1 &r
kaor2 k au2 &r
kaor3 k au3 &r
kaor4 k au4 &r
kaor5 k au5 &r
haor1 h au1 &r
haor2 h au2 &r
haor3 h au3 &r
haor4 h au4 &r
haor5 h au5 &r
zhaor1 zh au1 &r
zhaor2 zh au2 &r
zhaor3 zh au3 &r
zhaor4 zh au4 &r
zhaor5 zh au5 &r
chaor1 ch au1 &r
chaor2 ch au2 &r
chaor3 ch au3 &r
chaor4 ch au4 &r
chaor5 ch au5 &r
shaor1 sh au1 &r
shaor2 sh au2 &r
shaor3 sh au3 &r
shaor4 sh au4 &r
shaor5 sh au5 &r
raor1 r au1 &r
raor2 r au2 &r
raor3 r au3 &r
raor4 r au4 &r
raor5 r au5 &r
zaor1 z au1 &r
zaor2 z au2 &r
zaor3 z au3 &r
zaor4 z au4 &r
zaor5 z au5 &r
caor1 c au1 &r
caor2 c au2 &r
caor3 c au3 &r
caor4 c au4 &r
caor5 c au5 &r
saor1 s au1 &r
saor2 s au2 &r
saor3 s au3 &r
saor4 s au4 &r
saor5 s au5 &r
our1 ou1 &r
our2 ou2 &r
our3 ou3 &r
our4 ou4 &r
our5 ou5 &r
pour1 p ou1 &r
pour2 p ou2 &r
pour3 p ou3 &r
pour4 p ou4 &r
pour5 p ou5 &r
mour1 m ou1 &r
mour2 m ou2 &r
mour3 m ou3 &r
mour4 m ou4 &r
mour5 m ou5 &r
four1 f ou1 &r
four2 f ou2 &r
four3 f ou3 &r
four4 f ou4 &r
four5 f ou5 &r
dour1 d ou1 &r
dour2 d ou2 &r
dour3 d ou3 &r
dour4 d ou4 &r
dour5 d ou5 &r
tour1 t ou1 &r
tour2 t ou2 &r
tour3 t ou3 &r
tour4 t ou4 &r
tour5 t ou5 &r
nour1 n ou1 &r
nour2 n ou2 &r
nour3 n ou3 &r
nour4 n ou4 &r
nour5 n ou5 &r
lour1 l ou1 &r
lour2 l ou2 &r
lour3 l ou3 &r
lour4 l ou4 &r
lour5 l ou5 &r
gour1 g ou1 &r
gour2 g ou2 &r
gour3 g ou3 &r
gour4 g ou4 &r
gour5 g ou5 &r
kour1 k ou1 &r
kour2 k ou2 &r
kour3 k ou3 &r
kour4 k ou4 &r
kour5 k ou5 &r
hour1 h ou1 &r
hour2 h ou2 &r
hour3 h ou3 &r
hour4 h ou4 &r
hour5 h ou5 &r
zhour1 zh ou1 &r
zhour2 zh ou2 &r
zhour3 zh ou3 &r
zhour4 zh ou4 &r
zhour5 zh ou5 &r
chour1 ch ou1 &r
chour2 ch ou2 &r
chour3 ch ou3 &r
chour4 ch ou4 &r
chour5 ch ou5 &r
shour1 sh ou1 &r
shour2 sh ou2 &r
shour3 sh ou3 &r
shour4 sh ou4 &r
shour5 sh ou5 &r
rour1 r ou1 &r
rour2 r ou2 &r
rour3 r ou3 &r
rour4 r ou4 &r
rour5 r ou5 &r
zour1 z ou1 &r
zour2 z ou2 &r
zour3 z ou3 &r
zour4 z ou4 &r
zour5 z ou5 &r
cour1 c ou1 &r
cour2 c ou2 &r
cour3 c ou3 &r
cour4 c ou4 &r
cour5 c ou5 &r
sour1 s ou1 &r
sour2 s ou2 &r
sour3 s ou3 &r
sour4 s ou4 &r
sour5 s ou5 &r
anr1 an1 &r
anr2 an2 &r
anr3 an3 &r
anr4 an4 &r
anr5 an5 &r
banr1 b an1 &r
banr2 b an2 &r
banr3 b an3 &r
banr4 b an4 &r
banr5 b an5 &r
panr1 p an1 &r
panr2 p an2 &r
panr3 p an3 &r
panr4 p an4 &r
panr5 p an5 &r
manr1 m an1 &r
manr2 m an2 &r
manr3 m an3 &r
manr4 m an4 &r
manr5 m an5 &r
fanr1 f an1 &r
fanr2 f an2 &r
fanr3 f an3 &r
fanr4 f an4 &r
fanr5 f an5 &r
danr1 d an1 &r
danr2 d an2 &r
danr3 d an3 &r
danr4 d an4 &r
danr5 d an5 &r
tanr1 t an1 &r
tanr2 t an2 &r
tanr3 t an3 &r
tanr4 t an4 &r
tanr5 t an5 &r
nanr1 n an1 &r
nanr2 n an2 &r
nanr3 n an3 &r
nanr4 n an4 &r
nanr5 n an5 &r
lanr1 l an1 &r
lanr2 l an2 &r
lanr3 l an3 &r
lanr4 l an4 &r
lanr5 l an5 &r
ganr1 g an1 &r
ganr2 g an2 &r
ganr3 g an3 &r
ganr4 g an4 &r
ganr5 g an5 &r
kanr1 k an1 &r
kanr2 k an2 &r
kanr3 k an3 &r
kanr4 k an4 &r
kanr5 k an5 &r
hanr1 h an1 &r
hanr2 h an2 &r
hanr3 h an3 &r
hanr4 h an4 &r
hanr5 h an5 &r
zhanr1 zh an1 &r
zhanr2 zh an2 &r
zhanr3 zh an3 &r
zhanr4 zh an4 &r
zhanr5 zh an5 &r
chanr1 ch an1 &r
chanr2 ch an2 &r
chanr3 ch an3 &r
chanr4 ch an4 &r
chanr5 ch an5 &r
shanr1 sh an1 &r
shanr2 sh an2 &r
shanr3 sh an3 &r
shanr4 sh an4 &r
shanr5 sh an5 &r
ranr1 r an1 &r
ranr2 r an2 &r
ranr3 r an3 &r
ranr4 r an4 &r
ranr5 r an5 &r
zanr1 z an1 &r
zanr2 z an2 &r
zanr3 z an3 &r
zanr4 z an4 &r
zanr5 z an5 &r
canr1 c an1 &r
canr2 c an2 &r
canr3 c an3 &r
canr4 c an4 &r
canr5 c an5 &r
sanr1 s an1 &r
sanr2 s an2 &r
sanr3 s an3 &r
sanr4 s an4 &r
sanr5 s an5 &r
benr1 b en1 &r
benr2 b en2 &r
benr3 b en3 &r
benr4 b en4 &r
benr5 b en5 &r
penr1 p en1 &r
penr2 p en2 &r
penr3 p en3 &r
penr4 p en4 &r
penr5 p en5 &r
menr1 m en1 &r
menr2 m en2 &r
menr3 m en3 &r
menr4 m en4 &r
menr5 m en5 &r
fenr1 f en1 &r
fenr2 f en2 &r
fenr3 f en3 &r
fenr4 f en4 &r
fenr5 f en5 &r
denr1 d en1 &r
denr2 d en2 &r
denr3 d en3 &r
denr4 d en4 &r
denr5 d en5 &r
nenr1 n en1 &r
nenr2 n en2 &r
nenr3 n en3 &r
nenr4 n en4 &r
nenr5 n en5 &r
genr1 g en1 &r
genr2 g en2 &r
genr3 g en3 &r
genr4 g en4 &r
genr5 g en5 &r
kenr1 k en1 &r
kenr2 k en2 &r
kenr3 k en3 &r
kenr4 k en4 &r
kenr5 k en5 &r
henr1 h en1 &r
henr2 h en2 &r
henr3 h en3 &r
henr4 h en4 &r
henr5 h en5 &r
zhenr1 zh en1 &r
zhenr2 zh en2 &r
zhenr3 zh en3 &r
zhenr4 zh en4 &r
zhenr5 zh en5 &r
chenr1 ch en1 &r
chenr2 ch en2 &r
chenr3 ch en3 &r
chenr4 ch en4 &r
chenr5 ch en5 &r
shenr1 sh en1 &r
shenr2 sh en2 &r
shenr3 sh en3 &r
shenr4 sh en4 &r
shenr5 sh en5 &r
renr1 r en1 &r
renr2 r en2 &r
renr3 r en3 &r
renr4 r en4 &r
renr5 r en5 &r
zenr1 z en1 &r
zenr2 z en2 &r
zenr3 z en3 &r
zenr4 z en4 &r
zenr5 z en5 &r
cenr1 c en1 &r
cenr2 c en2 &r
cenr3 c en3 &r
cenr4 c en4 &r
cenr5 c en5 &r
senr1 s en1 &r
senr2 s en2 &r
senr3 s en3 &r
senr4 s en4 &r
senr5 s en5 &r
angr1 ang1 &r
angr2 ang2 &r
angr3 ang3 &r
angr4 ang4 &r
angr5 ang5 &r
bangr1 b ang1 &r
bangr2 b ang2 &r
bangr3 b ang3 &r
bangr4 b ang4 &r
bangr5 b ang5 &r
pangr1 p ang1 &r
pangr2 p ang2 &r
pangr3 p ang3 &r
pangr4 p ang4 &r
pangr5 p ang5 &r
mangr1 m ang1 &r
mangr2 m ang2 &r
mangr3 m ang3 &r
mangr4 m ang4 &r
mangr5 m ang5 &r
fangr1 f ang1 &r
fangr2 f ang2 &r
fangr3 f ang3 &r
fangr4 f ang4 &r
fangr5 f ang5 &r
dangr1 d ang1 &r
dangr2 d ang2 &r
dangr3 d ang3 &r
dangr4 d ang4 &r
dangr5 d ang5 &r
tangr1 t ang1 &r
tangr2 t ang2 &r
tangr3 t ang3 &r
tangr4 t ang4 &r
tangr5 t ang5 &r
nangr1 n ang1 &r
nangr2 n ang2 &r
nangr3 n ang3 &r
nangr4 n ang4 &r
nangr5 n ang5 &r
langr1 l ang1 &r
langr2 l ang2 &r
langr3 l ang3 &r
langr4 l ang4 &r
langr5 l ang5 &r
gangr1 g ang1 &r
gangr2 g ang2 &r
gangr3 g ang3 &r
gangr4 g ang4 &r
gangr5 g ang5 &r
kangr1 k ang1 &r
kangr2 k ang2 &r
kangr3 k ang3 &r
kangr4 k ang4 &r
kangr5 k ang5 &r
hangr1 h ang1 &r
hangr2 h ang2 &r
hangr3 h ang3 &r
hangr4 h ang4 &r
hangr5 h ang5 &r
zhangr1 zh ang1 &r
zhangr2 zh ang2 &r
zhangr3 zh ang3 &r
zhangr4 zh ang4 &r
zhangr5 zh ang5 &r
changr1 ch ang1 &r
changr2 ch ang2 &r
changr3 ch ang3 &r
changr4 ch ang4 &r
changr5 ch ang5 &r
shangr1 sh ang1 &r
shangr2 sh ang2 &r
shangr3 sh ang3 &r
shangr4 sh ang4 &r
shangr5 sh ang5 &r
rangr1 r ang1 &r
rangr2 r ang2 &r
rangr3 r ang3 &r
rangr4 r ang4 &r
rangr5 r ang5 &r
zangr1 z ang1 &r
zangr2 z ang2 &r
zangr3 z ang3 &r
zangr4 z ang4 &r
zangr5 z ang5 &r
cangr1 c ang1 &r
cangr2 c ang2 &r
cangr3 c ang3 &r
cangr4 c ang4 &r
cangr5 c ang5 &r
sangr1 s ang1 &r
sangr2 s ang2 &r
sangr3 s ang3 &r
sangr4 s ang4 &r
sangr5 s ang5 &r
bengr1 b eng1 &r
bengr2 b eng2 &r
bengr3 b eng3 &r
bengr4 b eng4 &r
bengr5 b eng5 &r
pengr1 p eng1 &r
pengr2 p eng2 &r
pengr3 p eng3 &r
pengr4 p eng4 &r
pengr5 p eng5 &r
mengr1 m eng1 &r
mengr2 m eng2 &r
mengr3 m eng3 &r
mengr4 m eng4 &r
mengr5 m eng5 &r
fengr1 f eng1 &r
fengr2 f eng2 &r
fengr3 f eng3 &r
fengr4 f eng4 &r
fengr5 f eng5 &r
dengr1 d eng1 &r
dengr2 d eng2 &r
dengr3 d eng3 &r
dengr4 d eng4 &r
dengr5 d eng5 &r
tengr1 t eng1 &r
tengr2 t eng2 &r
tengr3 t eng3 &r
tengr4 t eng4 &r
tengr5 t eng5 &r
nengr1 n eng1 &r
nengr2 n eng2 &r
nengr3 n eng3 &r
nengr4 n eng4 &r
nengr5 n eng5 &r
lengr1 l eng1 &r
lengr2 l eng2 &r
lengr3 l eng3 &r
lengr4 l eng4 &r
lengr5 l eng5 &r
gengr1 g eng1 &r
gengr2 g eng2 &r
gengr3 g eng3 &r
gengr4 g eng4 &r
gengr5 g eng5 &r
kengr1 k eng1 &r
kengr2 k eng2 &r
kengr3 k eng3 &r
kengr4 k eng4 &r
kengr5 k eng5 &r
hengr1 h eng1 &r
hengr2 h eng2 &r
hengr3 h eng3 &r
hengr4 h eng4 &r
hengr5 h eng5 &r
zhengr1 zh eng1 &r
zhengr2 zh eng2 &r
zhengr3 zh eng3 &r
zhengr4 zh eng4 &r
zhengr5 zh eng5 &r
chengr1 ch eng1 &r
chengr2 ch eng2 &r
chengr3 ch eng3 &r
chengr4 ch eng4 &r
chengr5 ch eng5 &r
shengr1 sh eng1 &r
shengr2 sh eng2 &r
shengr3 sh eng3 &r
shengr4 sh eng4 &r
shengr5 sh eng5 &r
rengr1 r eng1 &r
rengr2 r eng2 &r
rengr3 r eng3 &r
rengr4 r eng4 &r
rengr5 r eng5 &r
zengr1 z eng1 &r
zengr2 z eng2 &r
zengr3 z eng3 &r
zengr4 z eng4 &r
zengr5 z eng5 &r
cengr1 c eng1 &r
cengr2 c eng2 &r
cengr3 c eng3 &r
cengr4 c eng4 &r
cengr5 c eng5 &r
sengr1 s eng1 &r
sengr2 s eng2 &r
sengr3 s eng3 &r
sengr4 s eng4 &r
sengr5 s eng5 &r
yir1 y i1 &r
yir2 y i2 &r
yir3 y i3 &r
yir4 y i4 &r
yir5 y i5 &r
bir1 b i1 &r
bir2 b i2 &r
bir3 b i3 &r
bir4 b i4 &r
bir5 b i5 &r
pir1 p i1 &r
pir2 p i2 &r
pir3 p i3 &r
pir4 p i4 &r
pir5 p i5 &r
mir1 m i1 &r
mir2 m i2 &r
mir3 m i3 &r
mir4 m i4 &r
mir5 m i5 &r
dir1 d i1 &r
dir2 d i2 &r
dir3 d i3 &r
dir4 d i4 &r
dir5 d i5 &r
tir1 t i1 &r
tir2 t i2 &r
tir3 t i3 &r
tir4 t i4 &r
tir5 t i5 &r
nir1 n i1 &r
nir2 n i2 &r
nir3 n i3 &r
nir4 n i4 &r
nir5 n i5 &r
lir1 l i1 &r
lir2 l i2 &r
lir3 l i3 &r
lir4 l i4 &r
lir5 l i5 &r
jir1 j i1 &r
jir2 j i2 &r
jir3 j i3 &r
jir4 j i4 &r
jir5 j i5 &r
qir1 q i1 &r
qir2 q i2 &r
qir3 q i3 &r
qir4 q i4 &r
qir5 q i5 &r
xir1 x i1 &r
xir2 x i2 &r
xir3 x i3 &r
xir4 x i4 &r
xir5 x i5 &r
yar1 y ia1 &r
yar2 y ia2 &r
yar3 y ia3 &r
yar4 y ia4 &r
yar5 y ia5 &r
diar1 d ia1 &r
diar2 d ia2 &r
diar3 d ia3 &r
diar4 d ia4 &r
diar5 d ia5 &r
liar1 l ia1 &r
liar2 l ia2 &r
liar3 l ia3 &r
liar4 l ia4 &r
liar5 l ia5 &r
jiar1 j ia1 &r
jiar2 j ia2 &r
jiar3 j ia3 &r
jiar4 j ia4 &r
jiar5 j ia5 &r
qiar1 q ia1 &r
qiar2 q ia2 &r
qiar3 q ia3 &r
qiar4 q ia4 &r
qiar5 q ia5 &r
xiar1 x ia1 &r
xiar2 x ia2 &r
xiar3 x ia3 &r
xiar4 x ia4 &r
xiar5 x ia5 &r
yor1 y io1 &r
yor2 y io2 &r
yor3 y io3 &r
yor4 y io4 &r
yor5 y io5 &r
yer1 y ie1 &r
yer2 y ie2 &r
yer3 y ie3 &r
yer4 y ie4 &r
yer5 y ie5 &r
bier1 b ie1 &r
bier2 b ie2 &r
bier3 b ie3 &r
bier4 b ie4 &r
bier5 b ie5 &r
pier1 p ie1 &r
pier2 p ie2 &r
pier3 p ie3 &r
pier4 p ie4 &r
pier5 p ie5 &r
mier1 m ie1 &r
mier2 m ie2 &r
mier3 m ie3 &r
mier4 m ie4 &r
mier5 m ie5 &r
dier1 d ie1 &r
dier2 d ie2 &r
dier3 d ie3 &r
dier4 d ie4 &r
dier5 d ie5 &r
tier1 t ie1 &r
tier2 t ie2 &r
tier3 t ie3 &r
tier4 t ie4 &r
tier5 t ie5 &r
nier1 n ie1 &r
nier2 n ie2 &r
nier3 n ie3 &r
nier4 n ie4 &r
nier5 n ie5 &r
lier1 l ie1 &r
lier2 l ie2 &r
lier3 l ie3 &r
lier4 l ie4 &r
lier5 l ie5 &r
jier1 j ie1 &r
jier2 j ie2 &r
jier3 j ie3 &r
jier4 j ie4 &r
jier5 j ie5 &r
qier1 q ie1 &r
qier2 q ie2 &r
qier3 q ie3 &r
qier4 q ie4 &r
qier5 q ie5 &r
xier1 x ie1 &r
xier2 x ie2 &r
xier3 x ie3 &r
xier4 x ie4 &r
xier5 x ie5 &r
yair1 y ai1 &r
yair2 y ai2 &r
yair3 y ai3 &r
yair4 y ai4 &r
yair5 y ai5 &r
yaor1 y au1 &r
yaor2 y au2 &r
yaor3 y au3 &r
yaor4 y au4 &r
yaor5 y au5 &r
biaor1 b iau1 &r
biaor2 b iau2 &r
biaor3 b iau3 &r
biaor4 b iau4 &r
biaor5 b iau5 &r
piaor1 p iau1 &r
piaor2 p iau2 &r
piaor3 p iau3 &r
piaor4 p iau4 &r
piaor5 p iau5 &r
miaor1 m iau1 &r
miaor2 m iau2 &r
miaor3 m iau3 &r
miaor4 m iau4 &r
miaor5 m iau5 &r
fiaor1 f iau1 &r
fiaor2 f iau2 &r
fiaor3 f iau3 &r
fiaor4 f iau4 &r
fiaor5 f iau5 &r
diaor1 d iau1 &r
diaor2 d iau2 &r
diaor3 d iau3 &r
diaor4 d iau4 &r
diaor5 d iau5 &r
tiaor1 t iau1 &r
tiaor2 t iau2 &r
tiaor3 t iau3 &r
tiaor4 t iau4 &r
tiaor5 t iau5 &r
niaor1 n iau1 &r
niaor2 n iau2 &r
niaor3 n iau3 &r
niaor4 n iau4 &r
niaor5 n iau5 &r
liaor1 l iau1 &r
liaor2 l iau2 &r
liaor3 l iau3 &r
liaor4 l iau4 &r
liaor5 l iau5 &r
jiaor1 j iau1 &r
jiaor2 j iau2 &r
jiaor3 j iau3 &r
jiaor4 j iau4 &r
jiaor5 j iau5 &r
qiaor1 q iau1 &r
qiaor2 q iau2 &r
qiaor3 q iau3 &r
qiaor4 q iau4 &r
qiaor5 q iau5 &r
xiaor1 x iau1 &r
xiaor2 x iau2 &r
xiaor3 x iau3 &r
xiaor4 x iau4 &r
xiaor5 x iau5 &r
your1 y iou1 &r
your2 y iou2 &r
your3 y iou3 &r
your4 y iou4 &r
your5 y iou5 &r
miur1 m iou1 &r
miur2 m iou2 &r
miur3 m iou3 &r
miur4 m iou4 &r
miur5 m iou5 &r
diur1 d iou1 &r
diur2 d iou2 &r
diur3 d iou3 &r
diur4 d iou4 &r
diur5 d iou5 &r
niur1 n iou1 &r
niur2 n iou2 &r
niur3 n iou3 &r
niur4 n iou4 &r
niur5 n iou5 &r
liur1 l iou1 &r
liur2 l iou2 &r
liur3 l iou3 &r
liur4 l iou4 &r
liur5 l iou5 &r
jiur1 j iou1 &r
jiur2 j iou2 &r
jiur3 j iou3 &r
jiur4 j iou4 &r
jiur5 j iou5 &r
qiur1 q iou1 &r
qiur2 q iou2 &r
qiur3 q iou3 &r
qiur4 q iou4 &r
qiur5 q iou5 &r
xiur1 x iou1 &r
xiur2 x iou2 &r
xiur3 x iou3 &r
xiur4 x iou4 &r
xiur5 x iou5 &r
yanr1 y ian1 &r
yanr2 y ian2 &r
yanr3 y ian3 &r
yanr4 y ian4 &r
yanr5 y ian5 &r
bianr1 b ian1 &r
bianr2 b ian2 &r
bianr3 b ian3 &r
bianr4 b ian4 &r
bianr5 b ian5 &r
pianr1 p ian1 &r
pianr2 p ian2 &r
pianr3 p ian3 &r
pianr4 p ian4 &r
pianr5 p ian5 &r
mianr1 m ian1 &r
mianr2 m ian2 &r
mianr3 m ian3 &r
mianr4 m ian4 &r
mianr5 m ian5 &r
dianr1 d ian1 &r
dianr2 d ian2 &r
dianr3 d ian3 &r
dianr4 d ian4 &r
dianr5 d ian5 &r
tianr1 t ian1 &r
tianr2 t ian2 &r
tianr3 t ian3 &r
tianr4 t ian4 &r
tianr5 t ian5 &r
nianr1 n ian1 &r
nianr2 n ian2 &r
nianr3 n ian3 &r
nianr4 n ian4 &r
nianr5 n ian5 &r
lianr1 l ian1 &r
lianr2 l ian2 &r
lianr3 l ian3 &r
lianr4 l ian4 &r
lianr5 l ian5 &r
jianr1 j ian1 &r
jianr2 j ian2 &r
jianr3 j ian3 &r
jianr4 j ian4 &r
jianr5 j ian5 &r
qianr1 q ian1 &r
qianr2 q ian2 &r
qianr3 q ian3 &r
qianr4 q ian4 &r
qianr5 q ian5 &r
xianr1 x ian1 &r
xianr2 x ian2 &r
xianr3 x ian3 &r
xianr4 x ian4 &r
xianr5 x ian5 &r
yinr1 y in1 &r
yinr2 y in2 &r
yinr3 y in3 &r
yinr4 y in4 &r
yinr5 y in5 &r
binr1 b in1 &r
binr2 b in2 &r
binr3 b in3 &r
binr4 b in4 &r
binr5 b in5 &r
pinr1 p in1 &r
pinr2 p in2 &r
pinr3 p in3 &r
pinr4 p in4 &r
pinr5 p in5 &r
minr1 m in1 &r
minr2 m in2 &r
minr3 m in3 &r
minr4 m in4 &r
minr5 m in5 &r
dinr1 d in1 &r
dinr2 d in2 &r
dinr3 d in3 &r
dinr4 d in4 &r
dinr5 d in5 &r
ninr1 n in1 &r
ninr2 n in2 &r
ninr3 n in3 &r
ninr4 n in4 &r
ninr5 n in5 &r
linr1 l in1 &r
linr2 l in2 &r
linr3 l in3 &r
linr4 l in4 &r
linr5 l in5 &r
jinr1 j in1 &r
jinr2 j in2 &r
jinr3 j in3 &r
jinr4 j in4 &r
jinr5 j in5 &r
qinr1 q in1 &r
qinr2 q in2 &r
qinr3 q in3 &r
qinr4 q in4 &r
qinr5 q in5 &r
xinr1 x in1 &r
xinr2 x in2 &r
xinr3 x in3 &r
xinr4 x in4 &r
xinr5 x in5 &r
yangr1 y iang1 &r
yangr2 y iang2 &r
yangr3 y iang3 &r
yangr4 y iang4 &r
yangr5 y iang5 &r
biangr1 b iang1 &r
biangr2 b iang2 &r
biangr3 b iang3 &r
biangr4 b iang4 &r
biangr5 b iang5 &r
niangr1 n iang1 &r
niangr2 n iang2 &r
niangr3 n iang3 &r
niangr4 n iang4 &r
niangr5 n iang5 &r
liangr1 l iang1 &r
liangr2 l iang2 &r
liangr3 l iang3 &r
liangr4 l iang4 &r
liangr5 l iang5 &r
jiangr1 j iang1 &r
jiangr2 j iang2 &r
jiangr3 j iang3 &r
jiangr4 j iang4 &r
jiangr5 j iang5 &r
qiangr1 q iang1 &r
qiangr2 q iang2 &r
qiangr3 q iang3 &r
qiangr4 q iang4 &r
qiangr5 q iang5 &r
xiangr1 x iang1 &r
xiangr2 x iang2 &r
xiangr3 x iang3 &r
xiangr4 x iang4 &r
xiangr5 x iang5 &r
yingr1 y ing1 &r
yingr2 y ing2 &r
yingr3 y ing3 &r
yingr4 y ing4 &r
yingr5 y ing5 &r
bingr1 b ing1 &r
bingr2 b ing2 &r
bingr3 b ing3 &r
bingr4 b ing4 &r
bingr5 b ing5 &r
pingr1 p ing1 &r
pingr2 p ing2 &r
pingr3 p ing3 &r
pingr4 p ing4 &r
pingr5 p ing5 &r
mingr1 m ing1 &r
mingr2 m ing2 &r
mingr3 m ing3 &r
mingr4 m ing4 &r
mingr5 m ing5 &r
dingr1 d ing1 &r
dingr2 d ing2 &r
dingr3 d ing3 &r
dingr4 d ing4 &r
dingr5 d ing5 &r
tingr1 t ing1 &r
tingr2 t ing2 &r
tingr3 t ing3 &r
tingr4 t ing4 &r
tingr5 t ing5 &r
ningr1 n ing1 &r
ningr2 n ing2 &r
ningr3 n ing3 &r
ningr4 n ing4 &r
ningr5 n ing5 &r
lingr1 l ing1 &r
lingr2 l ing2 &r
lingr3 l ing3 &r
lingr4 l ing4 &r
lingr5 l ing5 &r
jingr1 j ing1 &r
jingr2 j ing2 &r
jingr3 j ing3 &r
jingr4 j ing4 &r
jingr5 j ing5 &r
qingr1 q ing1 &r
qingr2 q ing2 &r
qingr3 q ing3 &r
qingr4 q ing4 &r
qingr5 q ing5 &r
xingr1 x ing1 &r
xingr2 x ing2 &r
xingr3 x ing3 &r
xingr4 x ing4 &r
xingr5 x ing5 &r
wur1 w u1 &r
wur2 w u2 &r
wur3 w u3 &r
wur4 w u4 &r
wur5 w u5 &r
bur1 b u1 &r
bur2 b u2 &r
bur3 b u3 &r
bur4 b u4 &r
bur5 b u5 &r
pur1 p u1 &r
pur2 p u2 &r
pur3 p u3 &r
pur4 p u4 &r
pur5 p u5 &r
mur1 m u1 &r
mur2 m u2 &r
mur3 m u3 &r
mur4 m u4 &r
mur5 m u5 &r
fur1 f u1 &r
fur2 f u2 &r
fur3 f u3 &r
fur4 f u4 &r
fur5 f u5 &r
dur1 d u1 &r
dur2 d u2 &r
dur3 d u3 &r
dur4 d u4 &r
dur5 d u5 &r
tur1 t u1 &r
tur2 t u2 &r
tur3 t u3 &r
tur4 t u4 &r
tur5 t u5 &r
nur1 n u1 &r
nur2 n u2 &r
nur3 n u3 &r
nur4 n u4 &r
nur5 n u5 &r
lur1 l u1 &r
lur2 l u2 &r
lur3 l u3 &r
lur4 l u4 &r
lur5 l u5 &r
gur1 g u1 &r
gur2 g u2 &r
gur3 g u3 &r
gur4 g u4 &r
gur5 g u5 &r
kur1 k u1 &r
kur2 k u2 &r
kur3 k u3 &r
kur4 k u4 &r
kur5 k u5 &r
hur1 h u1 &r
hur2 h u2 &r
hur3 h u3 &r
hur4 h u4 &r
hur5 h u5 &r
zhur1 zh u1 &r
zhur2 zh u2 &r
zhur3 zh u3 &r
zhur4 zh u4 &r
zhur5 zh u5 &r
chur1 ch u1 &r
chur2 ch u2 &r
chur3 ch u3 &r
chur4 ch u4 &r
chur5 ch u5 &r
shur1 sh u1 &r
shur2 sh u2 &r
shur3 sh u3 &r
shur4 sh u4 &r
shur5 sh u5 &r
rur1 r u1 &r
rur2 r u2 &r
rur3 r u3 &r
rur4 r u4 &r
rur5 r u5 &r
zur1 z u1 &r
zur2 z u2 &r
zur3 z u3 &r
zur4 z u4 &r
zur5 z u5 &r
cur1 c u1 &r
cur2 c u2 &r
cur3 c u3 &r
cur4 c u4 &r
cur5 c u5 &r
sur1 s u1 &r
sur2 s u2 &r
sur3 s u3 &r
sur4 s u4 &r
sur5 s u5 &r
war1 w ua1 &r
war2 w ua2 &r
war3 w ua3 &r
war4 w ua4 &r
war5 w ua5 &r
guar1 g ua1 &r
guar2 g ua2 &r
guar3 g ua3 &r
guar4 g ua4 &r
guar5 g ua5 &r
kuar1 k ua1 &r
kuar2 k ua2 &r
kuar3 k ua3 &r
kuar4 k ua4 &r
kuar5 k ua5 &r
huar1 h ua1 &r
huar2 h ua2 &r
huar3 h ua3 &r
huar4 h ua4 &r
huar5 h ua5 &r
zhuar1 zh ua1 &r
zhuar2 zh ua2 &r
zhuar3 zh ua3 &r
zhuar4 zh ua4 &r
zhuar5 zh ua5 &r
chuar1 ch ua1 &r
chuar2 ch ua2 &r
chuar3 ch ua3 &r
chuar4 ch ua4 &r
chuar5 ch ua5 &r
shuar1 sh ua1 &r
shuar2 sh ua2 &r
shuar3 sh ua3 &r
shuar4 sh ua4 &r
shuar5 sh ua5 &r
wor1 w uo1 &r
wor2 w uo2 &r
wor3 w uo3 &r
wor4 w uo4 &r
wor5 w uo5 &r
duor1 d uo1 &r
duor2 d uo2 &r
duor3 d uo3 &r
duor4 d uo4 &r
duor5 d uo5 &r
tuor1 t uo1 &r
tuor2 t uo2 &r
tuor3 t uo3 &r
tuor4 t uo4 &r
tuor5 t uo5 &r
nuor1 n uo1 &r
nuor2 n uo2 &r
nuor3 n uo3 &r
nuor4 n uo4 &r
nuor5 n uo5 &r
luor1 l uo1 &r
luor2 l uo2 &r
luor3 l uo3 &r
luor4 l uo4 &r
luor5 l uo5 &r
guor1 g uo1 &r
guor2 g uo2 &r
guor3 g uo3 &r
guor4 g uo4 &r
guor5 g uo5 &r
kuor1 k uo1 &r
kuor2 k uo2 &r
kuor3 k uo3 &r
kuor4 k uo4 &r
kuor5 k uo5 &r
huor1 h uo1 &r
huor2 h uo2 &r
huor3 h uo3 &r
huor4 h uo4 &r
huor5 h uo5 &r
zhuor1 zh uo1 &r
zhuor2 zh uo2 &r
zhuor3 zh uo3 &r
zhuor4 zh uo4 &r
zhuor5 zh uo5 &r
chuor1 ch uo1 &r
chuor2 ch uo2 &r
chuor3 ch uo3 &r
chuor4 ch uo4 &r
chuor5 ch uo5 &r
shuor1 sh uo1 &r
shuor2 sh uo2 &r
shuor3 sh uo3 &r
shuor4 sh uo4 &r
shuor5 sh uo5 &r
ruor1 r uo1 &r
ruor2 r uo2 &r
ruor3 r uo3 &r
ruor4 r uo4 &r
ruor5 r uo5 &r
zuor1 z uo1 &r
zuor2 z uo2 &r
zuor3 z uo3 &r
zuor4 z uo4 &r
zuor5 z uo5 &r
cuor1 c uo1 &r
cuor2 c uo2 &r
cuor3 c uo3 &r
cuor4 c uo4 &r
cuor5 c uo5 &r
suor1 s uo1 &r
suor2 s uo2 &r
suor3 s uo3 &r
suor4 s uo4 &r
suor5 s uo5 &r
wair1 w uai1 &r
wair2 w uai2 &r
wair3 w uai3 &r
wair4 w uai4 &r
wair5 w uai5 &r
guair1 g uai1 &r
guair2 g uai2 &r
guair3 g uai3 &r
guair4 g uai4 &r
guair5 g uai5 &r
kuair1 k uai1 &r
kuair2 k uai2 &r
kuair3 k uai3 &r
kuair4 k uai4 &r
kuair5 k uai5 &r
huair1 h uai1 &r
huair2 h uai2 &r
huair3 h uai3 &r
huair4 h uai4 &r
huair5 h uai5 &r
zhuair1 zh uai1 &r
zhuair2 zh uai2 &r
zhuair3 zh uai3 &r
zhuair4 zh uai4 &r
zhuair5 zh uai5 &r
chuair1 ch uai1 &r
chuair2 ch uai2 &r
chuair3 ch uai3 &r
chuair4 ch uai4 &r
chuair5 ch uai5 &r
shuair1 sh uai1 &r
shuair2 sh uai2 &r
shuair3 sh uai3 &r
shuair4 sh uai4 &r
shuair5 sh uai5 &r
weir1 w uei1 &r
weir2 w uei2 &r
weir3 w uei3 &r
weir4 w uei4 &r
weir5 w uei5 &r
duir1 d uei1 &r
duir2 d uei2 &r
duir3 d uei3 &r
duir4 d uei4 &r
duir5 d uei5 &r
tuir1 t uei1 &r
tuir2 t uei2 &r
tuir3 t uei3 &r
tuir4 t uei4 &r
tuir5 t uei5 &r
guir1 g uei1 &r
guir2 g uei2 &r
guir3 g uei3 &r
guir4 g uei4 &r
guir5 g uei5 &r
kuir1 k uei1 &r
kuir2 k uei2 &r
kuir3 k uei3 &r
kuir4 k uei4 &r
kuir5 k uei5 &r
huir1 h uei1 &r
huir2 h uei2 &r
huir3 h uei3 &r
huir4 h uei4 &r
huir5 h uei5 &r
zhuir1 zh uei1 &r
zhuir2 zh uei2 &r
zhuir3 zh uei3 &r
zhuir4 zh uei4 &r
zhuir5 zh uei5 &r
chuir1 ch uei1 &r
chuir2 ch uei2 &r
chuir3 ch uei3 &r
chuir4 ch uei4 &r
chuir5 ch uei5 &r
shuir1 sh uei1 &r
shuir2 sh uei2 &r
shuir3 sh uei3 &r
shuir4 sh uei4 &r
shuir5 sh uei5 &r
ruir1 r uei1 &r
ruir2 r uei2 &r
ruir3 r uei3 &r
ruir4 r uei4 &r
ruir5 r uei5 &r
zuir1 z uei1 &r
zuir2 z uei2 &r
zuir3 z uei3 &r
zuir4 z uei4 &r
zuir5 z uei5 &r
cuir1 c uei1 &r
cuir2 c uei2 &r
cuir3 c uei3 &r
cuir4 c uei4 &r
cuir5 c uei5 &r
suir1 s uei1 &r
suir2 s uei2 &r
suir3 s uei3 &r
suir4 s uei4 &r
suir5 s uei5 &r
wanr1 w uan1 &r
wanr2 w uan2 &r
wanr3 w uan3 &r
wanr4 w uan4 &r
wanr5 w uan5 &r
duanr1 d uan1 &r
duanr2 d uan2 &r
duanr3 d uan3 &r
duanr4 d uan4 &r
duanr5 d uan5 &r
tuanr1 t uan1 &r
tuanr2 t uan2 &r
tuanr3 t uan3 &r
tuanr4 t uan4 &r
tuanr5 t uan5 &r
nuanr1 n uan1 &r
nuanr2 n uan2 &r
nuanr3 n uan3 &r
nuanr4 n uan4 &r
nuanr5 n uan5 &r
luanr1 l uan1 &r
luanr2 l uan2 &r
luanr3 l uan3 &r
luanr4 l uan4 &r
luanr5 l uan5 &r
guanr1 g uan1 &r
guanr2 g uan2 &r
guanr3 g uan3 &r
guanr4 g uan4 &r
guanr5 g uan5 &r
kuanr1 k uan1 &r
kuanr2 k uan2 &r
kuanr3 k uan3 &r
kuanr4 k uan4 &r
kuanr5 k uan5 &r
huanr1 h uan1 &r
huanr2 h uan2 &r
huanr3 h uan3 &r
huanr4 h uan4 &r
huanr5 h uan5 &r
zhuanr1 zh uan1 &r
zhuanr2 zh uan2 &r
zhuanr3 zh uan3 &r
zhuanr4 zh uan4 &r
zhuanr5 zh uan5 &r
chuanr1 ch uan1 &r
chuanr2 ch uan2 &r
chuanr3 ch uan3 &r
chuanr4 ch uan4 &r
chuanr5 ch uan5 &r
shuanr1 sh uan1 &r
shuanr2 sh uan2 &r
shuanr3 sh uan3 &r
shuanr4 sh uan4 &r
shuanr5 sh uan5 &r
ruanr1 r uan1 &r
ruanr2 r uan2 &r
ruanr3 r uan3 &r
ruanr4 r uan4 &r
ruanr5 r uan5 &r
zuanr1 z uan1 &r
zuanr2 z uan2 &r
zuanr3 z uan3 &r
zuanr4 z uan4 &r
zuanr5 z uan5 &r
cuanr1 c uan1 &r
cuanr2 c uan2 &r
cuanr3 c uan3 &r
cuanr4 c uan4 &r
cuanr5 c uan5 &r
suanr1 s uan1 &r
suanr2 s uan2 &r
suanr3 s uan3 &r
suanr4 s uan4 &r
suanr5 s uan5 &r
wenr1 w uen1 &r
wenr2 w uen2 &r
wenr3 w uen3 &r
wenr4 w uen4 &r
wenr5 w uen5 &r
dunr1 d uen1 &r
dunr2 d uen2 &r
dunr3 d uen3 &r
dunr4 d uen4 &r
dunr5 d uen5 &r
tunr1 t uen1 &r
tunr2 t uen2 &r
tunr3 t uen3 &r
tunr4 t uen4 &r
tunr5 t uen5 &r
nunr1 n uen1 &r
nunr2 n uen2 &r
nunr3 n uen3 &r
nunr4 n uen4 &r
nunr5 n uen5 &r
lunr1 l uen1 &r
lunr2 l uen2 &r
lunr3 l uen3 &r
lunr4 l uen4 &r
lunr5 l uen5 &r
gunr1 g uen1 &r
gunr2 g uen2 &r
gunr3 g uen3 &r
gunr4 g uen4 &r
gunr5 g uen5 &r
kunr1 k uen1 &r
kunr2 k uen2 &r
kunr3 k uen3 &r
kunr4 k uen4 &r
kunr5 k uen5 &r
hunr1 h uen1 &r
hunr2 h uen2 &r
hunr3 h uen3 &r
hunr4 h uen4 &r
hunr5 h uen5 &r
zhunr1 zh uen1 &r
zhunr2 zh uen2 &r
zhunr3 zh uen3 &r
zhunr4 zh uen4 &r
zhunr5 zh uen5 &r
chunr1 ch uen1 &r
chunr2 ch uen2 &r
chunr3 ch uen3 &r
chunr4 ch uen4 &r
chunr5 ch uen5 &r
shunr1 sh uen1 &r
shunr2 sh uen2 &r
shunr3 sh uen3 &r
shunr4 sh uen4 &r
shunr5 sh uen5 &r
runr1 r uen1 &r
runr2 r uen2 &r
runr3 r uen3 &r
runr4 r uen4 &r
runr5 r uen5 &r
zunr1 z uen1 &r
zunr2 z uen2 &r
zunr3 z uen3 &r
zunr4 z uen4 &r
zunr5 z uen5 &r
cunr1 c uen1 &r
cunr2 c uen2 &r
cunr3 c uen3 &r
cunr4 c uen4 &r
cunr5 c uen5 &r
sunr1 s uen1 &r
sunr2 s uen2 &r
sunr3 s uen3 &r
sunr4 s uen4 &r
sunr5 s uen5 &r
wangr1 w uang1 &r
wangr2 w uang2 &r
wangr3 w uang3 &r
wangr4 w uang4 &r
wangr5 w uang5 &r
guangr1 g uang1 &r
guangr2 g uang2 &r
guangr3 g uang3 &r
guangr4 g uang4 &r
guangr5 g uang5 &r
kuangr1 k uang1 &r
kuangr2 k uang2 &r
kuangr3 k uang3 &r
kuangr4 k uang4 &r
kuangr5 k uang5 &r
huangr1 h uang1 &r
huangr2 h uang2 &r
huangr3 h uang3 &r
huangr4 h uang4 &r
huangr5 h uang5 &r
zhuangr1 zh uang1 &r
zhuangr2 zh uang2 &r
zhuangr3 zh uang3 &r
zhuangr4 zh uang4 &r
zhuangr5 zh uang5 &r
chuangr1 ch uang1 &r
chuangr2 ch uang2 &r
chuangr3 ch uang3 &r
chuangr4 ch uang4 &r
chuangr5 ch uang5 &r
shuangr1 sh uang1 &r
shuangr2 sh uang2 &r
shuangr3 sh uang3 &r
shuangr4 sh uang4 &r
shuangr5 sh uang5 &r
wengr1 w ung1 &r
wengr2 w ung2 &r
wengr3 w ung3 &r
wengr4 w ung4 &r
wengr5 w ung5 &r
dongr1 d ung1 &r
dongr2 d ung2 &r
dongr3 d ung3 &r
dongr4 d ung4 &r
dongr5 d ung5 &r
tongr1 t ung1 &r
tongr2 t ung2 &r
tongr3 t ung3 &r
tongr4 t ung4 &r
tongr5 t ung5 &r
nongr1 n ung1 &r
nongr2 n ung2 &r
nongr3 n ung3 &r
nongr4 n ung4 &r
nongr5 n ung5 &r
longr1 l ung1 &r
longr2 l ung2 &r
longr3 l ung3 &r
longr4 l ung4 &r
longr5 l ung5 &r
gongr1 g ung1 &r
gongr2 g ung2 &r
gongr3 g ung3 &r
gongr4 g ung4 &r
gongr5 g ung5 &r
kongr1 k ung1 &r
kongr2 k ung2 &r
kongr3 k ung3 &r
kongr4 k ung4 &r
kongr5 k ung5 &r
hongr1 h ung1 &r
hongr2 h ung2 &r
hongr3 h ung3 &r
hongr4 h ung4 &r
hongr5 h ung5 &r
zhongr1 zh ung1 &r
zhongr2 zh ung2 &r
zhongr3 zh ung3 &r
zhongr4 zh ung4 &r
zhongr5 zh ung5 &r
chongr1 ch ung1 &r
chongr2 ch ung2 &r
chongr3 ch ung3 &r
chongr4 ch ung4 &r
chongr5 ch ung5 &r
rongr1 r ung1 &r
rongr2 r ung2 &r
rongr3 r ung3 &r
rongr4 r ung4 &r
rongr5 r ung5 &r
zongr1 z ung1 &r
zongr2 z ung2 &r
zongr3 z ung3 &r
zongr4 z ung4 &r
zongr5 z ung5 &r
congr1 c ung1 &r
congr2 c ung2 &r
congr3 c ung3 &r
congr4 c ung4 &r
congr5 c ung5 &r
songr1 s ung1 &r
songr2 s ung2 &r
songr3 s ung3 &r
songr4 s ung4 &r
songr5 s ung5 &r
yur1 y v1 &r
yur2 y v2 &r
yur3 y v3 &r
yur4 y v4 &r
yur5 y v5 &r
nvr1 n v1 &r
nvr2 n v2 &r
nvr3 n v3 &r
nvr4 n v4 &r
nvr5 n v5 &r
lvr1 l v1 &r
lvr2 l v2 &r
lvr3 l v3 &r
lvr4 l v4 &r
lvr5 l v5 &r
jur1 j v1 &r
jur2 j v2 &r
jur3 j v3 &r
jur4 j v4 &r
jur5 j v5 &r
qur1 q v1 &r
qur2 q v2 &r
qur3 q v3 &r
qur4 q v4 &r
qur5 q v5 &r
xur1 x v1 &r
xur2 x v2 &r
xur3 x v3 &r
xur4 x v4 &r
xur5 x v5 &r
yuer1 y ve1 &r
yuer2 y ve2 &r
yuer3 y ve3 &r
yuer4 y ve4 &r
yuer5 y ve5 &r
nuer1 n ve1 &r
nuer2 n ve2 &r
nuer3 n ve3 &r
nuer4 n ve4 &r
nuer5 n ve5 &r
nver1 n ve1 &r
nver2 n ve2 &r
nver3 n ve3 &r
nver4 n ve4 &r
nver5 n ve5 &r
luer1 l ve1 &r
luer2 l ve2 &r
luer3 l ve3 &r
luer4 l ve4 &r
luer5 l ve5 &r
lver1 l ve1 &r
lver2 l ve2 &r
lver3 l ve3 &r
lver4 l ve4 &r
lver5 l ve5 &r
juer1 j ve1 &r
juer2 j ve2 &r
juer3 j ve3 &r
juer4 j ve4 &r
juer5 j ve5 &r
quer1 q ve1 &r
quer2 q ve2 &r
quer3 q ve3 &r
quer4 q ve4 &r
quer5 q ve5 &r
xuer1 x ve1 &r
xuer2 x ve2 &r
xuer3 x ve3 &r
xuer4 x ve4 &r
xuer5 x ve5 &r
yuanr1 y van1 &r
yuanr2 y van2 &r
yuanr3 y van3 &r
yuanr4 y van4 &r
yuanr5 y van5 &r
juanr1 j van1 &r
juanr2 j van2 &r
juanr3 j van3 &r
juanr4 j van4 &r
juanr5 j van5 &r
quanr1 q van1 &r
quanr2 q van2 &r
quanr3 q van3 &r
quanr4 q van4 &r
quanr5 q van5 &r
xuanr1 x van1 &r
xuanr2 x van2 &r
xuanr3 x van3 &r
xuanr4 x van4 &r
xuanr5 x van5 &r
yunr1 y vn1 &r
yunr2 y vn2 &r
yunr3 y vn3 &r
yunr4 y vn4 &r
yunr5 y vn5 &r
junr1 j vn1 &r
junr2 j vn2 &r
junr3 j vn3 &r
junr4 j vn4 &r
junr5 j vn5 &r
qunr1 q vn1 &r
qunr2 q vn2 &r
qunr3 q vn3 &r
qunr4 q vn4 &r
qunr5 q vn5 &r
xunr1 x vn1 &r
xunr2 x vn2 &r
xunr3 x vn3 &r
xunr4 x vn4 &r
xunr5 x vn5 &r
yongr1 y vng1 &r
yongr2 y vng2 &r
yongr3 y vng3 &r
yongr4 y vng4 &r
yongr5 y vng5 &r
jiongr1 j vng1 &r
jiongr2 j vng2 &r
jiongr3 j vng3 &r
jiongr4 j vng4 &r
jiongr5 j vng5 &r
qiongr1 q vng1 &r
qiongr2 q vng2 &r
qiongr3 q vng3 &r
qiongr4 q vng4 &r
qiongr5 q vng5 &r
xiongr1 x vng1 &r
xiongr2 x vng2 &r
xiongr3 x vng3 &r
xiongr4 x vng4 &r
xiongr5 x vng5 &r
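The table above is a pinyin-to-phone lexicon: each row maps a tone-numbered pinyin syllable to an optional initial, a final that carries the tone digit (1-5, where 5 is the neutral tone), and an optional trailing "&r" marker for erhua (rhotacized) syllables. A minimal parsing sketch follows; the function name parse_lexicon_line is illustrative and not part of the source files.
def parse_lexicon_line(line: str):
    """Parse one lexicon row, e.g. "zhuang4 zh uang4" or "huar2 h ua2 &r".

    Returns (syllable, phones, is_erhua); the tone digit stays on the final.
    """
    fields = line.split()
    syllable, rest = fields[0], fields[1:]
    is_erhua = rest[-1] == "&r"
    phones = rest[:-1] if is_erhua else rest
    return syllable, phones, is_erhua
# parse_lexicon_line("zhuang4 zh uang4") -> ("zhuang4", ["zh", "uang4"], False)
# parse_lexicon_line("huar2 h ua2 &r")   -> ("huar2", ["h", "ua2"], True)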
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import pickle
import re
from pathlib import Path
import tqdm
import yaml
zh_pattern = re.compile("[\u4e00-\u9fa5]")
_tones = {'<pad>', '<s>', '</s>', '0', '1', '2', '3', '4', '5'}
_pauses = {'%', '$'}
_initials = {
'b',
'p',
'm',
'f',
'd',
't',
'n',
'l',
'g',
'k',
'h',
'j',
'q',
'x',
'zh',
'ch',
'sh',
'r',
'z',
'c',
's',
}
_finals = {
'ii',
'iii',
'a',
'o',
'e',
'ea',
'ai',
'ei',
'ao',
'ou',
'an',
'en',
'ang',
'eng',
'er',
'i',
'ia',
'io',
'ie',
'iai',
'iao',
'iou',
'ian',
'ien',
'iang',
'ieng',
'u',
'ua',
'uo',
'uai',
'uei',
'uan',
'uen',
'uang',
'ueng',
'v',
've',
'van',
'ven',
'veng',
}
_ernized_symbol = {'&r'}
_specials = {'<pad>', '<unk>', '<s>', '</s>'}
_phones = _initials | _finals | _ernized_symbol | _specials | _pauses
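# Note on the phone inventory above (descriptive, inferred from the lexicon and
# the conversion rules below): 'ii' and 'iii' are the apical vowels written "i"
# after z/c/s and zh/ch/sh/r, the 'v'-series finals ('v', 've', 'van', 'ven',
# 'veng') stand for the ü finals, '&r' marks an erhua (rhotacized) ending, and
# '%' / '$' are the pause symbols used in the AISHELL-3 prosody labels.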
def is_zh(word):
global zh_pattern
match = zh_pattern.search(word)
return match is not None
def ernized(syllable):
return syllable[:2] != "er" and syllable[-2] == 'r'
def convert(syllable):
# expansion of o -> uo
syllable = re.sub(r"([bpmf])o$", r"\1uo", syllable)
# syllable = syllable.replace("bo", "buo").replace("po", "puo").replace("mo", "muo").replace("fo", "fuo")
# expansion for iong, ong
syllable = syllable.replace("iong", "veng").replace("ong", "ueng")
# expansion for ing, in
syllable = syllable.replace("ing", "ieng").replace("in", "ien")
# expansion for un, ui, iu
syllable = syllable.replace("un", "uen").replace("ui",
"uei").replace("iu", "iou")
# rule for variants of i
syllable = syllable.replace("zi", "zii").replace("ci", "cii").replace("si", "sii")\
.replace("zhi", "zhiii").replace("chi", "chiii").replace("shi", "shiii")\
.replace("ri", "riii")
# rule for y preceding i, u
syllable = syllable.replace("yi", "i").replace("yu", "v").replace("y", "i")
# rule for w
syllable = syllable.replace("wu", "u").replace("w", "u")
# rule for v following j, q, x
syllable = syllable.replace("ju", "jv").replace("qu",
"qv").replace("xu", "xv")
return syllable
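# A few worked traces of convert() on tone-stripped syllables (illustrative,
# derived from the replacement rules above rather than taken from the source):
#   convert("bo")    -> "buo"    # o -> uo after b/p/m/f
#   convert("xiong") -> "xveng"  # iong -> veng
#   convert("xin")   -> "xien"   # in -> ien
#   convert("shui")  -> "shuei"  # ui -> uei
#   convert("yu")    -> "v"      # y/yu handling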
def split_syllable(syllable: str):
"""Split a syllable in pinyin into a list of phones and a list of tones.
Initials have no tone, represented by '0', while finals have tones from
'1,2,3,4,5'.
e.g.
zhang1 -> ['zh', 'ang'], ['0', '1']
"""
if syllable in _pauses:
# syllable, tone
return [syllable], ['0']
tone = syllable[-1]
syllable = convert(syllable[:-1])
phones = []
tones = []
global _initials
if syllable[:2] in _initials:
phones.append(syllable[:2])
tones.append('0')
phones.append(syllable[2:])
tones.append(tone)
elif syllable[0] in _initials:
phones.append(syllable[0])
tones.append('0')
phones.append(syllable[1:])
tones.append(tone)
else:
phones.append(syllable)
tones.append(tone)
return phones, tones
def load_aishell3_transcription(line: str):
sentence_id, pinyin, text = line.strip().split("|")
syllables = pinyin.strip().split()
results = []
for syllable in syllables:
if syllable in _pauses:
results.append(syllable)
elif not ernized(syllable):
results.append(syllable)
else:
results.append(syllable[:-2] + syllable[-1])
results.append('&r5')
phones = []
tones = []
for syllable in results:
p, t = split_syllable(syllable)
phones.extend(p)
tones.extend(t)
for p in phones:
assert p in _phones, p
return {
"sentence_id": sentence_id,
"text": text,
"syllables": results,
"phones": phones,
"tones": tones
}
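# Illustrative note (not from the source): an ernized syllable such as "huar4"
# is rewritten above to "hua4" plus a separate "&r5" token, so the r-coloring
# becomes its own phone "&r" with neutral tone, while "er2" itself is kept as a
# plain syllable because ernized() excludes syllables that start with "er".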
def process_aishell3(dataset_root, output_dir):
dataset_root = Path(dataset_root).expanduser()
output_dir = Path(output_dir).expanduser()
output_dir.mkdir(parents=True, exist_ok=True)
prosody_label_path = dataset_root / "label_train-set.txt"
with open(prosody_label_path, 'rt') as f:
lines = [line.strip() for line in f]
records = lines[5:]
processed_records = []
for record in tqdm.tqdm(records):
new_record = load_aishell3_transcription(record)
processed_records.append(new_record)
print(new_record)
with open(output_dir / "metadata.pickle", 'wb') as f:
pickle.dump(processed_records, f)
with open(output_dir / "metadata.yaml", 'wt', encoding="utf-8") as f:
yaml.safe_dump(
processed_records, f, default_flow_style=None, allow_unicode=True)
print("metadata done!")
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Preprocess transcription of AiShell3 and save them in a compact file(yaml and pickle)."
)
parser.add_argument(
"--input",
type=str,
default="~/datasets/aishell3/train",
help="path of the training dataset,(contains a label_train-set.txt).")
parser.add_argument(
"--output",
type=str,
help="the directory to save the processed transcription."
"If not provided, it would be the same as the input.")
args = parser.parse_args()
if args.output is None:
args.output = args.input
process_aishell3(args.input, args.output)
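A hedged usage sketch for the preprocessing script above; the file name is a placeholder, and only the --input/--output flags defined by its argument parser are assumed:

    python preprocess_transcription.py --input ~/datasets/aishell3/train --output ~/exp/aishell3/preprocessed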
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from functools import partial
from multiprocessing import Pool
from pathlib import Path
import librosa
import numpy as np
import soundfile as sf
from praatio import textgrid
from tqdm import tqdm
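# get_valid_part() below returns the (start, end) time range of the useful part
# of an utterance: it skips a leading "sil" interval and drops a trailing "sp"
# interval found on the "phones" tier of the MFA TextGrid alignment.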
def get_valid_part(fpath):
f = textgrid.openTextgrid(fpath, includeEmptyIntervals=True)
start = 0
phone_entry_list = f.tierDict['phones'].entryList
first_entry = phone_entry_list[0]
if first_entry.label == "sil":
start = first_entry.end
last_entry = phone_entry_list[-1]
if last_entry.label == "sp":
end = last_entry.start
else:
end = last_entry.end
return start, end
def process_utterance(fpath, source_dir, target_dir, alignment_dir):
rel_path = fpath.relative_to(source_dir)
opath = target_dir / rel_path
apath = (alignment_dir / rel_path).with_suffix(".TextGrid")
opath.parent.mkdir(parents=True, exist_ok=True)
start, end = get_valid_part(apath)
wav, _ = librosa.load(fpath, sr=22050, offset=start, duration=end - start)
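    # peak-normalize the trimmed audio so its maximum sample value becomes
    # 0.999, then write it out as 16-bit PCM at 22050 Hz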
normalized_wav = wav / np.max(wav) * 0.999
sf.write(opath, normalized_wav, samplerate=22050, subtype='PCM_16')
# print(f"{fpath} => {opath}")
def preprocess_aishell3(source_dir, target_dir, alignment_dir):
source_dir = Path(source_dir).expanduser()
target_dir = Path(target_dir).expanduser()
alignment_dir = Path(alignment_dir).expanduser()
wav_paths = list(source_dir.rglob("*.wav"))
print(f"there are {len(wav_paths)} audio files in total")
fx = partial(
process_utterance,
source_dir=source_dir,
target_dir=target_dir,
alignment_dir=alignment_dir)
with Pool(16) as p:
list(
tqdm(p.imap(fx, wav_paths), total=len(wav_paths), unit="utterance"))
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Process audio in AiShell3, trim silence according to the alignment "
"files generated by MFA, and normalize volume by peak.")
parser.add_argument(
"--input",
type=str,
default="~/datasets/aishell3/train/wav",
help="path of the original audio folder in aishell3.")
parser.add_argument(
"--output",
type=str,
default="~/datasets/aishell3/train/normalized_wav",
help="path of the folder to save the processed audio files.")
parser.add_argument(
"--alignment",
type=str,
default="~/datasets/aishell3/train/alignment",
help="path of the alignment files.")
args = parser.parse_args()
preprocess_aishell3(args.input, args.output, args.alignment)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import time
from collections import defaultdict
from pathlib import Path
import numpy as np
import paddle
from matplotlib import pyplot as plt
from paddle import distributed as dist
from paddle.io import DataLoader
from paddle.io import DistributedBatchSampler
from paddlespeech.t2s.data import dataset
from paddlespeech.t2s.exps.voice_cloning.tacotron2_ge2e.aishell3 import AiShell3
from paddlespeech.t2s.exps.voice_cloning.tacotron2_ge2e.aishell3 import collate_aishell3_examples
from paddlespeech.t2s.exps.voice_cloning.tacotron2_ge2e.config import get_cfg_defaults
from paddlespeech.t2s.models.tacotron2 import Tacotron2
from paddlespeech.t2s.models.tacotron2 import Tacotron2Loss
from paddlespeech.t2s.training.cli import default_argument_parser
from paddlespeech.t2s.training.experiment import ExperimentBase
from paddlespeech.t2s.utils import display
from paddlespeech.t2s.utils import mp_tools
class Experiment(ExperimentBase):
def compute_losses(self, inputs, outputs):
texts, tones, mel_targets, utterance_embeds, text_lens, output_lens, stop_tokens = inputs
mel_outputs = outputs["mel_output"]
mel_outputs_postnet = outputs["mel_outputs_postnet"]
alignments = outputs["alignments"]
losses = self.criterion(mel_outputs, mel_outputs_postnet, mel_targets,
alignments, output_lens, text_lens)
return losses
def train_batch(self):
start = time.time()
batch = self.read_batch()
data_loader_time = time.time() - start
self.optimizer.clear_grad()
self.model.train()
texts, tones, mels, utterance_embeds, text_lens, output_lens, stop_tokens = batch
outputs = self.model(
texts,
text_lens,
mels,
output_lens,
tones=tones,
global_condition=utterance_embeds)
losses = self.compute_losses(batch, outputs)
loss = losses["loss"]
loss.backward()
self.optimizer.step()
iteration_time = time.time() - start
losses_np = {k: float(v) for k, v in losses.items()}
# logging
msg = "Rank: {}, ".format(dist.get_rank())
msg += "step: {}, ".format(self.iteration)
msg += "time: {:>.3f}s/{:>.3f}s, ".format(data_loader_time,
iteration_time)
msg += ', '.join('{}: {:>.6f}'.format(k, v)
for k, v in losses_np.items())
self.logger.info(msg)
if dist.get_rank() == 0:
for key, value in losses_np.items():
self.visualizer.add_scalar(f"train_loss/{key}", value,
self.iteration)
@mp_tools.rank_zero_only
@paddle.no_grad()
def valid(self):
valid_losses = defaultdict(list)
for i, batch in enumerate(self.valid_loader):
texts, tones, mels, utterance_embeds, text_lens, output_lens, stop_tokens = batch
outputs = self.model(
texts,
text_lens,
mels,
output_lens,
tones=tones,
global_condition=utterance_embeds)
losses = self.compute_losses(batch, outputs)
for key, value in losses.items():
valid_losses[key].append(float(value))
attention_weights = outputs["alignments"]
self.visualizer.add_figure(
f"valid_sentence_{i}_alignments",
display.plot_alignment(attention_weights[0].numpy().T),
self.iteration)
self.visualizer.add_figure(
f"valid_sentence_{i}_target_spectrogram",
display.plot_spectrogram(mels[0].numpy().T), self.iteration)
mel_pred = outputs['mel_outputs_postnet']
self.visualizer.add_figure(
f"valid_sentence_{i}_predicted_spectrogram",
display.plot_spectrogram(mel_pred[0].numpy().T), self.iteration)
# write visual log
valid_losses = {k: np.mean(v) for k, v in valid_losses.items()}
# logging
msg = "Valid: "
msg += "step: {}, ".format(self.iteration)
msg += ', '.join('{}: {:>.6f}'.format(k, v)
for k, v in valid_losses.items())
self.logger.info(msg)
for key, value in valid_losses.items():
self.visualizer.add_scalar(f"valid/{key}", value, self.iteration)
@mp_tools.rank_zero_only
@paddle.no_grad()
def eval(self):
"""Evaluation of Tacotron2 in autoregressive manner."""
self.model.eval()
mel_dir = Path(self.output_dir / ("eval_{}".format(self.iteration)))
mel_dir.mkdir(parents=True, exist_ok=True)
for i, batch in enumerate(self.test_loader):
texts, tones, mels, utterance_embeds, *_ = batch
outputs = self.model.infer(
texts, tones=tones, global_condition=utterance_embeds)
display.plot_alignment(outputs["alignments"][0].numpy().T)
plt.savefig(mel_dir / f"sentence_{i}.png")
plt.close()
np.save(mel_dir / f"sentence_{i}",
outputs["mel_outputs_postnet"][0].numpy().T)
print(f"sentence_{i}")
def setup_model(self):
config = self.config
model = Tacotron2(
vocab_size=config.model.vocab_size,
n_tones=config.model.n_tones,
d_mels=config.data.d_mels,
d_encoder=config.model.d_encoder,
encoder_conv_layers=config.model.encoder_conv_layers,
encoder_kernel_size=config.model.encoder_kernel_size,
d_prenet=config.model.d_prenet,
d_attention_rnn=config.model.d_attention_rnn,
d_decoder_rnn=config.model.d_decoder_rnn,
attention_filters=config.model.attention_filters,
attention_kernel_size=config.model.attention_kernel_size,
d_attention=config.model.d_attention,
d_postnet=config.model.d_postnet,
postnet_kernel_size=config.model.postnet_kernel_size,
postnet_conv_layers=config.model.postnet_conv_layers,
reduction_factor=config.model.reduction_factor,
p_encoder_dropout=config.model.p_encoder_dropout,
p_prenet_dropout=config.model.p_prenet_dropout,
p_attention_dropout=config.model.p_attention_dropout,
p_decoder_dropout=config.model.p_decoder_dropout,
p_postnet_dropout=config.model.p_postnet_dropout,
d_global_condition=config.model.d_global_condition,
use_stop_token=config.model.use_stop_token, )
if self.parallel:
model = paddle.DataParallel(model)
grad_clip = paddle.nn.ClipGradByGlobalNorm(
config.training.grad_clip_thresh)
optimizer = paddle.optimizer.Adam(
learning_rate=config.training.lr,
parameters=model.parameters(),
weight_decay=paddle.regularizer.L2Decay(
config.training.weight_decay),
grad_clip=grad_clip)
criterion = Tacotron2Loss(
use_stop_token_loss=config.model.use_stop_token,
use_guided_attention_loss=config.model.use_guided_attention_loss,
sigma=config.model.guided_attention_loss_sigma)
self.model = model
self.optimizer = optimizer
self.criterion = criterion
def setup_dataloader(self):
args = self.args
config = self.config
aishell3_dataset = AiShell3(args.data)
valid_set, train_set = dataset.split(aishell3_dataset,
config.data.valid_size)
batch_fn = collate_aishell3_examples
if not self.parallel:
self.train_loader = DataLoader(
train_set,
batch_size=config.data.batch_size,
shuffle=True,
drop_last=True,
collate_fn=batch_fn)
else:
sampler = DistributedBatchSampler(
train_set,
batch_size=config.data.batch_size,
shuffle=True,
drop_last=True)
self.train_loader = DataLoader(
train_set, batch_sampler=sampler, collate_fn=batch_fn)
self.valid_loader = DataLoader(
valid_set,
batch_size=config.data.batch_size,
shuffle=False,
drop_last=False,
collate_fn=batch_fn)
self.test_loader = DataLoader(
valid_set,
batch_size=1,
shuffle=False,
drop_last=False,
collate_fn=batch_fn)
def main_sp(config, args):
exp = Experiment(config, args)
exp.setup()
exp.resume_or_load()
if not args.test:
exp.run()
else:
exp.eval()
def main(config, args):
if args.ngpu > 1:
dist.spawn(main_sp, args=(config, args), nprocs=args.ngpu)
else:
main_sp(config, args)
if __name__ == "__main__":
config = get_cfg_defaults()
parser = default_argument_parser()
parser.add_argument("--test", action="store_true")
args = parser.parse_args()
if args.config:
config.merge_from_file(args.config)
if args.opts:
config.merge_from_list(args.opts)
config.freeze()
print(config)
print(args)
main(config, args)
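# A minimal launch sketch for this training entry point (the script name is an
# assumption; --config, --data, --output and --ngpu are assumed to come from
# default_argument_parser(), only --test is added explicitly above):
#
#   python train.py --config=conf/default.yaml --data=path/to/processed_aishell3 \
#       --output=exp/tacotron2_aishell3 --ngpu=1
#
# Pass --test to run Experiment.eval() on a resumed checkpoint instead of training.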
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
from pathlib import Path
import numpy as np
import paddle
import soundfile as sf
from matplotlib import pyplot as plt
from paddlespeech.t2s.exps.voice_cloning.tacotron2_ge2e.aishell3 import voc_phones
from paddlespeech.t2s.exps.voice_cloning.tacotron2_ge2e.aishell3 import voc_tones
from paddlespeech.t2s.exps.voice_cloning.tacotron2_ge2e.chinese_g2p import convert_sentence
from paddlespeech.t2s.models.tacotron2 import Tacotron2
from paddlespeech.t2s.models.waveflow import ConditionalWaveFlow
from paddlespeech.t2s.utils import display
from paddlespeech.vector.exps.ge2e.audio_processor import SpeakerVerificationPreprocessor
from paddlespeech.vector.models.lstm_speaker_encoder import LSTMSpeakerEncoder
def voice_cloning(args):
# speaker encoder
p = SpeakerVerificationPreprocessor(
sampling_rate=16000,
audio_norm_target_dBFS=-30,
vad_window_length=30,
vad_moving_average_width=8,
vad_max_silence_length=6,
mel_window_length=25,
mel_window_step=10,
n_mels=40,
partial_n_frames=160,
min_pad_coverage=0.75,
partial_overlap_ratio=0.5)
print("Audio Processor Done!")
speaker_encoder = LSTMSpeakerEncoder(
n_mels=40, num_layers=3, hidden_size=256, output_size=256)
speaker_encoder.set_state_dict(paddle.load(args.ge2e_params_path))
speaker_encoder.eval()
print("GE2E Done!")
synthesizer = Tacotron2(
vocab_size=68,
n_tones=10,
d_mels=80,
d_encoder=512,
encoder_conv_layers=3,
encoder_kernel_size=5,
d_prenet=256,
d_attention_rnn=1024,
d_decoder_rnn=1024,
attention_filters=32,
attention_kernel_size=31,
d_attention=128,
d_postnet=512,
postnet_kernel_size=5,
postnet_conv_layers=5,
reduction_factor=1,
p_encoder_dropout=0.5,
p_prenet_dropout=0.5,
p_attention_dropout=0.1,
p_decoder_dropout=0.1,
p_postnet_dropout=0.5,
d_global_condition=256,
use_stop_token=False, )
synthesizer.set_state_dict(paddle.load(args.tacotron2_params_path))
synthesizer.eval()
print("Tacotron2 Done!")
# vocoder
vocoder = ConditionalWaveFlow(
upsample_factors=[16, 16],
n_flows=8,
n_layers=8,
n_group=16,
channels=128,
n_mels=80,
kernel_size=[3, 3])
vocoder.set_state_dict(paddle.load(args.waveflow_params_path))
vocoder.eval()
print("WaveFlow Done!")
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
input_dir = Path(args.input_dir)
    # In the AISHELL-3 dataset, % and $ mark the boundaries of prosodic words and
    # prosodic phrases, which roughly correspond to short and long pauses, so % and $
    # can be used in the input text to control prosody.
    # Note that the valid character set of a sentence only contains Chinese characters
    # plus % and $, so the input sentence may only consist of these characters.
sentence = "每当你觉得%想要批评什么人的时候$你切要记着%这个世界上的人%并非都具备你禀有的条件$"
phones, tones = convert_sentence(sentence)
phones = np.array(
[voc_phones.lookup(item) for item in phones], dtype=np.int64)
tones = np.array([voc_tones.lookup(item) for item in tones], dtype=np.int64)
phones = paddle.to_tensor(phones).unsqueeze(0)
tones = paddle.to_tensor(tones).unsqueeze(0)
for name in os.listdir(input_dir):
utt_id = name.split(".")[0]
ref_audio_path = input_dir / name
mel_sequences = p.extract_mel_partials(p.preprocess_wav(ref_audio_path))
print("mel_sequences: ", mel_sequences.shape)
with paddle.no_grad():
embed = speaker_encoder.embed_utterance(
paddle.to_tensor(mel_sequences))
print("embed shape: ", embed.shape)
utterance_embeds = paddle.unsqueeze(embed, 0)
outputs = synthesizer.infer(
phones, tones=tones, global_condition=utterance_embeds)
mel_input = paddle.transpose(outputs["mel_outputs_postnet"], [0, 2, 1])
alignment = outputs["alignments"][0].numpy().T
display.plot_alignment(alignment)
plt.savefig(str(output_dir / (utt_id + ".png")))
with paddle.no_grad():
wav = vocoder.infer(mel_input)
wav = wav.numpy()[0]
sf.write(str(output_dir / (utt_id + ".wav")), wav, samplerate=22050)
def main():
    # parse args and config
parser = argparse.ArgumentParser(description="")
parser.add_argument(
"--ge2e_params_path", type=str, help="ge2e params path.")
parser.add_argument(
"--tacotron2_params_path", type=str, help="tacotron2 params path.")
parser.add_argument(
"--waveflow_params_path", type=str, help="waveflow params path.")
parser.add_argument(
"--ngpu", type=int, default=1, help="if ngpu=0, use cpu.")
parser.add_argument(
"--input-dir",
type=str,
help="input dir of *.wav, the sample rate will be resample to 16k.")
parser.add_argument("--output-dir", type=str, help="output dir.")
args = parser.parse_args()
if args.ngpu == 0:
paddle.set_device("cpu")
elif args.ngpu > 0:
paddle.set_device("gpu")
else:
print("ngpu should >= 0 !")
voice_cloning(args)
if __name__ == "__main__":
main()
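# A hedged usage sketch (the script name and checkpoint paths are assumptions):
#
#   python voice_cloning.py \
#       --ge2e_params_path=ge2e/step-3000000.pdparams \
#       --tacotron2_params_path=tacotron2/step-1000.pdparams \
#       --waveflow_params_path=waveflow/step-2000000.pdparams \
#       --input-dir=ref_audio --output-dir=syn_audio --ngpu=1
#
# For every *.wav in --input-dir, a speaker embedding is extracted with GE2E and
# the fixed sentence above is synthesized in that voice, writing <utt_id>.wav and
# an alignment plot <utt_id>.png to --output-dir.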
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
from pathlib import Path
import jsonlines
import numpy as np
import paddle
import soundfile as sf
import yaml
from paddle import distributed as dist
from timer import timer
from yacs.config import CfgNode
from paddlespeech.t2s.datasets.data_table import DataTable
from paddlespeech.t2s.models.wavernn import WaveRNN
def main():
parser = argparse.ArgumentParser(description="Synthesize with WaveRNN.")
parser.add_argument("--config", type=str, help="GANVocoder config file.")
parser.add_argument("--checkpoint", type=str, help="snapshot to load.")
parser.add_argument("--test-metadata", type=str, help="dev data.")
parser.add_argument("--output-dir", type=str, help="output dir.")
parser.add_argument(
"--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
args = parser.parse_args()
with open(args.config) as f:
config = CfgNode(yaml.safe_load(f))
print("========Args========")
print(yaml.safe_dump(vars(args)))
print("========Config========")
print(config)
    print(
        f"master sees the world size: {dist.get_world_size()}, from pid: {os.getpid()}"
    )
if args.ngpu == 0:
paddle.set_device("cpu")
elif args.ngpu > 0:
paddle.set_device("gpu")
else:
print("ngpu should >= 0 !")
model = WaveRNN(
hop_length=config.n_shift, sample_rate=config.fs, **config["model"])
state_dict = paddle.load(args.checkpoint)
model.set_state_dict(state_dict["main_params"])
model.eval()
with jsonlines.open(args.test_metadata, 'r') as reader:
metadata = list(reader)
test_dataset = DataTable(
metadata,
fields=['utt_id', 'feats'],
converters={
'utt_id': None,
'feats': np.load,
})
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
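    # N accumulates the number of generated samples and T the total generation
    # time (seconds), so N / T is the overall synthesis speed in Hz and
    # config.fs / (N / T) the real-time factor reported at the end.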
N = 0
T = 0
for example in test_dataset:
utt_id = example['utt_id']
mel = example['feats']
mel = paddle.to_tensor(mel) # (T, C)
with timer() as t:
with paddle.no_grad():
wav = model.generate(
c=mel,
batched=config.inference.gen_batched,
target=config.inference.target,
overlap=config.inference.overlap,
mu_law=config.mu_law,
gen_display=True)
wav = wav.numpy()
N += wav.size
T += t.elapse
speed = wav.size / t.elapse
rtf = config.fs / speed
print(
f"{utt_id}, mel: {mel.shape}, wave: {wav.shape}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}."
)
sf.write(str(output_dir / (utt_id + ".wav")), wav, samplerate=config.fs)
print(f"generation speed: {N / T}Hz, RTF: {config.fs / (N / T) }")
if __name__ == "__main__":
main()
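# A usage sketch (the script name and paths are assumptions, not taken from this code):
#
#   python synthesize.py --config=wavernn.yaml --checkpoint=snapshot_iter_400000.pdz \
#       --test-metadata=dump/test/norm/metadata.jsonl --output-dir=wav_out --ngpu=1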
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import shutil
from pathlib import Path
import jsonlines
import numpy as np
import paddle
import yaml
from paddle import DataParallel
from paddle import distributed as dist
from paddle.io import DataLoader
from paddle.io import DistributedBatchSampler
from paddle.optimizer import Adam
from yacs.config import CfgNode
from paddlespeech.t2s.datasets.data_table import DataTable
from paddlespeech.t2s.datasets.vocoder_batch_fn import WaveRNNClip
from paddlespeech.t2s.models.wavernn import WaveRNN
from paddlespeech.t2s.models.wavernn import WaveRNNEvaluator
from paddlespeech.t2s.models.wavernn import WaveRNNUpdater
from paddlespeech.t2s.modules.losses import discretized_mix_logistic_loss
from paddlespeech.t2s.training.extensions.snapshot import Snapshot
from paddlespeech.t2s.training.extensions.visualizer import VisualDL
from paddlespeech.t2s.training.seeding import seed_everything
from paddlespeech.t2s.training.trainer import Trainer
def train_sp(args, config):
# decides device type and whether to run in parallel
# setup running environment correctly
world_size = paddle.distributed.get_world_size()
if (not paddle.is_compiled_with_cuda()) or args.ngpu == 0:
paddle.set_device("cpu")
else:
paddle.set_device("gpu")
if world_size > 1:
paddle.distributed.init_parallel_env()
# set the random seed, it is a must for multiprocess training
seed_everything(config.seed)
print(
f"rank: {dist.get_rank()}, pid: {os.getpid()}, parent_pid: {os.getppid()}",
)
# construct dataset for training and validation
with jsonlines.open(args.train_metadata, 'r') as reader:
train_metadata = list(reader)
train_dataset = DataTable(
data=train_metadata,
fields=["wave", "feats"],
converters={
"wave": np.load,
"feats": np.load,
}, )
with jsonlines.open(args.dev_metadata, 'r') as reader:
dev_metadata = list(reader)
dev_dataset = DataTable(
data=dev_metadata,
fields=["wave", "feats"],
converters={
"wave": np.load,
"feats": np.load,
}, )
batch_fn = WaveRNNClip(
mode=config.model.mode,
aux_context_window=config.model.aux_context_window,
hop_size=config.n_shift,
batch_max_steps=config.batch_max_steps,
bits=config.model.bits)
# collate function and dataloader
train_sampler = DistributedBatchSampler(
train_dataset,
batch_size=config.batch_size,
shuffle=True,
drop_last=True)
dev_sampler = DistributedBatchSampler(
dev_dataset,
batch_size=config.batch_size,
shuffle=False,
drop_last=False)
print("samplers done!")
train_dataloader = DataLoader(
train_dataset,
batch_sampler=train_sampler,
collate_fn=batch_fn,
num_workers=config.num_workers)
dev_dataloader = DataLoader(
dev_dataset,
collate_fn=batch_fn,
batch_sampler=dev_sampler,
num_workers=config.num_workers)
valid_generate_loader = DataLoader(dev_dataset, batch_size=1)
print("dataloaders done!")
model = WaveRNN(
hop_length=config.n_shift, sample_rate=config.fs, **config["model"])
if world_size > 1:
model = DataParallel(model)
print("model done!")
if config.model.mode == 'RAW':
criterion = paddle.nn.CrossEntropyLoss(axis=1)
elif config.model.mode == 'MOL':
criterion = discretized_mix_logistic_loss
    else:
        raise RuntimeError('Unknown model mode value - ', config.model.mode)
print("criterions done!")
clip = paddle.nn.ClipGradByGlobalNorm(config.grad_clip)
optimizer = Adam(
parameters=model.parameters(),
learning_rate=config.learning_rate,
grad_clip=clip)
print("optimizer done!")
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
if dist.get_rank() == 0:
config_name = args.config.split("/")[-1]
# copy conf to output_dir
shutil.copyfile(args.config, output_dir / config_name)
updater = WaveRNNUpdater(
model=model,
optimizer=optimizer,
criterion=criterion,
dataloader=train_dataloader,
output_dir=output_dir,
mode=config.model.mode)
evaluator = WaveRNNEvaluator(
model=model,
dataloader=dev_dataloader,
criterion=criterion,
output_dir=output_dir,
valid_generate_loader=valid_generate_loader,
config=config)
trainer = Trainer(
updater,
stop_trigger=(config.train_max_steps, "iteration"),
out=output_dir)
if dist.get_rank() == 0:
trainer.extend(
evaluator, trigger=(config.eval_interval_steps, 'iteration'))
trainer.extend(VisualDL(output_dir), trigger=(1, 'iteration'))
trainer.extend(
Snapshot(max_size=config.num_snapshots),
trigger=(config.save_interval_steps, 'iteration'))
print("Trainer Done!")
trainer.run()
def main():
# parse args and config and redirect to train_sp
    parser = argparse.ArgumentParser(description="Train a WaveRNN model.")
parser.add_argument(
"--config", type=str, help="config file to overwrite default config.")
parser.add_argument("--train-metadata", type=str, help="training data.")
parser.add_argument("--dev-metadata", type=str, help="dev data.")
parser.add_argument("--output-dir", type=str, help="output dir.")
parser.add_argument(
"--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = CfgNode(yaml.safe_load(f))
print("========Args========")
print(yaml.safe_dump(vars(args)))
print("========Config========")
print(config)
    print(
        f"master sees the world size: {dist.get_world_size()}, from pid: {os.getpid()}"
    )
# dispatch
if args.ngpu > 1:
dist.spawn(train_sp, (args, config), nprocs=args.ngpu)
else:
train_sp(args, config)
if __name__ == "__main__":
main()
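# A usage sketch (the script name and paths are assumptions, not taken from this code):
#
#   python train.py --config=wavernn.yaml \
#       --train-metadata=dump/train/norm/metadata.jsonl \
#       --dev-metadata=dump/dev/norm/metadata.jsonl \
#       --output-dir=exp/wavernn --ngpu=1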
...
@@ -17,6 +17,6 @@ from .melgan import *
 from .new_tacotron2 import *
 from .parallel_wavegan import *
 from .speedyspeech import *
-from .tacotron2 import *
 from .transformer_tts import *
 from .waveflow import *
+from .wavernn import *
@@ -432,6 +432,7 @@ class Tacotron2(nn.Layer):
         # inference
         h = self.enc.inference(x)
         if self.spk_num is not None:
             sid_emb = self.sid_emb(spk_id.reshape([-1]))
             h = h + sid_emb
@@ -478,7 +479,7 @@ class Tacotron2(nn.Layer):
         elif self.spk_embed_integration_type == "concat":
             # concat hidden states with spk embeds
             spk_emb = F.normalize(spk_emb).unsqueeze(1).expand(
-                -1, paddle.shape(hs)[1], -1)
+                shape=[-1, paddle.shape(hs)[1], -1])
             hs = paddle.concat([hs, spk_emb], axis=-1)
         else:
             raise NotImplementedError("support only add or concat.")
...
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import paddle
from paddle import nn
from paddle.fluid.layers import sequence_mask
from paddle.nn import functional as F
from paddle.nn import initializer as I
from tqdm import trange
from paddlespeech.t2s.modules.conv import Conv1dBatchNorm
from paddlespeech.t2s.modules.losses import guided_attention_loss
from paddlespeech.t2s.utils import checkpoint
__all__ = ["Tacotron2", "Tacotron2Loss"]
class LocationSensitiveAttention(nn.Layer):
"""Location Sensitive Attention module.
Reference: `Attention-Based Models for Speech Recognition <https://arxiv.org/pdf/1506.07503.pdf>`_
Parameters
-----------
d_query: int
The feature size of query.
d_key : int
The feature size of key.
d_attention : int
        The hidden size of the attention layer.
location_filters : int
Filter size of attention convolution.
location_kernel_size : int
Kernel size of attention convolution.
"""
def __init__(self,
d_query: int,
d_key: int,
d_attention: int,
location_filters: int,
location_kernel_size: int):
super().__init__()
self.query_layer = nn.Linear(d_query, d_attention, bias_attr=False)
self.key_layer = nn.Linear(d_key, d_attention, bias_attr=False)
self.value = nn.Linear(d_attention, 1, bias_attr=False)
# Location Layer
self.location_conv = nn.Conv1D(
2,
location_filters,
kernel_size=location_kernel_size,
padding=int((location_kernel_size - 1) / 2),
bias_attr=False,
data_format='NLC')
self.location_layer = nn.Linear(
location_filters, d_attention, bias_attr=False)
def forward(self,
query,
processed_key,
value,
attention_weights_cat,
mask=None):
"""Compute context vector and attention weights.
Parameters
-----------
query : Tensor [shape=(batch_size, d_query)]
The queries.
processed_key : Tensor [shape=(batch_size, time_steps_k, d_attention)]
The keys after linear layer.
value : Tensor [shape=(batch_size, time_steps_k, d_key)]
The values.
attention_weights_cat : Tensor [shape=(batch_size, time_step_k, 2)]
            Concatenation of the previous and cumulative attention weights.
mask : Tensor, optional
            The mask. Shape should be (batch_size, time_steps_k, 1).
Defaults to None.
Returns
----------
attention_context : Tensor [shape=(batch_size, d_attention)]
The context vector.
attention_weights : Tensor [shape=(batch_size, time_steps_k)]
The attention weights.
"""
processed_query = self.query_layer(paddle.unsqueeze(query, axis=[1]))
processed_attention_weights = self.location_layer(
self.location_conv(attention_weights_cat))
# (B, T_enc, 1)
alignment = self.value(
paddle.tanh(processed_attention_weights + processed_key +
processed_query))
if mask is not None:
alignment = alignment + (1.0 - mask) * -1e9
attention_weights = F.softmax(alignment, axis=1)
attention_context = paddle.matmul(
attention_weights, value, transpose_x=True)
attention_weights = paddle.squeeze(attention_weights, axis=-1)
attention_context = paddle.squeeze(attention_context, axis=1)
return attention_context, attention_weights
class DecoderPreNet(nn.Layer):
"""Decoder prenet module for Tacotron2.
Parameters
----------
d_input: int
The input feature size.
d_hidden: int
The hidden size.
d_output: int
The output feature size.
dropout_rate: float
        The dropout probability.
"""
def __init__(self,
d_input: int,
d_hidden: int,
d_output: int,
dropout_rate: float):
super().__init__()
self.dropout_rate = dropout_rate
self.linear1 = nn.Linear(d_input, d_hidden, bias_attr=False)
self.linear2 = nn.Linear(d_hidden, d_output, bias_attr=False)
def forward(self, x):
"""Calculate forward propagation.
Parameters
----------
x: Tensor [shape=(B, T_mel, C)]
Batch of the sequences of padded mel spectrogram.
Returns
-------
output: Tensor [shape=(B, T_mel, C)]
Batch of the sequences of padded hidden state.
"""
x = F.dropout(F.relu(self.linear1(x)), self.dropout_rate, training=True)
output = F.dropout(
F.relu(self.linear2(x)), self.dropout_rate, training=True)
return output
class DecoderPostNet(nn.Layer):
"""Decoder postnet module for Tacotron2.
Parameters
----------
d_mels: int
The number of mel bands.
d_hidden: int
The hidden size of postnet.
kernel_size: int
The kernel size of the conv layer in postnet.
num_layers: int
The number of conv layers in postnet.
dropout: float
        The dropout probability.
"""
def __init__(self,
d_mels: int,
d_hidden: int,
kernel_size: int,
num_layers: int,
dropout: float):
super().__init__()
self.dropout = dropout
self.num_layers = num_layers
padding = int((kernel_size - 1) / 2)
self.conv_batchnorms = nn.LayerList()
k = math.sqrt(1.0 / (d_mels * kernel_size))
self.conv_batchnorms.append(
Conv1dBatchNorm(
d_mels,
d_hidden,
kernel_size=kernel_size,
padding=padding,
bias_attr=I.Uniform(-k, k),
data_format='NLC'))
k = math.sqrt(1.0 / (d_hidden * kernel_size))
self.conv_batchnorms.extend([
Conv1dBatchNorm(
d_hidden,
d_hidden,
kernel_size=kernel_size,
padding=padding,
bias_attr=I.Uniform(-k, k),
data_format='NLC') for i in range(1, num_layers - 1)
])
self.conv_batchnorms.append(
Conv1dBatchNorm(
d_hidden,
d_mels,
kernel_size=kernel_size,
padding=padding,
bias_attr=I.Uniform(-k, k),
data_format='NLC'))
def forward(self, x):
"""Calculate forward propagation.
Parameters
----------
x: Tensor [shape=(B, T_mel, C)]
Output sequence of features from decoder.
Returns
-------
output: Tensor [shape=(B, T_mel, C)]
Output sequence of features after postnet.
"""
for i in range(len(self.conv_batchnorms) - 1):
x = F.dropout(
F.tanh(self.conv_batchnorms[i](x)),
self.dropout,
training=self.training)
output = F.dropout(
self.conv_batchnorms[self.num_layers - 1](x),
self.dropout,
training=self.training)
return output
class Tacotron2Encoder(nn.Layer):
"""Tacotron2 encoder module for Tacotron2.
Parameters
----------
d_hidden: int
The hidden size in encoder module.
conv_layers: int
The number of conv layers.
kernel_size: int
The kernel size of conv layers.
p_dropout: float
        The dropout probability.
"""
def __init__(self,
d_hidden: int,
conv_layers: int,
kernel_size: int,
p_dropout: float):
super().__init__()
k = math.sqrt(1.0 / (d_hidden * kernel_size))
self.conv_batchnorms = nn.LayerList([
Conv1dBatchNorm(
d_hidden,
d_hidden,
kernel_size,
stride=1,
padding=int((kernel_size - 1) / 2),
bias_attr=I.Uniform(-k, k),
data_format='NLC') for i in range(conv_layers)
])
self.p_dropout = p_dropout
self.hidden_size = int(d_hidden / 2)
self.lstm = nn.LSTM(
d_hidden, self.hidden_size, direction="bidirectional")
def forward(self, x, input_lens=None):
"""Calculate forward propagation of tacotron2 encoder.
Parameters
----------
x: Tensor [shape=(B, T, C)]
Input embeddings.
        input_lens: Tensor [shape=(B,)], optional
            Batch of lengths of each input batch. Defaults to None.
Returns
-------
output : Tensor [shape=(B, T, C)]
Batch of the sequences of padded hidden states.
"""
for conv_batchnorm in self.conv_batchnorms:
x = F.dropout(
F.relu(conv_batchnorm(x)),
self.p_dropout,
training=self.training)
output, _ = self.lstm(inputs=x, sequence_length=input_lens)
return output
class Tacotron2Decoder(nn.Layer):
"""Tacotron2 decoder module for Tacotron2.
Parameters
----------
d_mels: int
The number of mel bands.
reduction_factor: int
The reduction factor of tacotron.
d_encoder: int
The hidden size of encoder.
d_prenet: int
The hidden size in decoder prenet.
d_attention_rnn: int
The attention rnn layer hidden size.
d_decoder_rnn: int
The decoder rnn layer hidden size.
d_attention: int
The hidden size of the linear layer in location sensitive attention.
attention_filters: int
The filter size of the conv layer in location sensitive attention.
attention_kernel_size: int
The kernel size of the conv layer in location sensitive attention.
    p_prenet_dropout: float
        The dropout probability in decoder prenet.
    p_attention_dropout: float
        The dropout probability in location sensitive attention.
    p_decoder_dropout: float
        The dropout probability in decoder.
use_stop_token: bool
Whether to use a binary classifier for stop token prediction.
Defaults to False
"""
def __init__(self,
d_mels: int,
reduction_factor: int,
d_encoder: int,
d_prenet: int,
d_attention_rnn: int,
d_decoder_rnn: int,
d_attention: int,
attention_filters: int,
attention_kernel_size: int,
p_prenet_dropout: float,
p_attention_dropout: float,
p_decoder_dropout: float,
use_stop_token: bool=False):
super().__init__()
self.d_mels = d_mels
self.reduction_factor = reduction_factor
self.d_encoder = d_encoder
self.d_attention_rnn = d_attention_rnn
self.d_decoder_rnn = d_decoder_rnn
self.p_attention_dropout = p_attention_dropout
self.p_decoder_dropout = p_decoder_dropout
self.prenet = DecoderPreNet(
d_mels * reduction_factor,
d_prenet,
d_prenet,
dropout_rate=p_prenet_dropout)
        # attention_rnn takes the attention's context vector as an
        # auxiliary input
self.attention_rnn = nn.LSTMCell(d_prenet + d_encoder, d_attention_rnn)
self.attention_layer = LocationSensitiveAttention(
d_attention_rnn, d_encoder, d_attention, attention_filters,
attention_kernel_size)
        # decoder_rnn takes attention_rnn's output and the attention
        # context as input
self.decoder_rnn = nn.LSTMCell(d_attention_rnn + d_encoder,
d_decoder_rnn)
self.linear_projection = nn.Linear(d_decoder_rnn + d_encoder,
d_mels * reduction_factor)
self.use_stop_token = use_stop_token
if use_stop_token:
self.stop_layer = nn.Linear(d_decoder_rnn + d_encoder, 1)
# states - temporary attributes
self.attention_hidden = None
self.attention_cell = None
self.decoder_hidden = None
self.decoder_cell = None
self.attention_weights = None
self.attention_weights_cum = None
self.attention_context = None
self.key = None
self.mask = None
self.processed_key = None
def _initialize_decoder_states(self, key):
"""init states be used in decoder
"""
batch_size, encoder_steps, _ = key.shape
self.attention_hidden = paddle.zeros(
shape=[batch_size, self.d_attention_rnn], dtype=key.dtype)
self.attention_cell = paddle.zeros(
shape=[batch_size, self.d_attention_rnn], dtype=key.dtype)
self.decoder_hidden = paddle.zeros(
shape=[batch_size, self.d_decoder_rnn], dtype=key.dtype)
self.decoder_cell = paddle.zeros(
shape=[batch_size, self.d_decoder_rnn], dtype=key.dtype)
self.attention_weights = paddle.zeros(
shape=[batch_size, encoder_steps], dtype=key.dtype)
self.attention_weights_cum = paddle.zeros(
shape=[batch_size, encoder_steps], dtype=key.dtype)
self.attention_context = paddle.zeros(
shape=[batch_size, self.d_encoder], dtype=key.dtype)
self.key = key # [B, T, C]
# pre-compute projected keys to improve efficiency
self.processed_key = self.attention_layer.key_layer(key) # [B, T, C]
def _decode(self, query):
"""decode one time step
"""
cell_input = paddle.concat([query, self.attention_context], axis=-1)
# The first lstm layer (or spec encoder lstm)
_, (self.attention_hidden, self.attention_cell) = self.attention_rnn(
cell_input, (self.attention_hidden, self.attention_cell))
self.attention_hidden = F.dropout(
self.attention_hidden,
self.p_attention_dropout,
training=self.training)
        # Location sensitive attention
attention_weights_cat = paddle.stack(
[self.attention_weights, self.attention_weights_cum], axis=-1)
self.attention_context, self.attention_weights = self.attention_layer(
self.attention_hidden, self.processed_key, self.key,
attention_weights_cat, self.mask)
self.attention_weights_cum += self.attention_weights
# The second lstm layer (or spec decoder lstm)
decoder_input = paddle.concat(
[self.attention_hidden, self.attention_context], axis=-1)
_, (self.decoder_hidden, self.decoder_cell) = self.decoder_rnn(
decoder_input, (self.decoder_hidden, self.decoder_cell))
self.decoder_hidden = F.dropout(
self.decoder_hidden,
p=self.p_decoder_dropout,
training=self.training)
# decode output one step
decoder_hidden_attention_context = paddle.concat(
[self.decoder_hidden, self.attention_context], axis=-1)
decoder_output = self.linear_projection(
decoder_hidden_attention_context)
if self.use_stop_token:
stop_logit = self.stop_layer(decoder_hidden_attention_context)
return decoder_output, self.attention_weights, stop_logit
return decoder_output, self.attention_weights
def forward(self, keys, querys, mask):
"""Calculate forward propagation of tacotron2 decoder.
Parameters
----------
keys: Tensor[shape=(B, T_key, C)]
Batch of the sequences of padded output from encoder.
querys: Tensor[shape(B, T_query, C)]
Batch of the sequences of padded mel spectrogram.
mask: Tensor
Mask generated with text length. Shape should be (B, T_key, 1).
Returns
-------
mel_output: Tensor [shape=(B, T_query, C)]
Output sequence of features.
alignments: Tensor [shape=(B, T_query, T_key)]
Attention weights.
"""
self._initialize_decoder_states(keys)
self.mask = mask
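        # group the target frames by the reduction factor and prepend an all-zero
        # "go" frame, so each decoder step consumes/predicts reduction_factor frames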
querys = paddle.reshape(
querys,
[querys.shape[0], querys.shape[1] // self.reduction_factor, -1])
start_step = paddle.zeros(
shape=[querys.shape[0], 1, querys.shape[-1]], dtype=querys.dtype)
querys = paddle.concat([start_step, querys], axis=1)
querys = self.prenet(querys)
mel_outputs, alignments = [], []
stop_logits = []
# Ignore the last time step
while len(mel_outputs) < querys.shape[1] - 1:
query = querys[:, len(mel_outputs), :]
if self.use_stop_token:
mel_output, attention_weights, stop_logit = self._decode(query)
else:
mel_output, attention_weights = self._decode(query)
mel_outputs.append(mel_output)
alignments.append(attention_weights)
if self.use_stop_token:
stop_logits.append(stop_logit)
alignments = paddle.stack(alignments, axis=1)
mel_outputs = paddle.stack(mel_outputs, axis=1)
if self.use_stop_token:
stop_logits = paddle.concat(stop_logits, axis=1)
return mel_outputs, alignments, stop_logits
return mel_outputs, alignments
def infer(self, key, max_decoder_steps=1000):
"""Calculate forward propagation of tacotron2 decoder.
Parameters
----------
        key: Tensor [shape=(B, T_key, C)]
            Batch of the sequences of padded output from encoder.
        max_decoder_steps: int, optional
            Maximum number of decoder steps for synthesis. Defaults to 1000.
Returns
-------
mel_output: Tensor [shape=(B, T_mel, C)]
Output sequence of features.
alignments: Tensor [shape=(B, T_mel, T_key)]
Attention weights.
"""
self._initialize_decoder_states(key)
self.mask = None # mask is not needed for single instance inference
encoder_steps = key.shape[1]
# [B, C]
start_step = paddle.zeros(
shape=[key.shape[0], self.d_mels * self.reduction_factor],
dtype=key.dtype)
query = start_step # [B, C]
first_hit_end = None
mel_outputs, alignments = [], []
stop_logits = []
for i in trange(max_decoder_steps):
query = self.prenet(query)
if self.use_stop_token:
mel_output, alignment, stop_logit = self._decode(query)
else:
mel_output, alignment = self._decode(query)
mel_outputs.append(mel_output)
alignments.append(alignment) # (B=1, T)
if self.use_stop_token:
stop_logits.append(stop_logit)
if self.use_stop_token:
if F.sigmoid(stop_logit) > 0.5:
print("hit stop condition!")
break
else:
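                # without a stop token, use a heuristic stop condition: once the
                # attention first focuses on the last encoder step, decode at most
                # 20 more steps before stopping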
if int(paddle.argmax(alignment[0])) == encoder_steps - 1:
if first_hit_end is None:
first_hit_end = i
elif i > (first_hit_end + 20):
print("content exhausted!")
break
if len(mel_outputs) == max_decoder_steps:
print("Warning! Reached max decoder steps!!!")
break
query = mel_output
alignments = paddle.stack(alignments, axis=1)
mel_outputs = paddle.stack(mel_outputs, axis=1)
if self.use_stop_token:
stop_logits = paddle.concat(stop_logits, axis=1)
return mel_outputs, alignments, stop_logits
return mel_outputs, alignments
class Tacotron2(nn.Layer):
"""Tacotron2 model for end-to-end text-to-speech (E2E-TTS).
This is a model of Spectrogram prediction network in Tacotron2 described
in `Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram
Predictions <https://arxiv.org/abs/1712.05884>`_,
which converts the sequence of characters
into the sequence of mel spectrogram.
Parameters
----------
vocab_size : int
        Vocabulary size of phones of the model.
n_tones: int
Vocabulary size of tones of the model. Defaults to None. If provided,
the model has an extra tone embedding.
d_mels: int
Number of mel bands.
d_encoder: int
Hidden size in encoder module.
encoder_conv_layers: int
Number of conv layers in encoder.
encoder_kernel_size: int
Kernel size of conv layers in encoder.
d_prenet: int
Hidden size in decoder prenet.
d_attention_rnn: int
Attention rnn layer hidden size in decoder.
d_decoder_rnn: int
Decoder rnn layer hidden size in decoder.
attention_filters: int
Filter size of the conv layer in location sensitive attention.
attention_kernel_size: int
Kernel size of the conv layer in location sensitive attention.
d_attention: int
Hidden size of the linear layer in location sensitive attention.
d_postnet: int
Hidden size of postnet.
postnet_kernel_size: int
Kernel size of the conv layer in postnet.
postnet_conv_layers: int
Number of conv layers in postnet.
reduction_factor: int
Reduction factor of tacotron2.
    p_encoder_dropout: float
        Dropout probability in encoder.
    p_prenet_dropout: float
        Dropout probability in decoder prenet.
    p_attention_dropout: float
        Dropout probability in location sensitive attention.
    p_decoder_dropout: float
        Dropout probability in decoder.
    p_postnet_dropout: float
        Dropout probability in postnet.
d_global_condition: int
        Feature size of global condition. Defaults to None. If provided, the
model assumes a global condition that is concatenated to the encoder
outputs.
"""
def __init__(self,
vocab_size,
n_tones=None,
d_mels: int=80,
d_encoder: int=512,
encoder_conv_layers: int=3,
encoder_kernel_size: int=5,
d_prenet: int=256,
d_attention_rnn: int=1024,
d_decoder_rnn: int=1024,
attention_filters: int=32,
attention_kernel_size: int=31,
d_attention: int=128,
d_postnet: int=512,
postnet_kernel_size: int=5,
postnet_conv_layers: int=5,
reduction_factor: int=1,
p_encoder_dropout: float=0.5,
p_prenet_dropout: float=0.5,
p_attention_dropout: float=0.1,
p_decoder_dropout: float=0.1,
p_postnet_dropout: float=0.5,
d_global_condition=None,
use_stop_token=False):
super().__init__()
std = math.sqrt(2.0 / (vocab_size + d_encoder))
val = math.sqrt(3.0) * std # uniform bounds for std
self.embedding = nn.Embedding(
vocab_size, d_encoder, weight_attr=I.Uniform(-val, val))
if n_tones:
self.embedding_tones = nn.Embedding(
n_tones,
d_encoder,
padding_idx=0,
weight_attr=I.Uniform(-0.1 * val, 0.1 * val))
self.toned = n_tones is not None
self.encoder = Tacotron2Encoder(d_encoder, encoder_conv_layers,
encoder_kernel_size, p_encoder_dropout)
# input augmentation scheme: concat global condition to the encoder output
if d_global_condition is not None:
d_encoder += d_global_condition
self.decoder = Tacotron2Decoder(
d_mels,
reduction_factor,
d_encoder,
d_prenet,
d_attention_rnn,
d_decoder_rnn,
d_attention,
attention_filters,
attention_kernel_size,
p_prenet_dropout,
p_attention_dropout,
p_decoder_dropout,
use_stop_token=use_stop_token)
self.postnet = DecoderPostNet(
d_mels=d_mels * reduction_factor,
d_hidden=d_postnet,
kernel_size=postnet_kernel_size,
num_layers=postnet_conv_layers,
dropout=p_postnet_dropout)
def forward(self,
text_inputs,
text_lens,
mels,
output_lens=None,
tones=None,
global_condition=None):
"""Calculate forward propagation of tacotron2.
Parameters
----------
text_inputs: Tensor [shape=(B, T_text)]
            Batch of the sequences of padded character ids.
text_lens: Tensor [shape=(B,)]
Batch of lengths of each text input batch.
mels: Tensor [shape(B, T_mel, C)]
Batch of the sequences of padded mel spectrogram.
output_lens: Tensor [shape=(B,)], optional
Batch of lengths of each mels batch. Defaults to None.
tones: Tensor [shape=(B, T_text)]
Batch of sequences of padded tone ids.
global_condition: Tensor [shape(B, C)]
Batch of global conditions. Defaults to None. If the
`d_global_condition` of the model is not None, this input should be
provided.
        Returns
        -------
        outputs : Dict[str, Tensor]
            mel_output: output sequence of features (B, T_mel, C);
            mel_outputs_postnet: output sequence of features after postnet (B, T_mel, C);
            alignments: attention weights (B, T_mel, T_text);
            stop_logits: output sequence of stop logits (B, T_mel), only
                present when `use_stop_token` is True
"""
# input of embedding must be int64
text_inputs = paddle.cast(text_inputs, 'int64')
embedded_inputs = self.embedding(text_inputs)
if self.toned:
embedded_inputs += self.embedding_tones(tones)
encoder_outputs = self.encoder(embedded_inputs, text_lens)
if global_condition is not None:
global_condition = global_condition.unsqueeze(1)
global_condition = paddle.expand(global_condition,
[-1, encoder_outputs.shape[1], -1])
encoder_outputs = paddle.concat([encoder_outputs, global_condition],
-1)
# [B, T_enc, 1]
mask = sequence_mask(
text_lens, dtype=encoder_outputs.dtype).unsqueeze(-1)
if self.decoder.use_stop_token:
mel_outputs, alignments, stop_logits = self.decoder(
encoder_outputs, mels, mask=mask)
else:
mel_outputs, alignments = self.decoder(
encoder_outputs, mels, mask=mask)
mel_outputs_postnet = self.postnet(mel_outputs)
mel_outputs_postnet = mel_outputs + mel_outputs_postnet
if output_lens is not None:
# [B, T_dec, 1]
mask = sequence_mask(output_lens).unsqueeze(-1)
mel_outputs = mel_outputs * mask # [B, T, C]
mel_outputs_postnet = mel_outputs_postnet * mask # [B, T, C]
outputs = {
"mel_output": mel_outputs,
"mel_outputs_postnet": mel_outputs_postnet,
"alignments": alignments
}
if self.decoder.use_stop_token:
outputs["stop_logits"] = stop_logits
return outputs
@paddle.no_grad()
def infer(self,
text_inputs,
max_decoder_steps=1000,
tones=None,
global_condition=None):
"""Generate the mel sepctrogram of features given the sequences of character ids.
Parameters
----------
text_inputs: Tensor [shape=(B, T_text)]
Batch of the sequencees of padded character ids.
max_decoder_steps: int, optional
Number of max step when synthesize. Defaults to 1000.
Returns
-------
        outputs : Dict[str, Tensor]
            mel_output: output sequence of spectrogram (B, T_mel, C);
            mel_outputs_postnet: output sequence of spectrogram after postnet (B, T_mel, C);
            alignments: attention weights (B, T_mel, T_text);
            stop_logits: output sequence of stop logits (B, T_mel). This key is only
                present when `use_stop_token` is True.
"""
# input of embedding must be int64
text_inputs = paddle.cast(text_inputs, 'int64')
embedded_inputs = self.embedding(text_inputs)
if self.toned:
embedded_inputs += self.embedding_tones(tones)
encoder_outputs = self.encoder(embedded_inputs)
if global_condition is not None:
global_condition = global_condition.unsqueeze(1)
global_condition = paddle.expand(global_condition,
[-1, encoder_outputs.shape[1], -1])
encoder_outputs = paddle.concat([encoder_outputs, global_condition],
-1)
if self.decoder.use_stop_token:
mel_outputs, alignments, stop_logits = self.decoder.infer(
encoder_outputs, max_decoder_steps=max_decoder_steps)
else:
mel_outputs, alignments = self.decoder.infer(
encoder_outputs, max_decoder_steps=max_decoder_steps)
mel_outputs_postnet = self.postnet(mel_outputs)
mel_outputs_postnet = mel_outputs + mel_outputs_postnet
outputs = {
"mel_output": mel_outputs,
"mel_outputs_postnet": mel_outputs_postnet,
"alignments": alignments
}
if self.decoder.use_stop_token:
outputs["stop_logits"] = stop_logits
return outputs
@classmethod
def from_pretrained(cls, config, checkpoint_path):
"""Build a Tacotron2 model from a pretrained model.
Parameters
----------
config: yacs.config.CfgNode
model configs
checkpoint_path: Path or str
the path of pretrained model checkpoint, without extension name
Returns
-------
        Tacotron2
            The model built from the pretrained checkpoint.
"""
model = cls(vocab_size=config.model.vocab_size,
n_tones=config.model.n_tones,
d_mels=config.data.n_mels,
d_encoder=config.model.d_encoder,
encoder_conv_layers=config.model.encoder_conv_layers,
encoder_kernel_size=config.model.encoder_kernel_size,
d_prenet=config.model.d_prenet,
d_attention_rnn=config.model.d_attention_rnn,
d_decoder_rnn=config.model.d_decoder_rnn,
attention_filters=config.model.attention_filters,
attention_kernel_size=config.model.attention_kernel_size,
d_attention=config.model.d_attention,
d_postnet=config.model.d_postnet,
postnet_kernel_size=config.model.postnet_kernel_size,
postnet_conv_layers=config.model.postnet_conv_layers,
reduction_factor=config.model.reduction_factor,
p_encoder_dropout=config.model.p_encoder_dropout,
p_prenet_dropout=config.model.p_prenet_dropout,
p_attention_dropout=config.model.p_attention_dropout,
p_decoder_dropout=config.model.p_decoder_dropout,
p_postnet_dropout=config.model.p_postnet_dropout,
d_global_condition=config.model.d_global_condition,
use_stop_token=config.model.use_stop_token)
checkpoint.load_parameters(model, checkpoint_path=checkpoint_path)
return model
class Tacotron2Loss(nn.Layer):
""" Tacotron2 Loss module
"""
def __init__(self,
use_stop_token_loss=True,
use_guided_attention_loss=False,
sigma=0.2):
"""Tacotron 2 Criterion.
Args:
use_stop_token_loss (bool, optional): Whether to use a loss for stop token prediction. Defaults to True.
use_guided_attention_loss (bool, optional): Whether to use a loss for attention weights. Defaults to False.
sigma (float, optional): Hyper-parameter sigma for guided attention loss. Defaults to 0.2.
"""
super().__init__()
self.spec_criterion = nn.MSELoss()
self.use_stop_token_loss = use_stop_token_loss
self.use_guided_attention_loss = use_guided_attention_loss
self.attn_criterion = guided_attention_loss
self.stop_criterion = nn.BCEWithLogitsLoss()
self.sigma = sigma
def forward(self,
mel_outputs,
mel_outputs_postnet,
mel_targets,
attention_weights=None,
slens=None,
plens=None,
stop_logits=None):
"""Calculate tacotron2 loss.
Parameters
----------
mel_outputs: Tensor [shape=(B, T_mel, C)]
Output mel spectrogram sequence.
mel_outputs_postnet: Tensor [shape(B, T_mel, C)]
Output mel spectrogram sequence after postnet.
mel_targets: Tensor [shape=(B, T_mel, C)]
Target mel spectrogram sequence.
attention_weights: Tensor [shape=(B, T_mel, T_enc)]
Attention weights. This should be provided when
`use_guided_attention_loss` is True.
slens: Tensor [shape=(B,)]
Number of frames of mel spectrograms. This should be provided when
`use_guided_attention_loss` is True.
plens: Tensor [shape=(B, )]
Number of text or phone ids of each utterance. This should be
provided when `use_guided_attention_loss` is True.
stop_logits: Tensor [shape=(B, T_mel)]
Stop logits of each mel spectrogram frame. This should be provided
when `use_stop_token_loss` is True.
Returns
-------
losses : Dict[str, Tensor]
            loss: the sum of the losses below;
            mel_loss: MSE loss computed from mel_targets and mel_outputs;
            post_mel_loss: MSE loss computed from mel_targets and mel_outputs_postnet;
guided_attn_loss: Guided attention loss for attention weights;
stop_loss: Binary cross entropy loss for stop token prediction.
"""
mel_loss = self.spec_criterion(mel_outputs, mel_targets)
post_mel_loss = self.spec_criterion(mel_outputs_postnet, mel_targets)
total_loss = mel_loss + post_mel_loss
if self.use_guided_attention_loss:
gal_loss = self.attn_criterion(attention_weights, slens, plens,
self.sigma)
total_loss += gal_loss
if self.use_stop_token_loss:
T_dec = mel_targets.shape[1]
stop_labels = F.one_hot(slens - 1, num_classes=T_dec)
stop_token_loss = self.stop_criterion(stop_logits, stop_labels)
total_loss += stop_token_loss
losses = {
"loss": total_loss,
"mel_loss": mel_loss,
"post_mel_loss": post_mel_loss
}
if self.use_guided_attention_loss:
losses["guided_attn_loss"] = gal_loss
if self.use_stop_token_loss:
losses["stop_loss"] = stop_token_loss
return losses
...
@@ -11,3 +11,5 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+from .wavernn import *
+from .wavernn_updater import *
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import time
from typing import List
import numpy as np
import paddle
from paddle import nn
from paddle.nn import functional as F
from paddlespeech.t2s.audio.codec import decode_mu_law
from paddlespeech.t2s.modules.losses import sample_from_discretized_mix_logistic
from paddlespeech.t2s.modules.nets_utils import initialize
from paddlespeech.t2s.modules.upsample import Stretch2D
class ResBlock(nn.Layer):
def __init__(self, dims):
super().__init__()
self.conv1 = nn.Conv1D(dims, dims, kernel_size=1, bias_attr=False)
self.conv2 = nn.Conv1D(dims, dims, kernel_size=1, bias_attr=False)
self.batch_norm1 = nn.BatchNorm1D(dims)
self.batch_norm2 = nn.BatchNorm1D(dims)
def forward(self, x):
'''
conv -> bn -> relu -> conv -> bn + residual connection
'''
residual = x
x = self.conv1(x)
x = self.batch_norm1(x)
x = F.relu(x)
x = self.conv2(x)
x = self.batch_norm2(x)
return x + residual
class MelResNet(nn.Layer):
def __init__(self,
res_blocks: int=10,
compute_dims: int=128,
res_out_dims: int=128,
aux_channels: int=80,
aux_context_window: int=0):
super().__init__()
k_size = aux_context_window * 2 + 1
        # note: the time dimension shrinks by aux_context_window * 2 (conv_in has no padding)
self.conv_in = nn.Conv1D(
aux_channels, compute_dims, kernel_size=k_size, bias_attr=False)
self.batch_norm = nn.BatchNorm1D(compute_dims)
self.layers = nn.LayerList()
for _ in range(res_blocks):
self.layers.append(ResBlock(compute_dims))
self.conv_out = nn.Conv1D(compute_dims, res_out_dims, kernel_size=1)
def forward(self, x):
'''
Parameters
----------
x : Tensor
Input tensor (B, in_dims, T).
Returns
----------
Tensor
Output tensor (B, res_out_dims, T).
'''
x = self.conv_in(x)
x = self.batch_norm(x)
x = F.relu(x)
for f in self.layers:
x = f(x)
x = self.conv_out(x)
return x
class UpsampleNetwork(nn.Layer):
def __init__(self,
aux_channels: int=80,
upsample_scales: List[int]=[4, 5, 3, 5],
compute_dims: int=128,
res_blocks: int=10,
res_out_dims: int=128,
aux_context_window: int=2):
super().__init__()
        # total_scale is the overall upsampling factor
total_scale = np.prod(upsample_scales)
# TODO pad*total_scale is numpy.int64
self.indent = int(aux_context_window * total_scale)
self.resnet = MelResNet(
res_blocks=res_blocks,
aux_channels=aux_channels,
compute_dims=compute_dims,
res_out_dims=res_out_dims,
aux_context_window=aux_context_window)
self.resnet_stretch = Stretch2D(total_scale, 1)
self.up_layers = nn.LayerList()
for scale in upsample_scales:
k_size = (1, scale * 2 + 1)
padding = (0, scale)
stretch = Stretch2D(scale, 1)
conv = nn.Conv2D(
1, 1, kernel_size=k_size, padding=padding, bias_attr=False)
weight_ = paddle.full_like(conv.weight, 1. / k_size[1])
conv.weight.set_value(weight_)
self.up_layers.append(stretch)
self.up_layers.append(conv)
def forward(self, m):
'''
Parameters
----------
        m : Tensor
            Input mel spectrogram (B, C_aux, T).
        Returns
        ----------
        Tensor
            Output tensor (B, (T - 2 * pad) * prod(upsample_scales), C_aux).
        Tensor
            Output tensor (B, (T - 2 * pad) * prod(upsample_scales), res_out_dims).
'''
# aux: [B, C_aux, T]
# -> [B, res_out_dims, T - 2 * aux_context_window]
# -> [B, 1, res_out_dims, T - 2 * aux_context_window]
aux = self.resnet(m).unsqueeze(1)
        # aux: [B, 1, res_out_dims, T - 2 * aux_context_window]
        # -> [B, 1, res_out_dims, (T - 2 * pad) * prod(upsample_scales)]
        aux = self.resnet_stretch(aux)
        # aux: [B, 1, res_out_dims, T * prod(upsample_scales)]
        # -> [B, res_out_dims, T * prod(upsample_scales)]
aux = aux.squeeze(1)
# m: [B, C_aux, T] -> [B, 1, C_aux, T]
m = m.unsqueeze(1)
for f in self.up_layers:
m = f(m)
        # m: [B, 1, C_aux, T * prod(upsample_scales)]
        # -> [B, C_aux, T * prod(upsample_scales)]
        # -> [B, C_aux, (T - 2 * pad) * prod(upsample_scales)]
        m = m.squeeze(1)[:, :, self.indent:-self.indent]
        # m: [B, (T - 2 * pad) * prod(upsample_scales), C_aux]
        # aux: [B, (T - 2 * pad) * prod(upsample_scales), res_out_dims]
return m.transpose([0, 2, 1]), aux.transpose([0, 2, 1])
class WaveRNN(nn.Layer):
def __init__(
self,
rnn_dims: int=512,
fc_dims: int=512,
bits: int=9,
aux_context_window: int=2,
upsample_scales: List[int]=[4, 5, 3, 5],
aux_channels: int=80,
compute_dims: int=128,
res_out_dims: int=128,
res_blocks: int=10,
hop_length: int=300,
sample_rate: int=24000,
mode='RAW',
init_type: str="xavier_uniform", ):
'''
Parameters
----------
rnn_dims : int, optional
Hidden dims of RNN Layers.
fc_dims : int, optional
Dims of FC Layers.
bits : int, optional
bit depth of signal.
aux_context_window : int, optional
The context window size of the first convolution applied to the
auxiliary input, by default 2
upsample_scales : List[int], optional
Upsample scales of the upsample network.
aux_channels : int, optional
Auxiliary channel of the residual blocks.
compute_dims : int, optional
Dims of Conv1D in MelResNet.
res_out_dims : int, optional
Dims of output in MelResNet.
res_blocks : int, optional
Number of residual blocks.
mode : str, optional
Output mode of the WaveRNN vocoder. `MOL` for Mixture of Logistic Distribution,
and `RAW` for quantized bits as the model's output.
init_type : str
How to initialize parameters.
'''
super().__init__()
self.mode = mode
self.aux_context_window = aux_context_window
if self.mode == 'RAW':
self.n_classes = 2**bits
elif self.mode == 'MOL':
self.n_classes = 10 * 3
        else:
            raise RuntimeError('Unknown model mode value - ', self.mode)
# List of rnns to call 'flatten_parameters()' on
self._to_flatten = []
self.rnn_dims = rnn_dims
self.aux_dims = res_out_dims // 4
self.hop_length = hop_length
self.sample_rate = sample_rate
# initialize parameters
initialize(self, init_type)
self.upsample = UpsampleNetwork(
aux_channels=aux_channels,
upsample_scales=upsample_scales,
compute_dims=compute_dims,
res_blocks=res_blocks,
res_out_dims=res_out_dims,
aux_context_window=aux_context_window)
self.I = nn.Linear(aux_channels + self.aux_dims + 1, rnn_dims)
self.rnn1 = nn.GRU(rnn_dims, rnn_dims)
self.rnn2 = nn.GRU(rnn_dims + self.aux_dims, rnn_dims)
self._to_flatten += [self.rnn1, self.rnn2]
self.fc1 = nn.Linear(rnn_dims + self.aux_dims, fc_dims)
self.fc2 = nn.Linear(fc_dims + self.aux_dims, fc_dims)
self.fc3 = nn.Linear(fc_dims, self.n_classes)
# Avoid fragmentation of RNN parameters and associated warning
self._flatten_parameters()
nn.initializer.set_global_initializer(None)
def forward(self, x, c):
'''
Parameters
----------
x : Tensor
wav sequence, [B, T]
c : Tensor
mel spectrogram [B, C_aux, T']
T = (T' - 2 * aux_context_window ) * hop_length
Returns
----------
Tensor
[B, T, n_classes]
'''
# Although we `_flatten_parameters()` on init, when using DataParallel
# the model gets replicated, making it no longer guaranteed that the
# weights are contiguous in GPU memory. Hence, we must call it again
self._flatten_parameters()
bsize = paddle.shape(x)[0]
h1 = paddle.zeros([1, bsize, self.rnn_dims])
h2 = paddle.zeros([1, bsize, self.rnn_dims])
# c: [B, T, C_aux]
# aux: [B, T, res_out_dims]
c, aux = self.upsample(c)
aux_idx = [self.aux_dims * i for i in range(5)]
a1 = aux[:, :, aux_idx[0]:aux_idx[1]]
a2 = aux[:, :, aux_idx[1]:aux_idx[2]]
a3 = aux[:, :, aux_idx[2]:aux_idx[3]]
a4 = aux[:, :, aux_idx[3]:aux_idx[4]]
x = paddle.concat([x.unsqueeze(-1), c, a1], axis=2)
x = self.I(x)
res = x
x, _ = self.rnn1(x, h1)
x = x + res
res = x
x = paddle.concat([x, a2], axis=2)
x, _ = self.rnn2(x, h2)
x = x + res
x = paddle.concat([x, a3], axis=2)
x = F.relu(self.fc1(x))
x = paddle.concat([x, a4], axis=2)
x = F.relu(self.fc2(x))
return self.fc3(x)
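    # A minimal usage sketch (illustrative only; shapes follow the docstring above).
    # With the defaults (hop_length=300, aux_context_window=2), a mel input of
    # T'=10 frames pairs with a wav input of (10 - 2 * 2) * 300 = 2400 samples:
    #   model = WaveRNN()                      # defaults assumed
    #   wav = paddle.randn([2, 2400])          # (B, T)
    #   mel = paddle.randn([2, 80, 10])        # (B, C_aux, T')
    #   logits = model(wav, mel)               # (B, 2400, n_classes), n_classes = 2**bits in 'RAW' mode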
@paddle.no_grad()
def generate(self,
c,
batched: bool=True,
target: int=12000,
overlap: int=600,
mu_law: bool=True,
gen_display: bool=False):
"""
Parameters
----------
c : Tensor
input mels, (T', C_aux)
batched : bool
generate in batch or not
target : int
target number of samples to be generated in each batch entry
overlap : int
number of samples for crossfading between batches
        mu_law : bool
            use mu-law decoding on the generated samples or not
        gen_display : bool
            print a progress bar while generating or not
        Returns
        ----------
        Tensor
            Generated wav sequence, shape (T' * prod(upsample_scales), C_out).
        """
self.eval()
mu_law = mu_law if self.mode == 'RAW' else False
output = []
start = time.time()
# pseudo batch
# (T, C_aux) -> (1, C_aux, T)
c = paddle.transpose(c, [1, 0]).unsqueeze(0)
T = paddle.shape(c)[-1]
wave_len = T * self.hop_length
# TODO remove two transpose op by modifying function pad_tensor
c = self.pad_tensor(
c.transpose([0, 2, 1]), pad=self.aux_context_window,
side='both').transpose([0, 2, 1])
c, aux = self.upsample(c)
if batched:
# (num_folds, target + 2 * overlap, features)
c = self.fold_with_overlap(c, target, overlap)
aux = self.fold_with_overlap(aux, target, overlap)
        # for dygraph-to-static-graph conversion: if `seq_len` were taken from
        # `b_size, seq_len, _ = paddle.shape(c)` and used in the for loop below,
        # the loop would not be converted into a TensorArray
        # see https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/04_dygraph_to_static/case_analysis_cn.html#list-lodtensorarray
        # b_size, seq_len, _ = paddle.shape(c)
b_size = paddle.shape(c)[0]
seq_len = paddle.shape(c)[1]
h1 = paddle.zeros([b_size, self.rnn_dims])
h2 = paddle.zeros([b_size, self.rnn_dims])
x = paddle.zeros([b_size, 1])
d = self.aux_dims
aux_split = [aux[:, :, d * i:d * (i + 1)] for i in range(4)]
for i in range(seq_len):
m_t = c[:, i, :]
# for dygraph to static graph
# a1_t, a2_t, a3_t, a4_t = (a[:, i, :] for a in aux_split)
a1_t = aux_split[0][:, i, :]
a2_t = aux_split[1][:, i, :]
a3_t = aux_split[2][:, i, :]
a4_t = aux_split[3][:, i, :]
x = paddle.concat([x, m_t, a1_t], axis=1)
x = self.I(x)
# use GRUCell here
h1, _ = self.rnn1[0].cell(x, h1)
x = x + h1
inp = paddle.concat([x, a2_t], axis=1)
# use GRUCell here
h2, _ = self.rnn2[0].cell(inp, h2)
x = x + h2
x = paddle.concat([x, a3_t], axis=1)
x = F.relu(self.fc1(x))
x = paddle.concat([x, a4_t], axis=1)
x = F.relu(self.fc2(x))
logits = self.fc3(x)
if self.mode == 'MOL':
sample = sample_from_discretized_mix_logistic(
logits.unsqueeze(0).transpose([0, 2, 1]))
output.append(sample.reshape([-1]))
x = sample.transpose([1, 0, 2])
elif self.mode == 'RAW':
posterior = F.softmax(logits, axis=1)
distrib = paddle.distribution.Categorical(posterior)
                # corresponds to the operation [np.floor((fx + 1) / 2 * mu + 0.5)] in encode_mu_law
                # distrib.sample([1])[0].cast('float32'): [0, 2**bits - 1]
                # sample: [-1, 1]
sample = 2 * distrib.sample([1])[0].cast('float32') / (
self.n_classes - 1.) - 1.
output.append(sample)
x = sample.unsqueeze(-1)
else:
raise RuntimeError('Unknown model mode value - ', self.mode)
if gen_display:
if i % 1000 == 0:
self.gen_display(i, int(seq_len), int(b_size), start)
output = paddle.stack(output).transpose([1, 0])
if mu_law:
output = decode_mu_law(output, self.n_classes, False)
if batched:
output = self.xfade_and_unfold(output, target, overlap)
else:
output = output[0]
# Fade-out at the end to avoid signal cutting out suddenly
fade_out = paddle.linspace(1, 0, 10 * self.hop_length)
output = output[:wave_len]
output[-10 * self.hop_length:] *= fade_out
self.train()
        # add the C_out dimension
return output.unsqueeze(-1)
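    # Sketch of the mu-law companding assumed by the 'RAW' path above (the actual
    # encode_mu_law / decode_mu_law helpers live elsewhere in this package):
    #   encode: fx = sign(x) * log(1 + mu * |x|) / log(1 + mu), quantized with
    #           np.floor((fx + 1) / 2 * mu + 0.5), where mu = 2**bits - 1
    #   decode: x = sign(y) / mu * ((1 + mu) ** |y| - 1), with y rescaled to [-1, 1]
    # mu-law decoding is only applied when mode == 'RAW', matching the
    # `mu_law = mu_law if self.mode == 'RAW' else False` guard at the top of generate().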
def _flatten_parameters(self):
[m.flatten_parameters() for m in self._to_flatten]
def pad_tensor(self, x, pad, side='both'):
        '''
        Parameters
        ----------
        x : Tensor
            mel, [1, n_frames, 80]
        pad : int
            number of frames to pad
        side : str
            'both', 'before' or 'after'
        Returns
        ----------
        Tensor
            padded tensor, [1, n_frames + pad (or 2 * pad), 80]
        '''
b, t, _ = paddle.shape(x)
# for dygraph to static graph
c = x.shape[-1]
total = t + 2 * pad if side == 'both' else t + pad
padded = paddle.zeros([b, total, c])
if side == 'before' or side == 'both':
padded[:, pad:pad + t, :] = x
elif side == 'after':
padded[:, :t, :] = x
return padded
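    # Illustrative example (shapes assumed): pad_tensor(x, pad=2, side='both') on a
    # [1, 10, 80] mel returns a [1, 14, 80] tensor whose first and last two frames
    # are zeros, i.e. padded[:, 2:12, :] equals x.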
def fold_with_overlap(self, x, target, overlap):
'''
Fold the tensor with overlap for quick batched inference.
Overlap will be used for crossfading in xfade_and_unfold()
Parameters
----------
x : Tensor
Upsampled conditioning features. mels or aux
shape=(1, T, features)
mels: [1, T, 80]
aux: [1, T, 128]
target : int
Target timesteps for each index of batch
overlap : int
Timesteps for both xfade and rnn warmup
overlap = hop_length * 2
Returns
----------
Tensor
shape=(num_folds, target + 2 * overlap, features)
            num_folds = (time_seq - overlap) // (target + overlap)
mel: [num_folds, target + 2 * overlap, 80]
aux: [num_folds, target + 2 * overlap, 128]
Details
----------
x = [[h1, h2, ... hn]]
Where each h is a vector of conditioning features
Eg: target=2, overlap=1 with x.size(1)=10
folded = [[h1, h2, h3, h4],
[h4, h5, h6, h7],
[h7, h8, h9, h10]]
'''
_, total_len, features = paddle.shape(x)
# Calculate variables needed
num_folds = (total_len - overlap) // (target + overlap)
extended_len = num_folds * (overlap + target) + overlap
remaining = total_len - extended_len
# Pad if some time steps poking out
if remaining != 0:
num_folds += 1
padding = target + 2 * overlap - remaining
x = self.pad_tensor(x, padding, side='after')
folded = paddle.zeros([num_folds, target + 2 * overlap, features])
# Get the values for the folded tensor
for i in range(num_folds):
start = i * (target + overlap)
end = start + target + 2 * overlap
folded[i] = x[0][start:end, :]
return folded
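    # Worked example (mirrors the docstring above): with total_len=10, target=2,
    # overlap=1 -> num_folds = (10 - 1) // (2 + 1) = 3, extended_len = 3 * 3 + 1 = 10,
    # remaining = 0, so each fold spans target + 2 * overlap = 4 timesteps:
    #   fold 0: x[0][0:4], fold 1: x[0][3:7], fold 2: x[0][6:10]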
def xfade_and_unfold(self, y, target: int=12000, overlap: int=600):
''' Applies a crossfade and unfolds into a 1d array.
Parameters
----------
y : Tensor
Batched sequences of audio samples
shape=(num_folds, target + 2 * overlap)
dtype=paddle.float32
        target : int
            Target timesteps for each index of batch
        overlap : int
            Timesteps for both xfade and rnn warmup
Returns
----------
Tensor
audio samples in a 1d array
shape=(total_len)
dtype=paddle.float32
Details
----------
y = [[seq1],
[seq2],
[seq3]]
Apply a gain envelope at both ends of the sequences
y = [[seq1_in, seq1_target, seq1_out],
[seq2_in, seq2_target, seq2_out],
[seq3_in, seq3_target, seq3_out]]
Stagger and add up the groups of samples:
[seq1_in, seq1_target, (seq1_out + seq2_in), seq2_target, ...]
'''
# num_folds = (total_len - overlap) // (target + overlap)
num_folds, length = paddle.shape(y)
target = length - 2 * overlap
total_len = num_folds * (target + overlap) + overlap
        # Need some silence for the RNN warmup
        silence_len = overlap // 2
        fade_len = overlap - silence_len
        silence = paddle.zeros([silence_len], dtype=paddle.float32)
        linear = paddle.ones([fade_len], dtype=paddle.float32)
        # Equal power crossfade
        # fade_in increases from 0 to 1, fade_out decreases from 1 to 0
        t = paddle.linspace(-1, 1, fade_len, dtype=paddle.float32)
        fade_in = paddle.sqrt(0.5 * (1 + t))
        fade_out = paddle.sqrt(0.5 * (1 - t))
        # Concat the silence to the fades
        fade_out = paddle.concat([linear, fade_out])
        fade_in = paddle.concat([silence, fade_in])
# Apply the gain to the overlap samples
y[:, :overlap] *= fade_in
y[:, -overlap:] *= fade_out
unfolded = paddle.zeros([total_len], dtype=paddle.float32)
# Loop to add up all the samples
for i in range(num_folds):
start = i * (target + overlap)
end = start + target + 2 * overlap
unfolded[start:end] += y[i]
return unfolded
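    # Equal-power property used above (a quick numeric check, not in the original):
    # for any t in [-1, 1], fade_in**2 + fade_out**2 = 0.5 * (1 + t) + 0.5 * (1 - t) = 1,
    # so the summed overlap region keeps roughly constant energy across folds.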
def gen_display(self, i, seq_len, b_size, start):
gen_rate = (i + 1) / (time.time() - start) * b_size / 1000
pbar = self.progbar(i, seq_len)
msg = f'| {pbar} {i*b_size}/{seq_len*b_size} | Batch Size: {b_size} | Gen Rate: {gen_rate:.1f}kHz | '
sys.stdout.write(f"\r{msg}")
    def progbar(self, i, n, size=16):
        done = int(i * size) // n
        bar = ''
        for idx in range(size):
            bar += '█' if idx <= done else '░'
        return bar
class WaveRNNInference(nn.Layer):
def __init__(self, normalizer, wavernn):
super().__init__()
self.normalizer = normalizer
self.wavernn = wavernn
def forward(self,
logmel,
batched: bool=True,
target: int=12000,
overlap: int=600,
mu_law: bool=True,
gen_display: bool=False):
normalized_mel = self.normalizer(logmel)
wav = self.wavernn.generate(
normalized_mel, )
# batched=batched,
# target=target,
# overlap=overlap,
# mu_law=mu_law,
# gen_display=gen_display)
return wav
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
from pathlib import Path
import paddle
import soundfile as sf
from paddle import distributed as dist
from paddle.io import DataLoader
from paddle.nn import Layer
from paddle.optimizer import Optimizer
from paddlespeech.t2s.training.extensions.evaluator import StandardEvaluator
from paddlespeech.t2s.training.reporter import report
from paddlespeech.t2s.training.updaters.standard_updater import StandardUpdater
logging.basicConfig(
format='%(asctime)s [%(levelname)s] [%(filename)s:%(lineno)d] %(message)s',
datefmt='[%Y-%m-%d %H:%M:%S]')
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
def calculate_grad_norm(parameters, norm_type: str=2):
    '''
    Calculate the gradient norm of a model's parameters.
    Parameters
    ----------
    parameters:
        model's parameters
    norm_type: str
        type of the norm, e.g. 2 for the L2 norm
    Returns
    ------------
    Tensor
        grad_norm
    '''
grad_list = [
paddle.to_tensor(p.grad) for p in parameters if p.grad is not None
]
norm_list = paddle.stack(
[paddle.norm(grad, norm_type) for grad in grad_list])
total_norm = paddle.norm(norm_list)
return total_norm
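# Illustrative usage (assumed training loop): after `loss.backward()`,
#   grad_norm = calculate_grad_norm(model.parameters(), norm_type=2)
# computes the L2 norm of each parameter's gradient and then the L2 norm over
# those per-parameter norms; this is the value reported as "train/grad_norm"
# by the updater below.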
# for save name in gen_valid_samples()
ITERATION = 0
class WaveRNNUpdater(StandardUpdater):
def __init__(self,
model: Layer,
optimizer: Optimizer,
criterion: Layer,
dataloader: DataLoader,
init_state=None,
output_dir: Path=None,
mode='RAW'):
super().__init__(model, optimizer, dataloader, init_state=None)
self.criterion = criterion
# self.scheduler = scheduler
log_file = output_dir / 'worker_{}.log'.format(dist.get_rank())
self.filehandler = logging.FileHandler(str(log_file))
logger.addHandler(self.filehandler)
self.logger = logger
self.msg = ""
self.mode = mode
def update_core(self, batch):
self.msg = "Rank: {}, ".format(dist.get_rank())
losses_dict = {}
# parse batch
self.model.train()
self.optimizer.clear_grad()
wav, y, mel = batch
y_hat = self.model(wav, mel)
        if self.mode == 'RAW':
            y_hat = y_hat.transpose([0, 2, 1]).unsqueeze(-1)
        elif self.mode == 'MOL':
            # cast the target to float for the discretized mix logistic loss;
            # keep the model's mixture logits as y_hat
            y = paddle.cast(y, dtype='float32')
        y = y.unsqueeze(-1)
loss = self.criterion(y_hat, y)
loss.backward()
grad_norm = float(
calculate_grad_norm(self.model.parameters(), norm_type=2))
self.optimizer.step()
report("train/loss", float(loss))
report("train/grad_norm", float(grad_norm))
losses_dict["loss"] = float(loss)
losses_dict["grad_norm"] = float(grad_norm)
self.msg += ', '.join('{}: {:>.6f}'.format(k, v)
for k, v in losses_dict.items())
global ITERATION
ITERATION = self.state.iteration + 1
class WaveRNNEvaluator(StandardEvaluator):
def __init__(self,
model: Layer,
criterion: Layer,
                 dataloader: DataLoader,
output_dir: Path=None,
valid_generate_loader=None,
config=None):
super().__init__(model, dataloader)
log_file = output_dir / 'worker_{}.log'.format(dist.get_rank())
self.filehandler = logging.FileHandler(str(log_file))
logger.addHandler(self.filehandler)
self.logger = logger
self.msg = ""
self.criterion = criterion
self.valid_generate_loader = valid_generate_loader
self.config = config
self.mode = config.model.mode
self.valid_samples_dir = output_dir / "valid_samples"
self.valid_samples_dir.mkdir(parents=True, exist_ok=True)
def evaluate_core(self, batch):
self.msg = "Evaluate: "
losses_dict = {}
# parse batch
wav, y, mel = batch
y_hat = self.model(wav, mel)
        if self.mode == 'RAW':
            y_hat = y_hat.transpose([0, 2, 1]).unsqueeze(-1)
        elif self.mode == 'MOL':
            # cast the target to float for the discretized mix logistic loss;
            # keep the model's mixture logits as y_hat
            y = paddle.cast(y, dtype='float32')
        y = y.unsqueeze(-1)
loss = self.criterion(y_hat, y)
report("eval/loss", float(loss))
losses_dict["loss"] = float(loss)
self.msg += ', '.join('{}: {:>.6f}'.format(k, v)
for k, v in losses_dict.items())
self.logger.info(self.msg)
def gen_valid_samples(self):
for i, item in enumerate(self.valid_generate_loader):
if i >= self.config.generate_num:
break
print(
'\n| Generating: {}/{}'.format(i + 1, self.config.generate_num))
mel = item['feats']
wav = item['wave']
wav = wav.squeeze(0)
origin_save_path = self.valid_samples_dir / '{}_steps_{}_target.wav'.format(
self.iteration, i)
sf.write(origin_save_path, wav.numpy(), samplerate=self.config.fs)
if self.config.inference.gen_batched:
batch_str = 'gen_batched_target{}_overlap{}'.format(
self.config.inference.target, self.config.inference.overlap)
else:
batch_str = 'gen_not_batched'
gen_save_path = str(self.valid_samples_dir /
'{}_steps_{}_{}.wav'.format(self.iteration, i,
batch_str))
# (1, T, C_aux) -> (T, C_aux)
mel = mel.squeeze(0)
gen_sample = self.model.generate(
mel, self.config.inference.gen_batched,
self.config.inference.target, self.config.inference.overlap,
self.config.mu_law)
sf.write(
gen_save_path, gen_sample.numpy(), samplerate=self.config.fs)
def __call__(self, trainer=None):
summary = self.evaluate()
for k, v in summary.items():
report(k, v)
        # generate samples at the end of evaluation
self.iteration = ITERATION
if self.iteration % self.config.gen_eval_samples_interval_steps == 0:
self.gen_valid_samples()
@@ -14,6 +14,7 @@
import math
import librosa
+import numpy as np
import paddle
from paddle import nn
from paddle.fluid.layers import sequence_mask
@@ -23,6 +24,145 @@ from scipy import signal
from paddlespeech.t2s.modules.nets_utils import make_non_pad_mask
# Losses for WaveRNN
def log_sum_exp(x):
""" numerically stable log_sum_exp implementation that prevents overflow """
# TF ordering
axis = len(x.shape) - 1
m = paddle.max(x, axis=axis)
m2 = paddle.max(x, axis=axis, keepdim=True)
return m + paddle.log(paddle.sum(paddle.exp(x - m2), axis=axis))
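# Quick sanity sketch (not in the original): applying log_sum_exp over the last axis
# of x = paddle.to_tensor([[1000.0, 1000.0]]) returns ~1000.6931 (= 1000 + log 2)
# without overflowing, whereas paddle.log(paddle.exp(x).sum(-1)) would return inf.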
# It is adapted from https://github.com/r9y9/wavenet_vocoder/blob/master/wavenet_vocoder/mixture.py
def discretized_mix_logistic_loss(y_hat,
y,
num_classes=65536,
log_scale_min=None,
reduce=True):
if log_scale_min is None:
log_scale_min = float(np.log(1e-14))
y_hat = y_hat.transpose([0, 2, 1])
assert y_hat.dim() == 3
assert y_hat.shape[1] % 3 == 0
nr_mix = y_hat.shape[1] // 3
# (B x T x C)
y_hat = y_hat.transpose([0, 2, 1])
# unpack parameters. (B, T, num_mixtures) x 3
logit_probs = y_hat[:, :, :nr_mix]
means = y_hat[:, :, nr_mix:2 * nr_mix]
log_scales = paddle.clip(
y_hat[:, :, 2 * nr_mix:3 * nr_mix], min=log_scale_min)
# B x T x 1 -> B x T x num_mixtures
y = y.expand_as(means)
centered_y = paddle.cast(y, dtype=paddle.get_default_dtype()) - means
inv_stdv = paddle.exp(-log_scales)
plus_in = inv_stdv * (centered_y + 1. / (num_classes - 1))
cdf_plus = F.sigmoid(plus_in)
min_in = inv_stdv * (centered_y - 1. / (num_classes - 1))
cdf_min = F.sigmoid(min_in)
# log probability for edge case of 0 (before scaling)
# equivalent: torch.log(F.sigmoid(plus_in))
# softplus: log(1+ e^{-x})
log_cdf_plus = plus_in - F.softplus(plus_in)
# log probability for edge case of 255 (before scaling)
# equivalent: (1 - F.sigmoid(min_in)).log()
log_one_minus_cdf_min = -F.softplus(min_in)
# probability for all other cases
cdf_delta = cdf_plus - cdf_min
mid_in = inv_stdv * centered_y
# log probability in the center of the bin, to be used in extreme cases
# (not actually used in our code)
log_pdf_mid = mid_in - log_scales - 2. * F.softplus(mid_in)
# TODO: cdf_delta <= 1e-5 actually can happen. How can we choose the value
# for num_classes=65536 case? 1e-7? not sure..
inner_inner_cond = cdf_delta > 1e-5
inner_inner_cond = paddle.cast(
inner_inner_cond, dtype=paddle.get_default_dtype())
# inner_inner_out = inner_inner_cond * \
# paddle.log(paddle.clip(cdf_delta, min=1e-12)) + \
# (1. - inner_inner_cond) * (log_pdf_mid - np.log((num_classes - 1) / 2))
inner_inner_out = inner_inner_cond * paddle.log(
paddle.clip(cdf_delta, min=1e-12)) + (1. - inner_inner_cond) * (
log_pdf_mid - np.log((num_classes - 1) / 2))
inner_cond = y > 0.999
inner_cond = paddle.cast(inner_cond, dtype=paddle.get_default_dtype())
inner_out = inner_cond * log_one_minus_cdf_min + (1. - inner_cond
) * inner_inner_out
cond = y < -0.999
cond = paddle.cast(cond, dtype=paddle.get_default_dtype())
log_probs = cond * log_cdf_plus + (1. - cond) * inner_out
log_probs = log_probs + F.log_softmax(logit_probs, -1)
if reduce:
return -paddle.mean(log_sum_exp(log_probs))
else:
return -log_sum_exp(log_probs).unsqueeze(-1)
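# Minimal usage sketch (shapes assumed from the code above, names hypothetical):
# with 10 mixtures the model emits 3 * 10 = 30 channels per timestep.
def _demo_discretized_mix_logistic_loss():
    """Hypothetical smoke test: scalar loss from random inputs."""
    import paddle
    y_hat = paddle.randn([2, 100, 30])  # (B, T, 3 * nr_mix), matches WaveRNN 'MOL' output
    y = paddle.uniform([2, 100, 1], min=-1., max=1.)  # target waveform in [-1, 1]
    loss = discretized_mix_logistic_loss(y_hat, y, num_classes=65536)
    return loss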
def sample_from_discretized_mix_logistic(y, log_scale_min=None):
"""
Sample from discretized mixture of logistic distributions
Parameters
----------
y : Tensor
(B, C, T)
log_scale_min : float
Log scale minimum value
Returns
----------
Tensor
sample in range of [-1, 1].
"""
if log_scale_min is None:
log_scale_min = float(np.log(1e-14))
assert y.shape[1] % 3 == 0
nr_mix = y.shape[1] // 3
# (B, T, C)
y = y.transpose([0, 2, 1])
logit_probs = y[:, :, :nr_mix]
# sample mixture indicator from softmax
temp = paddle.uniform(
logit_probs.shape, dtype=logit_probs.dtype, min=1e-5, max=1.0 - 1e-5)
temp = logit_probs - paddle.log(-paddle.log(temp))
argmax = paddle.argmax(temp, axis=-1)
# (B, T) -> (B, T, nr_mix)
one_hot = F.one_hot(argmax, nr_mix)
one_hot = paddle.cast(one_hot, dtype=paddle.get_default_dtype())
# select logistic parameters
means = paddle.sum(y[:, :, nr_mix:2 * nr_mix] * one_hot, axis=-1)
log_scales = paddle.clip(
paddle.sum(y[:, :, 2 * nr_mix:3 * nr_mix] * one_hot, axis=-1),
min=log_scale_min)
# sample from logistic & clip to interval
# we don't actually round to the nearest 8bit value when sampling
u = paddle.uniform(means.shape, min=1e-5, max=1.0 - 1e-5)
x = means + paddle.exp(log_scales) * (paddle.log(u) - paddle.log(1. - u))
    x = paddle.clip(x, min=-1., max=1.)
return x
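# Minimal usage sketch for sampling (shapes assumed, name hypothetical): the WaveRNN
# 'MOL' path above feeds logits of shape (B, C=30, T) and gets back samples in [-1, 1].
def _demo_sample_from_discretized_mix_logistic():
    """Hypothetical smoke test: draw samples from random mixture parameters."""
    import paddle
    y = paddle.randn([2, 30, 100])  # (B, 3 * nr_mix, T)
    x = sample_from_discretized_mix_logistic(y)
    return x  # (B, T), values clipped to [-1, 1]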
# Loss for new Tacotron2
class GuidedAttentionLoss(nn.Layer):
    """Guided attention loss function module.
...
@@ -157,7 +157,7 @@ class AttLoc(nn.Layer):
        paddle.Tensor
            previous attention weights (B, T_max)
        """
-        batch = len(enc_hs_pad)
+        batch = paddle.shape(enc_hs_pad)[0]
        # pre-compute all h outside the decoder loop
        if self.pre_compute_enc_h is None or self.han_mode:
            # (utt, frame, hdim)
@@ -172,33 +172,30 @@ class AttLoc(nn.Layer):
            dec_z = dec_z.reshape([batch, self.dunits])
        # initialize attention weight with uniform dist.
-        if att_prev is None:
+        if paddle.sum(att_prev) == 0:
            # if no bias, 0 0-pad goes 0
            att_prev = 1.0 - make_pad_mask(enc_hs_len)
            att_prev = att_prev / enc_hs_len.unsqueeze(-1)
        # att_prev: (utt, frame) -> (utt, 1, 1, frame)
        # -> (utt, att_conv_chans, 1, frame)
        att_conv = self.loc_conv(att_prev.reshape([batch, 1, 1, self.h_length]))
        # att_conv: (utt, att_conv_chans, 1, frame) -> (utt, frame, att_conv_chans)
        att_conv = att_conv.squeeze(2).transpose([0, 2, 1])
        # att_conv: (utt, frame, att_conv_chans) -> (utt, frame, att_dim)
        att_conv = self.mlp_att(att_conv)
        # dec_z_tiled: (utt, frame, att_dim)
        dec_z_tiled = self.mlp_dec(dec_z).reshape([batch, 1, self.att_dim])
        # dot with gvec
        # (utt, frame, att_dim) -> (utt, frame)
-        e = self.gvec(
-            paddle.tanh(att_conv + self.pre_compute_enc_h +
-                        dec_z_tiled)).squeeze(2)
+        e = paddle.tanh(att_conv + self.pre_compute_enc_h + dec_z_tiled)
+        e = self.gvec(e).squeeze(2)
        # NOTE: consider zero padding when compute w.
        if self.mask is None:
            self.mask = make_pad_mask(enc_hs_len)
        e = masked_fill(e, self.mask, -float("inf"))
        # apply monotonic attention constraint (mainly for TTS)
        if last_attended_idx is not None:
@@ -211,7 +208,6 @@ class AttLoc(nn.Layer):
        # utt x hdim
        c = paddle.sum(
            self.enc_h * w.reshape([batch, self.h_length, 1]), axis=1)
        return c, w
...
@@ -15,7 +15,6 @@
"""Tacotron2 decoder related modules."""
import paddle
import paddle.nn.functional as F
-import six
from paddle import nn
from paddlespeech.t2s.modules.tacotron2.attentions import AttForwardTA
@@ -59,7 +58,7 @@ class Prenet(nn.Layer):
        super().__init__()
        self.dropout_rate = dropout_rate
        self.prenet = nn.LayerList()
-        for layer in six.moves.range(n_layers):
+        for layer in range(n_layers):
            n_inputs = idim if layer == 0 else n_units
            self.prenet.append(
                nn.Sequential(nn.Linear(n_inputs, n_units), nn.ReLU()))
@@ -78,7 +77,7 @@ class Prenet(nn.Layer):
            Batch of output tensors (B, ..., odim).
        """
-        for i in six.moves.range(len(self.prenet)):
+        for i in range(len(self.prenet)):
            # F.dropout introduces randomness; the dropout in tacotron2 must not be removed
            x = F.dropout(self.prenet[i](x))
        return x
@@ -129,7 +128,7 @@ class Postnet(nn.Layer):
        """
        super().__init__()
        self.postnet = nn.LayerList()
-        for layer in six.moves.range(n_layers - 1):
+        for layer in range(n_layers - 1):
            ichans = odim if layer == 0 else n_chans
            ochans = odim if layer == n_layers - 1 else n_chans
            if use_batch_norm:
@@ -196,7 +195,7 @@ class Postnet(nn.Layer):
            Batch of padded output tensor. (B, odim, Tmax).
        """
-        for i in six.moves.range(len(self.postnet)):
+        for i in range(len(self.postnet)):
            xs = self.postnet[i](xs)
        return xs
@@ -360,7 +359,7 @@ class Decoder(nn.Layer):
        # define lstm network
        prenet_units = prenet_units if prenet_layers != 0 else odim
        self.lstm = nn.LayerList()
-        for layer in six.moves.range(dlayers):
+        for layer in range(dlayers):
            iunits = idim + prenet_units if layer == 0 else dunits
            lstm = nn.LSTMCell(iunits, dunits)
            if zoneout_rate > 0.0:
@@ -437,47 +436,50 @@ class Decoder(nn.Layer):
        # initialize hidden states of decoder
        c_list = [self._zero_state(hs)]
        z_list = [self._zero_state(hs)]
-        for _ in six.moves.range(1, len(self.lstm)):
-            c_list += [self._zero_state(hs)]
-            z_list += [self._zero_state(hs)]
+        for _ in range(1, len(self.lstm)):
+            c_list.append(self._zero_state(hs))
+            z_list.append(self._zero_state(hs))
        prev_out = paddle.zeros([paddle.shape(hs)[0], self.odim])
        # initialize attention
-        prev_att_w = None
+        prev_att_ws = []
+        prev_att_w = paddle.zeros(paddle.shape(hlens))
+        prev_att_ws.append(prev_att_w)
        self.att.reset()
        # loop for an output sequence
        outs, logits, att_ws = [], [], []
        for y in ys.transpose([1, 0, 2]):
            if self.use_att_extra_inputs:
-                att_c, att_w = self.att(hs, hlens, z_list[0], prev_att_w,
+                att_c, att_w = self.att(hs, hlens, z_list[0], prev_att_ws[-1],
                                        prev_out)
            else:
-                att_c, att_w = self.att(hs, hlens, z_list[0], prev_att_w)
+                att_c, att_w = self.att(hs, hlens, z_list[0], prev_att_ws[-1])
            prenet_out = self.prenet(
                prev_out) if self.prenet is not None else prev_out
            xs = paddle.concat([att_c, prenet_out], axis=1)
            # we only use the second output of LSTMCell in paddle
            _, next_hidden = self.lstm[0](xs, (z_list[0], c_list[0]))
            z_list[0], c_list[0] = next_hidden
-            for i in six.moves.range(1, len(self.lstm)):
+            for i in range(1, len(self.lstm)):
                # we only use the second output of LSTMCell in paddle
                _, next_hidden = self.lstm[i](z_list[i - 1],
                                              (z_list[i], c_list[i]))
                z_list[i], c_list[i] = next_hidden
            zcs = (paddle.concat([z_list[-1], att_c], axis=1)
                   if self.use_concate else z_list[-1])
-            outs += [
-                self.feat_out(zcs).reshape([paddle.shape(hs)[0], self.odim, -1])
-            ]
-            logits += [self.prob_out(zcs)]
-            att_ws += [att_w]
+            outs.append(
+                self.feat_out(zcs).reshape([paddle.shape(hs)[0], self.odim, -1
+                                            ]))
+            logits.append(self.prob_out(zcs))
+            att_ws.append(att_w)
            # teacher forcing
            prev_out = y
-            if self.cumulate_att_w and prev_att_w is not None:
+            if self.cumulate_att_w and paddle.sum(prev_att_w) != 0:
                prev_att_w = prev_att_w + att_w  # Note: error when use +=
            else:
                prev_att_w = att_w
+            prev_att_ws.append(prev_att_w)
        # (B, Lmax)
        logits = paddle.concat(logits, axis=1)
        # (B, odim, Lmax)
@@ -552,6 +554,7 @@ class Decoder(nn.Layer):
        .. _`Deep Voice 3`: https://arxiv.org/abs/1710.07654
        """
        # setup
        assert len(paddle.shape(h)) == 2
        hs = h.unsqueeze(0)
        ilens = paddle.shape(h)[0]
@@ -561,13 +564,16 @@ class Decoder(nn.Layer):
        # initialize hidden states of decoder
        c_list = [self._zero_state(hs)]
        z_list = [self._zero_state(hs)]
-        for _ in six.moves.range(1, len(self.lstm)):
-            c_list += [self._zero_state(hs)]
-            z_list += [self._zero_state(hs)]
+        for _ in range(1, len(self.lstm)):
+            c_list.append(self._zero_state(hs))
+            z_list.append(self._zero_state(hs))
        prev_out = paddle.zeros([1, self.odim])
        # initialize attention
-        prev_att_w = None
+        prev_att_ws = []
+        prev_att_w = paddle.zeros([ilens])
+        prev_att_ws.append(prev_att_w)
        self.att.reset()
        # setup for attention constraint
@@ -579,6 +585,7 @@ class Decoder(nn.Layer):
        # loop for an output sequence
        idx = 0
        outs, att_ws, probs = [], [], []
+        prob = paddle.zeros([1])
        while True:
            # updated index
            idx += self.reduction_factor
@@ -589,7 +596,7 @@ class Decoder(nn.Layer):
                    hs,
                    ilens,
                    z_list[0],
-                    prev_att_w,
+                    prev_att_ws[-1],
                    prev_out,
                    last_attended_idx=last_attended_idx,
                    backward_window=backward_window,
@@ -599,19 +606,20 @@ class Decoder(nn.Layer):
                    hs,
                    ilens,
                    z_list[0],
-                    prev_att_w,
+                    prev_att_ws[-1],
                    last_attended_idx=last_attended_idx,
                    backward_window=backward_window,
                    forward_window=forward_window, )
-            att_ws += [att_w]
+            att_ws.append(att_w)
            prenet_out = self.prenet(
                prev_out) if self.prenet is not None else prev_out
            xs = paddle.concat([att_c, prenet_out], axis=1)
            # we only use the second output of LSTMCell in paddle
            _, next_hidden = self.lstm[0](xs, (z_list[0], c_list[0]))
            z_list[0], c_list[0] = next_hidden
-            for i in six.moves.range(1, len(self.lstm)):
+            for i in range(1, len(self.lstm)):
                # we only use the second output of LSTMCell in paddle
                _, next_hidden = self.lstm[i](z_list[i - 1],
                                              (z_list[i], c_list[i]))
@@ -619,28 +627,29 @@ class Decoder(nn.Layer):
            zcs = (paddle.concat([z_list[-1], att_c], axis=1)
                   if self.use_concate else z_list[-1])
            # [(1, odim, r), ...]
-            outs += [self.feat_out(zcs).reshape([1, self.odim, -1])]
+            outs.append(self.feat_out(zcs).reshape([1, self.odim, -1]))
-            # [(r), ...]
-            probs += [F.sigmoid(self.prob_out(zcs))[0]]
+            prob = F.sigmoid(self.prob_out(zcs))[0]
+            probs.append(prob)
            if self.output_activation_fn is not None:
                prev_out = self.output_activation_fn(
                    outs[-1][:, :, -1])  # (1, odim)
            else:
                prev_out = outs[-1][:, :, -1]  # (1, odim)
-            if self.cumulate_att_w and prev_att_w is not None:
+            if self.cumulate_att_w and paddle.sum(prev_att_w) != 0:
                prev_att_w = prev_att_w + att_w  # Note: error when use +=
            else:
                prev_att_w = att_w
+            prev_att_ws.append(prev_att_w)
            if use_att_constraint:
                last_attended_idx = int(att_w.argmax())
-            # check whether to finish generation
-            if sum(paddle.cast(probs[-1] >= threshold,
-                               'int64')) > 0 or idx >= maxlen:
+            if prob >= threshold or idx >= maxlen:
                # check mininum length
                if idx < minlen:
                    continue
+                break
        # (1, odim, L)
        outs = paddle.concat(outs, axis=2)
        if self.postnet is not None:
@@ -650,7 +659,6 @@ class Decoder(nn.Layer):
        outs = outs.transpose([0, 2, 1]).squeeze(0)
        probs = paddle.concat(probs, axis=0)
        att_ws = paddle.concat(att_ws, axis=0)
-                break
        if self.output_activation_fn is not None:
            outs = self.output_activation_fn(outs)
@@ -685,9 +693,9 @@ class Decoder(nn.Layer):
        # initialize hidden states of decoder
        c_list = [self._zero_state(hs)]
        z_list = [self._zero_state(hs)]
-        for _ in six.moves.range(1, len(self.lstm)):
-            c_list += [self._zero_state(hs)]
-            z_list += [self._zero_state(hs)]
+        for _ in range(1, len(self.lstm)):
+            c_list.append(self._zero_state(hs))
+            z_list.append(self._zero_state(hs))
        prev_out = paddle.zeros([paddle.shape(hs)[0], self.odim])
        # initialize attention
@@ -702,14 +710,14 @@ class Decoder(nn.Layer):
                                        prev_out)
            else:
                att_c, att_w = self.att(hs, hlens, z_list[0], prev_att_w)
-            att_ws += [att_w]
+            att_ws.append(att_w)
            prenet_out = self.prenet(
                prev_out) if self.prenet is not None else prev_out
            xs = paddle.concat([att_c, prenet_out], axis=1)
            # we only use the second output of LSTMCell in paddle
            _, next_hidden = self.lstm[0](xs, (z_list[0], c_list[0]))
            z_list[0], c_list[0] = next_hidden
-            for i in six.moves.range(1, len(self.lstm)):
+            for i in range(1, len(self.lstm)):
                z_list[i], c_list[i] = self.lstm[i](z_list[i - 1],
                                                    (z_list[i], c_list[i]))
            # teacher forcing
...
@@ -14,7 +14,6 @@
# Modified from espnet(https://github.com/espnet/espnet)
"""Tacotron2 encoder related modules."""
import paddle
-import six
from paddle import nn
@@ -88,7 +87,7 @@ class Encoder(nn.Layer):
        if econv_layers > 0:
            self.convs = nn.LayerList()
-            for layer in six.moves.range(econv_layers):
+            for layer in range(econv_layers):
                ichans = (embed_dim if layer == 0 and input_layer == "embed"
                          else econv_chans)
                if use_batch_norm:
@@ -130,6 +129,7 @@ class Encoder(nn.Layer):
                direction='bidirectional',
                bias_ih_attr=True,
                bias_hh_attr=True)
+            self.blstm.flatten_parameters()
        else:
            self.blstm = None
@@ -157,7 +157,7 @@ class Encoder(nn.Layer):
        """
        xs = self.embed(xs).transpose([0, 2, 1])
        if self.convs is not None:
-            for i in six.moves.range(len(self.convs)):
+            for i in range(len(self.convs)):
                if self.use_residual:
                    xs += self.convs[i](xs)
                else:
@@ -167,7 +167,8 @@ class Encoder(nn.Layer):
        if not isinstance(ilens, paddle.Tensor):
            ilens = paddle.to_tensor(ilens)
        xs = xs.transpose([0, 2, 1])
-        self.blstm.flatten_parameters()
+        # for dygraph to static graph
+        # self.blstm.flatten_parameters()
        # (B, Tmax, C)
        # see https://www.paddlepaddle.org.cn/documentation/docs/zh/faq/train_cn.html#paddletorch-nn-utils-rnn-pack-padded-sequencetorch-nn-utils-rnn-pad-packed-sequenceapi
        xs, _ = self.blstm(xs, sequence_length=ilens)
@@ -191,6 +192,6 @@ class Encoder(nn.Layer):
        """
        xs = x.unsqueeze(0)
-        ilens = paddle.to_tensor([x.shape[0]])
+        ilens = paddle.shape(x)[0]
        return self.forward(xs, ilens)[0][0]
@@ -41,4 +41,4 @@ def repeat(N, fn):
    MultiSequential
        Repeated model instance.
    """
-    return MultiSequential(* [fn(n) for n in range(N)])
+    return MultiSequential(*[fn(n) for n in range(N)])
@@ -16,3 +16,7 @@ from . import display
from . import layer_tools
from . import mp_tools
from . import scheduler
+def str2bool(str):
+    return True if str.lower() == 'true' else False
@@ -123,9 +123,3 @@ class Collate(object):
        frame_clips = [self.random_crop(mel) for mel in examples]
        batced_clips = np.stack(frame_clips)
        return batced_clips
-if __name__ == "__main__":
-    mydataset = MultiSpeakerMelDataset(
-        Path("/home/chenfeiyu/datasets/SV2TTS/encoder"))
-    print(mydataset.get_example_by_index(0, 10))
### Prepare the environment
-Please follow the instructions shown in [here](../../docs/source/install.md) to install the Deepspeech first.
+Please follow the instructions shown in [here](../../../docs/source/install.md) to install the Deepspeech first.
### File list
└── benchmark              # model name
...
@@ -22,6 +22,7 @@ from sklearn.preprocessing import StandardScaler
from tqdm import tqdm
from paddlespeech.t2s.datasets.data_table import DataTable
+from paddlespeech.t2s.utils import str2bool
def main():
@@ -41,9 +42,6 @@ def main():
        help="path to save statistics. if not provided, "
        "stats will be saved in the above root directory with name stats.npy")
-    def str2bool(str):
-        return True if str.lower() == 'true' else False
    parser.add_argument(
        "--use-relative-path",
        type=str2bool,
...