diff --git a/docs/source/tts/README.md b/docs/source/tts/README.md index 18283cb276b955b9fcfc52c31717e5539345d5be..3d9ee972de4fcba304743e81fd60e9514ef58e00 100644 --- a/docs/source/tts/README.md +++ b/docs/source/tts/README.md @@ -5,20 +5,6 @@ Parakeet aims to provide a flexible, efficient and state-of-the-art text-to-spee
- -## News -- Oct-12-2021, Refector examples code. -- Oct-12-2021, Parallel WaveGAN with LJSpeech. Check [examples/GANVocoder/parallelwave_gan/ljspeech](./examples/GANVocoder/parallelwave_gan/ljspeech). -- Oct-12-2021, FastSpeech2/FastPitch with LJSpeech. Check [examples/fastspeech2/ljspeech](./examples/fastspeech2/ljspeech). -- Sep-14-2021, Reconstruction of TransformerTTS. Check [examples/transformer_tts/ljspeech](./examples/transformer_tts/ljspeech). -- Aug-31-2021, Chinese Text Frontend. Check [examples/text_frontend](./examples/text_frontend). -- Aug-23-2021, FastSpeech2/FastPitch with AISHELL-3. Check [examples/fastspeech2/aishell3](./examples/fastspeech2/aishell3). -- Aug-03-2021, FastSpeech2/FastPitch with CSMSC. Check [examples/fastspeech2/baker](./examples/fastspeech2/baker). -- Jul-19-2021, SpeedySpeech with CSMSC. Check [examples/speedyspeech/baker](./examples/speedyspeech/baker). -- Jul-01-2021, Parallel WaveGAN with CSMSC. Check [examples/GANVocoder/parallelwave_gan/baker](./examples/GANVocoder/parallelwave_gan/baker). -- Jul-01-2021, Montreal-Forced-Aligner. Check [examples/use_mfa](./examples/use_mfa). -- May-07-2021, Voice Cloning in Chinese. Check [examples/tacotron2_aishell3](./examples/tacotron2_aishell3). - ## Overview In order to facilitate exploiting the existing TTS models directly and developing the new ones, Parakeet selects typical models and provides their reference implementations in PaddlePaddle. Further more, Parakeet abstracts the TTS pipeline and standardizes the procedure of data preprocessing, common modules sharing, model configuration, and the process of training and synthesis. The models supported here include Text FrontEnd, end-to-end Acoustic models and Vocoders: @@ -38,50 +24,11 @@ In order to facilitate exploiting the existing TTS models directly and developin - [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558v4.pdf) - [【GE2E】Generalized End-to-End Loss for Speaker Verification](https://arxiv.org/abs/1710.10467) -## Setup -It's difficult to install some dependent libraries for this repo in Windows system, we recommend that you **DO NOT** use Windows system, please use `Linux`. - -Make sure the library `libsndfile1` is installed, e.g., on Ubuntu. - -```bash -sudo apt-get install libsndfile1 -``` -### Install PaddlePaddle -See [install](https://www.paddlepaddle.org.cn/install/quick) for more details. This repo requires PaddlePaddle **2.1.2** or above. - -### Install Parakeet -```bash -git clone https://github.com/PaddlePaddle/Parakeet -cd Parakeet -pip install -e . -``` - -If some python dependent packages cannot be installed successfully, you can run the following script first. -(replace `python3.6` with your own python version) -```bash -sudo apt install -y python3.6-dev -``` - -See [install](https://paddle-parakeet.readthedocs.io/en/latest/install.html) for more details. - -## Examples -Entries to the introduction, and the launch of training and synthsis for different example models: - -- [>>> Chinese Text Frontend](./examples/text_frontend) -- [>>> FastSpeech2/FastPitch](./examples/fastspeech2) -- [>>> Montreal-Forced-Aligner](./examples/use_mfa) -- [>>> Parallel WaveGAN](./examples/GANVocoder/parallelwave_gan) -- [>>> SpeedySpeech](./examples/speedyspeech) -- [>>> Tacotron2_AISHELL3](./examples/tacotron2_aishell3) -- [>>> GE2E](./examples/ge2e) -- [>>> WaveFlow](./examples/waveflow) -- [>>> TransformerTTS](./examples/transformer_tts) -- [>>> Tacotron2](./examples/tacotron2) ## Audio samples -### TTS models (Acoustic Model + Neural Vocoder) -Check our [website](https://paddleparakeet.readthedocs.io/en/latest/demo.html) for audio sampels. + +Check our [website](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html) for audio sampels. ## Released Model diff --git a/examples/aishell3/tts3/README.md b/examples/aishell3/tts3/README.md index 82b69ad8432b8bf9b4649502ffa785de7600edc4..5ab15ffb45e9f99017a5eb9f96e7c9b26d172707 100644 --- a/examples/aishell3/tts3/README.md +++ b/examples/aishell3/tts3/README.md @@ -17,7 +17,7 @@ tar zxvf data_aishell3.tgz -C data_aishell3 ``` ### Get MFA result of AISHELL-3 and Extract it We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2. -You can download from here [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) (use MFA1.x now) of our repo. +You can download from here [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (use MFA1.x now) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/data_aishell3`. diff --git a/examples/aishell3/vc0/README.md b/examples/aishell3/vc0/README.md index 28fea629262e303f8752c457eb1438936bc27eb4..cd573c4d82664ef09653ee6015c7a51a781b8797 100644 --- a/examples/aishell3/vc0/README.md +++ b/examples/aishell3/vc0/README.md @@ -41,7 +41,8 @@ We use Montreal Force Aligner 1.0. The label in aishell3 include pinyin,so th We use [lexicon.txt](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/paddlespeech/t2s/exps/voice_cloning/tacotron2_ge2e/lexicon.txt) as the lexicon. -You can download the alignment results from here [alignment_aishell3.tar.gz](https://paddlespeech.bj.bcebos.com/Parakeet/alignment_aishell3.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) (use MFA1.x now) of our repo. +You can download the alignment results from here [alignment_aishell3.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/alignment_aishell3.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (use MFA1.x now) of our repo. + ```bash if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then diff --git a/examples/aishell3/vc1/README.md b/examples/aishell3/vc1/README.md index 8c0aec3af0e46f3829cb999c0a177ad8213ae4f7..974b84cade3536da775fcfb2aab5ae0db3df0683 100644 --- a/examples/aishell3/vc1/README.md +++ b/examples/aishell3/vc1/README.md @@ -1,89 +1,138 @@ + # FastSpeech2 + AISHELL-3 Voice Cloning -This example contains code used to train a [Tacotron2 ](https://arxiv.org/abs/1712.05884) model with [AISHELL-3](http://www.aishelltech.com/aishell_3). The trained model can be used in Voice Cloning Task, We refer to the model structure of [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf) . The general steps are as follows: -1. Speaker Encoder: We use a Speaker Verification to train a speaker encoder. Datasets used in this task are different from those used in Tacotron2, because the transcriptions are not needed, we use more datasets, refer to [ge2e](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/ge2e). -2. Synthesizer: Then, we use the trained speaker encoder to generate utterance embedding for each sentence in AISHELL-3. This embedding is a extra input of Tacotron2 which will be concated with encoder outputs. -3. Vocoder: We use WaveFlow as the neural Vocoder, refer to [waveflow](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc0). +This example contains code used to train a [FastSpeech2](https://arxiv.org/abs/2006.04558) model with [AISHELL-3](http://www.aishelltech.com/aishell_3). The trained model can be used in Voice Cloning Task, We refer to the model structure of [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf) . The general steps are as follows: +1. Speaker Encoder: We use a Speaker Verification to train a speaker encoder. Datasets used in this task are different from those used in `FastSpeech2`, because the transcriptions are not needed, we use more datasets, refer to [ge2e](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/ge2e). +2. Synthesizer: We use the trained speaker encoder to generate speaker embedding for each sentence in AISHELL-3. This embedding is an extra input of `FastSpeech2` which will be concated with encoder outputs. +3. Vocoder: We use [Parallel Wave GAN](http://arxiv.org/abs/1910.11480) as the neural Vocoder, refer to [voc1](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1). + +## Dataset +### Download and Extract +Download AISHELL-3. +```bash +wget https://www.openslr.org/resources/93/data_aishell3.tgz +``` +Extract AISHELL-3. +```bash +mkdir data_aishell3 +tar zxvf data_aishell3.tgz -C data_aishell3 +``` +### Get MFA Result and Extract +We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2. +You can download from here [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (use MFA1.x now) of our repo. + +## Pretrained GE2E Model +We use pretrained GE2E model to generate spwaker embedding for each sentence. + +Download pretrained GE2E model from here [ge2e_ckpt_0.3.zip](https://bj.bcebos.com/paddlespeech/Parakeet/released_models/ge2e/ge2e_ckpt_0.3.zip), and `unzip` it. ## Get Started Assume the path to the dataset is `~/datasets/data_aishell3`. -Assume the path to the MFA result of AISHELL-3 is `./alignment`. -Assume the path to the pretrained ge2e model is `ge2e_ckpt_path=./ge2e_ckpt_0.3/step-3000000` +Assume the path to the MFA result of AISHELL-3 is `./aishell3_alignment_tone`. +Assume the path to the pretrained ge2e model is `./ge2e_ckpt_0.3`. + Run the command below to 1. **source path**. -2. preprocess the dataset, +2. preprocess the dataset. 3. train the model. -4. start a voice cloning inference. +4. synthesize waveform from `metadata.jsonl`. +5. start a voice cloning inference. ```bash ./run.sh ``` -### Preprocess the dataset +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. ```bash -CUDA_VISIBLE_DEVICES=${gpus} ./local/preprocess.sh ${input} ${preprocess_path} ${alignment} ${ge2e_ckpt_path} +./run.sh --stage 0 --stop-stage 0 ``` -#### generate utterance embedding - Use pretrained GE2E (speaker encoder) to generate utterance embedding for each sentence in AISHELL-3, which has the same file structure with wav files and the format is `.npy`. - +### Data Preprocessing ```bash -if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then - python3 ${BIN_DIR}/../ge2e/inference.py \ - --input=${input} \ - --output=${preprocess_path}/embed \ - --ngpu=1 \ - --checkpoint_path=${ge2e_ckpt_path} -fi +CUDA_VISIBLE_DEVICES=${gpus} ./local/preprocess.sh ${conf_path} ${ge2e_ckpt_path} +``` +When it is done. A `dump` folder is created in the current directory. The structure of the dump folder is listed below. +```text +dump +├── dev +│ ├── norm +│ └── raw +├── embed +│ ├── SSB0005 +│ ├── SSB0009 +│ ├── ... +│ └── ... +├── phone_id_map.txt +├── speaker_id_map.txt +├── test +│ ├── norm +│ └── raw +└── train + ├── energy_stats.npy + ├── norm + ├── pitch_stats.npy + ├── raw + └── speech_stats.npy ``` +The `embed` contains the generated speaker embedding for each sentence in AISHELL-3, which has the same file structure with wav files and the format is `.npy`. The computing time of utterance embedding can be x hours. -#### process wav -There are silence in the edge of AISHELL-3's wavs, and the audio amplitude is very small, so, we need to remove the silence and normalize the audio. You can the silence remove method based on volume or energy, but the effect is not very good, We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get the alignment of text and speech, then utilize the alignment results to remove the silence. -We use Montreal Force Aligner 1.0. The label in aishell3 include pinyin,so the lexicon we provided to MFA is pinyin rather than Chinese characters. And the prosody marks(`$` and `%`) need to be removed. You shoud preprocess the dataset into the format which MFA needs, the texts have the same name with wavs and have the suffix `.lab`. +The dataset is split into 3 parts, namely `train`, `dev` and` test`, each of which contains a `norm` and `raw` sub folder. The raw folder contains speech、pitch and energy features of each utterances, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`. -We use [lexicon.txt](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/paddlespeech/t2s/exps/voice_cloning/tacotron2_ge2e/lexicon.txt) as the lexicon. +Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains phones, text_lengths, speech_lengths, durations, path of speech features, path of pitch features, path of energy features, speaker and id of each utterance. -You can download the alignment results from here [alignment_aishell3.tar.gz](https://paddlespeech.bj.bcebos.com/Parakeet/alignment_aishell3.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) (use MFA1.x now) of our repo. +The preprocessing step is very similar to that one of [tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3), but there is one more `ge2e/inference` step here. +### Model Training +`./local/train.sh` calls `${BIN_DIR}/train.py`. ```bash -if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then - echo "Process wav ..." - python3 ${BIN_DIR}/process_wav.py \ - --input=${input}/wav \ - --output=${preprocess_path}/normalized_wav \ - --alignment=${alignment} -fi +CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} ``` +The training step is very similar to that one of [tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3), but we should set `--voice-cloning=True` when calling `${BIN_DIR}/train.py`. -#### preprocess transcription -We revert the transcription into `phones` and `tones`. It is worth noting that our processing here is different from that used for MFA, we separated the tones. This is a processing method, of course, you can only segment initials and vowels. - +### Synthesizing +We use [parallel wavegan](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1) as the neural vocoder. +Download pretrained parallel wavegan model from [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip) and unzip it. ```bash -if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then - python3 ${BIN_DIR}/preprocess_transcription.py \ - --input=${input} \ - --output=${preprocess_path} -fi +unzip pwg_aishell3_ckpt_0.5.zip ``` -The default input is `~/datasets/data_aishell3/train`,which contains `label_train-set.txt`, the processed results are `metadata.yaml` and `metadata.pickle`. the former is a text format for easy viewing, and the latter is a binary format for direct reading. -#### extract mel -```python -if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then - python3 ${BIN_DIR}/extract_mel.py \ - --input=${preprocess_path}/normalized_wav \ - --output=${preprocess_path}/mel -fi +Parallel WaveGAN checkpoint contains files listed below. +```text +pwg_aishell3_ckpt_0.5 +├── default.yaml # default config used to train parallel wavegan +├── feats_stats.npy # statistics used to normalize spectrogram when training parallel wavegan +└── snapshot_iter_1000000.pdz # generator parameters of parallel wavegan ``` - -### Train the model +`./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`. ```bash -CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${preprocess_path} ${train_output_path} +CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} ``` +The synthesizing step is very similar to that one of [tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3), but we should set `--voice-cloning=True` when calling `${BIN_DIR}/synthesize.py`. -Our model remve stop token prediction in Tacotron2, because of the problem of extremely unbalanced proportion of positive and negative samples of stop token prediction, and it's very sensitive to the clip of audio silence. We use the last symbol from the highest point of attention to the encoder side as the termination condition. +### Voice Cloning +Assume there are some reference audios in `./ref_audio` +```text +ref_audio +├── 001238.wav +├── LJ015-0254.wav +└── audio_self_test.mp3 +``` +`./local/voice_cloning.sh` calls `${BIN_DIR}/voice_cloning.py` -In addition, in order to accelerate the convergence of the model, we add `guided attention loss` to induce the alignment between encoder and decoder to show diagonal lines faster. -### Infernece ```bash -CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${ge2e_params_path} ${tacotron2_params_path} ${waveflow_params_path} ${vc_input} ${vc_output} +CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_output_path} ${ckpt_name} ${ge2e_params_path} ${ref_audio_dir} ``` ## Pretrained Model -[tacotron2_aishell3_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_aishell3_ckpt_0.3.zip). +[fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip) + +Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss +:-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------: +default|2(gpu) x 96400|0.99699|0.62013|0.53057|0.11954| 0.20426| + +FastSpeech2 checkpoint contains files listed below. +(There is no need for `speaker_id_map.txt` here ) + +```text +fastspeech2_nosil_aishell3_ckpt_vc1_0.5 +├── default.yaml # default config used to train fastspeech2 +├── phone_id_map.txt # phone vocabulary file when training fastspeech2 +├── snapshot_iter_96400.pdz # model parameters and optimizer states +└── speech_stats.npy # statistics used to normalize spectrogram when training fastspeech2 +``` diff --git a/examples/aishell3/voc1/README.md b/examples/aishell3/voc1/README.md index d9e8ce591d4132a2f9c39a25986bfce67707d284..0e40c1b5b18a1c68e6c88ad863e9655fa49105f5 100644 --- a/examples/aishell3/voc1/README.md +++ b/examples/aishell3/voc1/README.md @@ -15,7 +15,7 @@ tar zxvf data_aishell3.tgz -C data_aishell3 ``` ### Get MFA result of AISHELL-3 and Extract it We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2. -You can download from here [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) (use MFA1.x now) of our repo. +You can download from here [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (use MFA1.x now) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/data_aishell3`. diff --git a/examples/csmsc/tts2/README.md b/examples/csmsc/tts2/README.md index 2088ed15c2003a1df072836d655234010c99748b..631fffc0087bb4c2c983591cdd65f53b3b3ef42c 100644 --- a/examples/csmsc/tts2/README.md +++ b/examples/csmsc/tts2/README.md @@ -7,7 +7,7 @@ Download CSMSC from it's [Official Website](https://test.data-baker.com/data/ind ### Get MFA result of CSMSC and Extract it We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for SPEEDYSPEECH. -You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo. +You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/BZNSYP`. diff --git a/examples/csmsc/tts3/README.md b/examples/csmsc/tts3/README.md index 6e4701dfe3138d9aae3e1de25f3481f780efb482..fcb626ce27eb656d58841610fc4c98c268797f40 100644 --- a/examples/csmsc/tts3/README.md +++ b/examples/csmsc/tts3/README.md @@ -7,7 +7,7 @@ Download CSMSC from it's [Official Website](https://test.data-baker.com/data/ind ### Get MFA result of CSMSC and Extract it We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2. -You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo. +You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/BZNSYP`. diff --git a/examples/csmsc/voc1/README.md b/examples/csmsc/voc1/README.md index f789cba0e9f10c9671066aca29f0e1069fd67390..4cce8b2aff7b73c0898af3521e72414d041edfd7 100644 --- a/examples/csmsc/voc1/README.md +++ b/examples/csmsc/voc1/README.md @@ -6,7 +6,7 @@ Download CSMSC from the [official website](https://www.data-baker.com/data/index ### Get MFA results for silence trim We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. -You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo. +You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/BZNSYP`. diff --git a/examples/csmsc/voc3/README.md b/examples/csmsc/voc3/README.md index 9cb9d34d4692ff7c0799de18e42af5277f95a576..0b2872fb05ad1770f2f11262aa1afe4b14bab0d3 100644 --- a/examples/csmsc/voc3/README.md +++ b/examples/csmsc/voc3/README.md @@ -6,7 +6,7 @@ Download CSMSC from the [official website](https://www.data-baker.com/data/index ### Get MFA results for silence trim We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. -You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/use_mfa) of our repo. +You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/mfa) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/BZNSYP`. diff --git a/examples/ljspeech/tts3/README.md b/examples/ljspeech/tts3/README.md index bc38aac64f24230690b35a9c8016fc344ec9e172..bfd9dd8c56810d9607e22b7bd4f3164b7b07bb77 100644 --- a/examples/ljspeech/tts3/README.md +++ b/examples/ljspeech/tts3/README.md @@ -7,7 +7,7 @@ Download LJSpeech-1.1 from the [official website](https://keithito.com/LJ-Speech ### Get MFA result of LJSpeech-1.1 and Extract it We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2. -You can download from here [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo. +You can download from here [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/LJSpeech-1.1`. diff --git a/examples/ljspeech/voc1/README.md b/examples/ljspeech/voc1/README.md index fdeac6329550be85c4e8e58d5b0ea994636cf715..13cc6ed7e467a9dd952160a0a64fb9c20533d5c7 100644 --- a/examples/ljspeech/voc1/README.md +++ b/examples/ljspeech/voc1/README.md @@ -1,26 +1,29 @@ # Parallel WaveGAN with the LJSpeech-1.1 This example contains code used to train a [parallel wavegan](http://arxiv.org/abs/1910.11480) model with [LJSpeech-1.1](https://keithito.com/LJ-Speech-Dataset/). ## Dataset -### Download and Extract the datasaet +### Download and Extract Download LJSpeech-1.1 from the [official website](https://keithito.com/LJ-Speech-Dataset/). -### Get MFA results for silence trim -We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. -You can download from here [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo. +### Get MFA Result and Extract +We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. +You can download from here [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/LJSpeech-1.1`. Assume the path to the MFA result of LJSpeech-1.1 is `./ljspeech_alignment`. Run the command below to 1. **source path**. -2. preprocess the dataset, +2. preprocess the dataset. 3. train the model. 4. synthesize wavs. - synthesize waveform from `metadata.jsonl`. ```bash ./run.sh ``` - -### Preprocess the dataset +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +```bash +./run.sh --stage 0 --stop-stage 0 +``` +### Data Preprocessing ```bash ./local/preprocess.sh ${conf_path} ``` @@ -44,7 +47,7 @@ The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of whi Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains id and paths to spectrogam of each utterance. -### Train the model +### Model Training `./local/train.sh` calls `${BIN_DIR}/train.py`. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} @@ -91,7 +94,7 @@ benchmark: 3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. -### Synthesize +### Synthesizing `./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} @@ -122,8 +125,8 @@ optional arguments: 4. `--output-dir` is the directory to save the synthesized audio files. 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. -## Pretrained Models -Pretrained models can be downloaded here. [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_ljspeech_ckpt_0.5.zip) +## Pretrained Model +Pretrained models can be downloaded here. [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip) Parallel WaveGAN checkpoint contains files listed below. @@ -134,4 +137,4 @@ pwg_ljspeech_ckpt_0.5 └── pwg_stats.npy # statistics used to normalize spectrogram when training parallel wavegan ``` ## Acknowledgement -We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN. +We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN. \ No newline at end of file diff --git a/examples/other/use_mfa/README.md b/examples/other/mfa/README.md similarity index 100% rename from examples/other/use_mfa/README.md rename to examples/other/mfa/README.md diff --git a/examples/other/use_mfa/local/cmudict-0.7b b/examples/other/mfa/local/cmudict-0.7b similarity index 100% rename from examples/other/use_mfa/local/cmudict-0.7b rename to examples/other/mfa/local/cmudict-0.7b diff --git a/examples/other/use_mfa/local/detect_oov.py b/examples/other/mfa/local/detect_oov.py similarity index 100% rename from examples/other/use_mfa/local/detect_oov.py rename to examples/other/mfa/local/detect_oov.py diff --git a/examples/other/use_mfa/local/generate_lexicon.py b/examples/other/mfa/local/generate_lexicon.py similarity index 100% rename from examples/other/use_mfa/local/generate_lexicon.py rename to examples/other/mfa/local/generate_lexicon.py diff --git a/examples/other/use_mfa/local/reorganize_aishell3.py b/examples/other/mfa/local/reorganize_aishell3.py similarity index 100% rename from examples/other/use_mfa/local/reorganize_aishell3.py rename to examples/other/mfa/local/reorganize_aishell3.py diff --git a/examples/other/use_mfa/local/reorganize_baker.py b/examples/other/mfa/local/reorganize_baker.py similarity index 100% rename from examples/other/use_mfa/local/reorganize_baker.py rename to examples/other/mfa/local/reorganize_baker.py diff --git a/examples/other/use_mfa/local/reorganize_ljspeech.py b/examples/other/mfa/local/reorganize_ljspeech.py similarity index 100% rename from examples/other/use_mfa/local/reorganize_ljspeech.py rename to examples/other/mfa/local/reorganize_ljspeech.py diff --git a/examples/other/use_mfa/local/reorganize_vctk.py b/examples/other/mfa/local/reorganize_vctk.py similarity index 100% rename from examples/other/use_mfa/local/reorganize_vctk.py rename to examples/other/mfa/local/reorganize_vctk.py diff --git a/examples/other/use_mfa/run.sh b/examples/other/mfa/run.sh similarity index 100% rename from examples/other/use_mfa/run.sh rename to examples/other/mfa/run.sh diff --git a/examples/vctk/tts3/README.md b/examples/vctk/tts3/README.md index 78bfb9668bb7bf554a8b5f1a1cf9d729799c41a9..ad4fb7bf284ba1346ef82061e4cfcced092a5bad 100644 --- a/examples/vctk/tts3/README.md +++ b/examples/vctk/tts3/README.md @@ -7,8 +7,8 @@ Download VCTK-0.92 from the [official website](https://datashare.ed.ac.uk/handle ### Get MFA result of VCTK and Extract it We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2. -You can download from here [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo. -ps: we remove three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/use_mfa/local/reorganize_vctk.py)): +You can download from here [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. +ps: we remove three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/mfa/local/reorganize_vctk.py)): 1. `p315`, because no txt for it. 2. `p280` and `p362`, because no *_mic2.flac (which is better than *_mic1.flac) for them. diff --git a/examples/vctk/voc1/README.md b/examples/vctk/voc1/README.md index c9ecae0d2a1f3265e4d6caa91d0bfed412bcfa75..5c9d54c96483bb9c661631d0835d7484fa6be4fa 100644 --- a/examples/vctk/voc1/README.md +++ b/examples/vctk/voc1/README.md @@ -5,10 +5,10 @@ This example contains code used to train a [parallel wavegan](http://arxiv.org/a ### Download and Extract the datasaet Download VCTK-0.92 from the [official website](https://datashare.ed.ac.uk/handle/10283/3443) and extract it to `~/datasets`. Then the dataset is in directory `~/datasets/VCTK-Corpus-0.92`. -### Get MFA results for silence trim +### Get MFA Result and Extract We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. -You can download from here [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo. -ps: we remove three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/use_mfa/local/reorganize_vctk.py)): +You can download from here [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. +ps: we remove three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/mfa/local/reorganize_vctk.py)): 1. `p315`, because no txt for it. 2. `p280` and `p362`, because no *_mic2.flac (which is better than *_mic1.flac) for them. diff --git a/paddlespeech/t2s/exps/gan_vocoder/multi_band_melgan/train.py b/paddlespeech/t2s/exps/gan_vocoder/multi_band_melgan/train.py index ca3c0a1f1ba55d05bcb1bcb790388025bf16ba9d..a44d2d3c263b5ca6f78b40d06a87cafa36bc7798 100644 --- a/paddlespeech/t2s/exps/gan_vocoder/multi_band_melgan/train.py +++ b/paddlespeech/t2s/exps/gan_vocoder/multi_band_melgan/train.py @@ -36,10 +36,10 @@ from paddlespeech.t2s.models.melgan import MBMelGANEvaluator from paddlespeech.t2s.models.melgan import MBMelGANUpdater from paddlespeech.t2s.models.melgan import MelGANGenerator from paddlespeech.t2s.models.melgan import MelGANMultiScaleDiscriminator -from paddlespeech.t2s.modules.adversarial_loss import DiscriminatorAdversarialLoss -from paddlespeech.t2s.modules.adversarial_loss import GeneratorAdversarialLoss +from paddlespeech.t2s.modules.losses import DiscriminatorAdversarialLoss +from paddlespeech.t2s.modules.losses import GeneratorAdversarialLoss +from paddlespeech.t2s.modules.losses import MultiResolutionSTFTLoss from paddlespeech.t2s.modules.pqmf import PQMF -from paddlespeech.t2s.modules.stft_loss import MultiResolutionSTFTLoss from paddlespeech.t2s.training.extensions.snapshot import Snapshot from paddlespeech.t2s.training.extensions.visualizer import VisualDL from paddlespeech.t2s.training.seeding import seed_everything diff --git a/paddlespeech/t2s/exps/gan_vocoder/parallelwave_gan/train.py b/paddlespeech/t2s/exps/gan_vocoder/parallelwave_gan/train.py index 42ef88307a9a766092d8bc3f39cea6890ce73d7c..98b0ed717c89d2d5d7ccf2c910420bd03722beaa 100644 --- a/paddlespeech/t2s/exps/gan_vocoder/parallelwave_gan/train.py +++ b/paddlespeech/t2s/exps/gan_vocoder/parallelwave_gan/train.py @@ -36,7 +36,7 @@ from paddlespeech.t2s.models.parallel_wavegan import PWGDiscriminator from paddlespeech.t2s.models.parallel_wavegan import PWGEvaluator from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator from paddlespeech.t2s.models.parallel_wavegan import PWGUpdater -from paddlespeech.t2s.modules.stft_loss import MultiResolutionSTFTLoss +from paddlespeech.t2s.modules.losses import MultiResolutionSTFTLoss from paddlespeech.t2s.training.extensions.snapshot import Snapshot from paddlespeech.t2s.training.extensions.visualizer import VisualDL from paddlespeech.t2s.training.seeding import seed_everything diff --git a/paddlespeech/t2s/models/__init__.py b/paddlespeech/t2s/models/__init__.py index 4ce90896d2e5d9421eb2cc922bd8dca24d48819c..66720649062343c018056a2271797e92e11e94a1 100644 --- a/paddlespeech/t2s/models/__init__.py +++ b/paddlespeech/t2s/models/__init__.py @@ -12,6 +12,8 @@ # See the License for the specific language governing permissions and # limitations under the License. from .fastspeech2 import * +from .melgan import * +from .parallel_wavegan import * from .tacotron2 import * from .transformer_tts import * from .waveflow import * diff --git a/paddlespeech/t2s/models/fastspeech2/fastspeech2.py b/paddlespeech/t2s/models/fastspeech2/fastspeech2.py index 8ff07fa5c0534354d3dc2ab5c67a283d40d83574..aa42a83dec7f54bf5445346ea7b4ea5b16f68e3e 100644 --- a/paddlespeech/t2s/models/fastspeech2/fastspeech2.py +++ b/paddlespeech/t2s/models/fastspeech2/fastspeech2.py @@ -32,7 +32,8 @@ from paddlespeech.t2s.modules.predictor.duration_predictor import DurationPredic from paddlespeech.t2s.modules.predictor.length_regulator import LengthRegulator from paddlespeech.t2s.modules.predictor.variance_predictor import VariancePredictor from paddlespeech.t2s.modules.tacotron2.decoder import Postnet -from paddlespeech.t2s.modules.transformer.encoder import Encoder +from paddlespeech.t2s.modules.transformer.encoder import ConformerEncoder +from paddlespeech.t2s.modules.transformer.encoder import TransformerEncoder class FastSpeech2(nn.Layer): @@ -306,12 +307,10 @@ class FastSpeech2(nn.Layer): num_embeddings=idim, embedding_dim=adim, padding_idx=self.padding_idx) - # add encoder type here - # 测试模型还能跑通不 - # 记得改 transformer tts + if encoder_type == "transformer": print("encoder_type is transformer") - self.encoder = Encoder( + self.encoder = TransformerEncoder( idim=idim, attention_dim=adim, attention_heads=aheads, @@ -325,11 +324,10 @@ class FastSpeech2(nn.Layer): normalize_before=encoder_normalize_before, concat_after=encoder_concat_after, positionwise_layer_type=positionwise_layer_type, - positionwise_conv_kernel_size=positionwise_conv_kernel_size, - encoder_type=encoder_type) + positionwise_conv_kernel_size=positionwise_conv_kernel_size, ) elif encoder_type == "conformer": print("encoder_type is conformer") - self.encoder = Encoder( + self.encoder = ConformerEncoder( idim=idim, attention_dim=adim, attention_heads=aheads, @@ -349,8 +347,7 @@ class FastSpeech2(nn.Layer): activation_type=conformer_activation_type, use_cnn_module=use_cnn_in_conformer, cnn_module_kernel=conformer_enc_kernel_size, - zero_triu=zero_triu, - encoder_type=encoder_type) + zero_triu=zero_triu, ) else: raise ValueError(f"{encoder_type} is not supported.") @@ -417,7 +414,7 @@ class FastSpeech2(nn.Layer): # because fastspeech's decoder is the same as encoder if decoder_type == "transformer": print("decoder_type is transformer") - self.decoder = Encoder( + self.decoder = TransformerEncoder( idim=0, attention_dim=adim, attention_heads=aheads, @@ -432,11 +429,10 @@ class FastSpeech2(nn.Layer): normalize_before=decoder_normalize_before, concat_after=decoder_concat_after, positionwise_layer_type=positionwise_layer_type, - positionwise_conv_kernel_size=positionwise_conv_kernel_size, - encoder_type=decoder_type) + positionwise_conv_kernel_size=positionwise_conv_kernel_size, ) elif decoder_type == "conformer": print("decoder_type is conformer") - self.decoder = Encoder( + self.decoder = ConformerEncoder( idim=0, attention_dim=adim, attention_heads=aheads, @@ -455,8 +451,7 @@ class FastSpeech2(nn.Layer): selfattention_layer_type=conformer_self_attn_layer_type, activation_type=conformer_activation_type, use_cnn_module=use_cnn_in_conformer, - cnn_module_kernel=conformer_dec_kernel_size, - encoder_type=decoder_type) + cnn_module_kernel=conformer_dec_kernel_size, ) else: raise ValueError(f"{decoder_type} is not supported.") diff --git a/paddlespeech/t2s/models/melgan/melgan.py b/paddlespeech/t2s/models/melgan/melgan.py index 80bb1c1b2c4e17fbcf0ff531e9504ff6ccad2332..809403f60f143c695ddf56b5b03b05fccec0babb 100644 --- a/paddlespeech/t2s/models/melgan/melgan.py +++ b/paddlespeech/t2s/models/melgan/melgan.py @@ -78,7 +78,7 @@ class MelGANGenerator(nn.Layer): Padding function module name before dilated convolution layer. pad_params : dict Hyperparameters for padding function. - use_final_nonlinear_activation : paddle.nn.Layer + use_final_nonlinear_activation : nn.Layer Activation function for the final layer. use_weight_norm : bool Whether to use weight norm. diff --git a/paddlespeech/t2s/models/speedyspeech/speedyspeech.py b/paddlespeech/t2s/models/speedyspeech/speedyspeech.py index 0689ec45337c869812646f0acd45f14fda956dee..ece5c279f2f82e76b873d2f0b80b78173fce098f 100644 --- a/paddlespeech/t2s/models/speedyspeech/speedyspeech.py +++ b/paddlespeech/t2s/models/speedyspeech/speedyspeech.py @@ -11,13 +11,34 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. +import numpy as np import paddle from paddle import nn -from paddlespeech.t2s.modules.expansion import expand from paddlespeech.t2s.modules.positional_encoding import sinusoid_position_encoding +def expand(encodings: paddle.Tensor, durations: paddle.Tensor) -> paddle.Tensor: + """ + encodings: (B, T, C) + durations: (B, T) + """ + batch_size, t_enc = durations.shape + durations = durations.numpy() + slens = np.sum(durations, -1) + t_dec = np.max(slens) + M = np.zeros([batch_size, t_dec, t_enc]) + for i in range(batch_size): + k = 0 + for j in range(t_enc): + d = durations[i, j] + M[i, k:k + d, j] = 1 + k += d + M = paddle.to_tensor(M, dtype=encodings.dtype) + encodings = paddle.matmul(M, encodings) + return encodings + + class ResidualBlock(nn.Layer): def __init__(self, channels, kernel_size, dilation, n=2): super().__init__() diff --git a/paddlespeech/t2s/models/speedyspeech/speedyspeech_updater.py b/paddlespeech/t2s/models/speedyspeech/speedyspeech_updater.py index 4883a87e53a527d90e205755a0e98b5257955139..6f9937a51845ff7c3962ba14efc1a53f0d15b94d 100644 --- a/paddlespeech/t2s/models/speedyspeech/speedyspeech_updater.py +++ b/paddlespeech/t2s/models/speedyspeech/speedyspeech_updater.py @@ -19,8 +19,8 @@ from paddle.fluid.layers import huber_loss from paddle.nn import functional as F from paddlespeech.t2s.modules.losses import masked_l1_loss +from paddlespeech.t2s.modules.losses import ssim from paddlespeech.t2s.modules.losses import weighted_mean -from paddlespeech.t2s.modules.ssim import ssim from paddlespeech.t2s.training.extensions.evaluator import StandardEvaluator from paddlespeech.t2s.training.reporter import report from paddlespeech.t2s.training.updaters.standard_updater import StandardUpdater diff --git a/paddlespeech/t2s/models/tacotron2.py b/paddlespeech/t2s/models/tacotron2.py index b0946a5ba0c42b3c895b782993f8c47ef5e7a14c..01ea4f7d235aa76eba3bc18d029e2e07e8ccbe5c 100644 --- a/paddlespeech/t2s/models/tacotron2.py +++ b/paddlespeech/t2s/models/tacotron2.py @@ -20,7 +20,6 @@ from paddle.nn import functional as F from paddle.nn import initializer as I from tqdm import trange -from paddlespeech.t2s.modules.attention import LocationSensitiveAttention from paddlespeech.t2s.modules.conv import Conv1dBatchNorm from paddlespeech.t2s.modules.losses import guided_attention_loss from paddlespeech.t2s.utils import checkpoint @@ -28,6 +27,99 @@ from paddlespeech.t2s.utils import checkpoint __all__ = ["Tacotron2", "Tacotron2Loss"] +class LocationSensitiveAttention(nn.Layer): + """Location Sensitive Attention module. + + Reference: `Attention-Based Models for Speech Recognition `_ + + Parameters + ----------- + d_query: int + The feature size of query. + d_key : int + The feature size of key. + d_attention : int + The feature size of dimension. + location_filters : int + Filter size of attention convolution. + location_kernel_size : int + Kernel size of attention convolution. + """ + + def __init__(self, + d_query: int, + d_key: int, + d_attention: int, + location_filters: int, + location_kernel_size: int): + super().__init__() + + self.query_layer = nn.Linear(d_query, d_attention, bias_attr=False) + self.key_layer = nn.Linear(d_key, d_attention, bias_attr=False) + self.value = nn.Linear(d_attention, 1, bias_attr=False) + + # Location Layer + self.location_conv = nn.Conv1D( + 2, + location_filters, + kernel_size=location_kernel_size, + padding=int((location_kernel_size - 1) / 2), + bias_attr=False, + data_format='NLC') + self.location_layer = nn.Linear( + location_filters, d_attention, bias_attr=False) + + def forward(self, + query, + processed_key, + value, + attention_weights_cat, + mask=None): + """Compute context vector and attention weights. + + Parameters + ----------- + query : Tensor [shape=(batch_size, d_query)] + The queries. + processed_key : Tensor [shape=(batch_size, time_steps_k, d_attention)] + The keys after linear layer. + value : Tensor [shape=(batch_size, time_steps_k, d_key)] + The values. + attention_weights_cat : Tensor [shape=(batch_size, time_step_k, 2)] + Attention weights concat. + mask : Tensor, optional + The mask. Shape should be (batch_size, times_steps_k, 1). + Defaults to None. + + Returns + ---------- + attention_context : Tensor [shape=(batch_size, d_attention)] + The context vector. + attention_weights : Tensor [shape=(batch_size, time_steps_k)] + The attention weights. + """ + + processed_query = self.query_layer(paddle.unsqueeze(query, axis=[1])) + processed_attention_weights = self.location_layer( + self.location_conv(attention_weights_cat)) + # (B, T_enc, 1) + alignment = self.value( + paddle.tanh(processed_attention_weights + processed_key + + processed_query)) + + if mask is not None: + alignment = alignment + (1.0 - mask) * -1e9 + + attention_weights = F.softmax(alignment, axis=1) + attention_context = paddle.matmul( + attention_weights, value, transpose_x=True) + + attention_weights = paddle.squeeze(attention_weights, axis=-1) + attention_context = paddle.squeeze(attention_context, axis=1) + + return attention_context, attention_weights + + class DecoderPreNet(nn.Layer): """Decoder prenet module for Tacotron2. @@ -197,7 +289,7 @@ class Tacotron2Encoder(nn.Layer): super().__init__() k = math.sqrt(1.0 / (d_hidden * kernel_size)) - self.conv_batchnorms = paddle.nn.LayerList([ + self.conv_batchnorms = nn.LayerList([ Conv1dBatchNorm( d_hidden, d_hidden, @@ -903,7 +995,7 @@ class Tacotron2Loss(nn.Layer): self.use_stop_token_loss = use_stop_token_loss self.use_guided_attention_loss = use_guided_attention_loss self.attn_criterion = guided_attention_loss - self.stop_criterion = paddle.nn.BCEWithLogitsLoss() + self.stop_criterion = nn.BCEWithLogitsLoss() self.sigma = sigma def forward(self, diff --git a/paddlespeech/t2s/models/transformer_tts/transformer_tts.py b/paddlespeech/t2s/models/transformer_tts/transformer_tts.py index e8adafb29a830333dc1f9eb910ca346af6f77a14..ae6d736559384b8951f02c09dbc8bf987625e1f5 100644 --- a/paddlespeech/t2s/models/transformer_tts/transformer_tts.py +++ b/paddlespeech/t2s/models/transformer_tts/transformer_tts.py @@ -34,7 +34,7 @@ from paddlespeech.t2s.modules.transformer.attention import MultiHeadedAttention from paddlespeech.t2s.modules.transformer.decoder import Decoder from paddlespeech.t2s.modules.transformer.embedding import PositionalEncoding from paddlespeech.t2s.modules.transformer.embedding import ScaledPositionalEncoding -from paddlespeech.t2s.modules.transformer.encoder import Encoder +from paddlespeech.t2s.modules.transformer.encoder import TransformerEncoder from paddlespeech.t2s.modules.transformer.mask import subsequent_mask @@ -281,7 +281,7 @@ class TransformerTTS(nn.Layer): num_embeddings=idim, embedding_dim=adim, padding_idx=self.padding_idx) - self.encoder = Encoder( + self.encoder = TransformerEncoder( idim=idim, attention_dim=adim, attention_heads=aheads, diff --git a/paddlespeech/t2s/models/waveflow.py b/paddlespeech/t2s/models/waveflow.py index c57429db1fc3ea9587e8c3a0eaec95c09330da9f..e519e0c5050cc3edb0065c8f442525f709f21d00 100644 --- a/paddlespeech/t2s/models/waveflow.py +++ b/paddlespeech/t2s/models/waveflow.py @@ -329,7 +329,7 @@ class ResidualNet(nn.LayerList): if len(dilations_h) != n_layer: raise ValueError( "number of dilations_h should equals num of layers") - super(ResidualNet, self).__init__() + super().__init__() for i in range(n_layer): dilation = (dilations_h[i], 2**i) layer = ResidualBlock(residual_channels, condition_channels, diff --git a/paddlespeech/t2s/modules/__init__.py b/paddlespeech/t2s/modules/__init__.py index 5b569f5d05100fa80587c9a06dc2c16f1d58a936..1e3312002cada4503b6c4d43f2ea5b30ba9d7efb 100644 --- a/paddlespeech/t2s/modules/__init__.py +++ b/paddlespeech/t2s/modules/__init__.py @@ -14,5 +14,4 @@ from .conv import * from .geometry import * from .losses import * -from .masking import * from .positional_encoding import * diff --git a/paddlespeech/t2s/modules/glu.py b/paddlespeech/t2s/modules/activation.py similarity index 69% rename from paddlespeech/t2s/modules/glu.py rename to paddlespeech/t2s/modules/activation.py index 1669fb36738a47f06ad4b0350fe2b7182bca49ab..f5b0af6e901e619ad3133fb8db46daef47f6b78b 100644 --- a/paddlespeech/t2s/modules/glu.py +++ b/paddlespeech/t2s/modules/activation.py @@ -11,8 +11,9 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. +import paddle +import paddle.nn.functional as F from paddle import nn -from paddle.nn import functional as F class GLU(nn.Layer): @@ -24,3 +25,18 @@ class GLU(nn.Layer): def forward(self, xs): return F.glu(xs, axis=self.dim) + + +def get_activation(act): + """Return activation function.""" + + activation_funcs = { + "hardtanh": paddle.nn.Hardtanh, + "tanh": paddle.nn.Tanh, + "relu": paddle.nn.ReLU, + "selu": paddle.nn.SELU, + "swish": paddle.nn.Swish, + "glu": GLU + } + + return activation_funcs[act]() diff --git a/paddlespeech/t2s/modules/adversarial_loss.py b/paddlespeech/t2s/modules/adversarial_loss.py deleted file mode 100644 index d2c8f7a94ea89d5fe5975799e3cacad4f94b4cd0..0000000000000000000000000000000000000000 --- a/paddlespeech/t2s/modules/adversarial_loss.py +++ /dev/null @@ -1,125 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# Modified from espnet(https://github.com/espnet/espnet) -"""Adversarial loss modules.""" -import paddle -import paddle.nn.functional as F -from paddle import nn - - -class GeneratorAdversarialLoss(nn.Layer): - """Generator adversarial loss module.""" - - def __init__( - self, - average_by_discriminators=True, - loss_type="mse", ): - """Initialize GeneratorAversarialLoss module.""" - super().__init__() - self.average_by_discriminators = average_by_discriminators - assert loss_type in ["mse", "hinge"], f"{loss_type} is not supported." - if loss_type == "mse": - self.criterion = self._mse_loss - else: - self.criterion = self._hinge_loss - - def forward(self, outputs): - """Calcualate generator adversarial loss. - Parameters - ---------- - outputs: Tensor or List - Discriminator outputs or list of discriminator outputs. - Returns - ---------- - Tensor - Generator adversarial loss value. - """ - if isinstance(outputs, (tuple, list)): - adv_loss = 0.0 - for i, outputs_ in enumerate(outputs): - if isinstance(outputs_, (tuple, list)): - # case including feature maps - outputs_ = outputs_[-1] - adv_loss += self.criterion(outputs_) - if self.average_by_discriminators: - adv_loss /= i + 1 - else: - adv_loss = self.criterion(outputs) - - return adv_loss - - def _mse_loss(self, x): - return F.mse_loss(x, paddle.ones_like(x)) - - def _hinge_loss(self, x): - return -x.mean() - - -class DiscriminatorAdversarialLoss(nn.Layer): - """Discriminator adversarial loss module.""" - - def __init__( - self, - average_by_discriminators=True, - loss_type="mse", ): - """Initialize DiscriminatorAversarialLoss module.""" - super().__init__() - self.average_by_discriminators = average_by_discriminators - assert loss_type in ["mse"], f"{loss_type} is not supported." - if loss_type == "mse": - self.fake_criterion = self._mse_fake_loss - self.real_criterion = self._mse_real_loss - - def forward(self, outputs_hat, outputs): - """Calcualate discriminator adversarial loss. - Parameters - ---------- - outputs_hat : Tensor or list - Discriminator outputs or list of - discriminator outputs calculated from generator outputs. - outputs : Tensor or list - Discriminator outputs or list of - discriminator outputs calculated from groundtruth. - Returns - ---------- - Tensor - Discriminator real loss value. - Tensor - Discriminator fake loss value. - """ - if isinstance(outputs, (tuple, list)): - real_loss = 0.0 - fake_loss = 0.0 - for i, (outputs_hat_, - outputs_) in enumerate(zip(outputs_hat, outputs)): - if isinstance(outputs_hat_, (tuple, list)): - # case including feature maps - outputs_hat_ = outputs_hat_[-1] - outputs_ = outputs_[-1] - real_loss += self.real_criterion(outputs_) - fake_loss += self.fake_criterion(outputs_hat_) - if self.average_by_discriminators: - fake_loss /= i + 1 - real_loss /= i + 1 - else: - real_loss = self.real_criterion(outputs) - fake_loss = self.fake_criterion(outputs_hat) - - return real_loss, fake_loss - - def _mse_real_loss(self, x): - return F.mse_loss(x, paddle.ones_like(x)) - - def _mse_fake_loss(self, x): - return F.mse_loss(x, paddle.zeros_like(x)) diff --git a/paddlespeech/t2s/modules/attention.py b/paddlespeech/t2s/modules/attention.py deleted file mode 100644 index 154625cc3c1d9426ed2d21edc3064798b73ccd3a..0000000000000000000000000000000000000000 --- a/paddlespeech/t2s/modules/attention.py +++ /dev/null @@ -1,348 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import math - -import numpy as np -import paddle -from paddle import nn -from paddle.nn import functional as F - - -def scaled_dot_product_attention(q, k, v, mask=None, dropout=0.0, - training=True): - r"""Scaled dot product attention with masking. - - Assume that q, k, v all have the same leading dimensions (denoted as * in - descriptions below). Dropout is applied to attention weights before - weighted sum of values. - - Parameters - ----------- - q : Tensor [shape=(\*, T_q, d)] - the query tensor. - k : Tensor [shape=(\*, T_k, d)] - the key tensor. - v : Tensor [shape=(\*, T_k, d_v)] - the value tensor. - mask : Tensor, [shape=(\*, T_q, T_k) or broadcastable shape], optional - the mask tensor, zeros correspond to paddings. Defaults to None. - - Returns - ---------- - out : Tensor [shape=(\*, T_q, d_v)] - the context vector. - attn_weights : Tensor [shape=(\*, T_q, T_k)] - the attention weights. - """ - d = q.shape[-1] # we only support imperative execution - qk = paddle.matmul(q, k, transpose_y=True) - scaled_logit = paddle.scale(qk, 1.0 / math.sqrt(d)) - - if mask is not None: - scaled_logit += paddle.scale((1.0 - mask), -1e9) # hard coded here - - attn_weights = F.softmax(scaled_logit, axis=-1) - attn_weights = F.dropout(attn_weights, dropout, training=training) - out = paddle.matmul(attn_weights, v) - return out, attn_weights - - -def drop_head(x, drop_n_heads, training=True): - """Drop n context vectors from multiple ones. - - Parameters - ---------- - x : Tensor [shape=(batch_size, num_heads, time_steps, channels)] - The input, multiple context vectors. - drop_n_heads : int [0<= drop_n_heads <= num_heads] - Number of vectors to drop. - training : bool - A flag indicating whether it is in training. If `False`, no dropout is - applied. - - Returns - ------- - Tensor - The output. - """ - if not training or (drop_n_heads == 0): - return x - - batch_size, num_heads, _, _ = x.shape - # drop all heads - if num_heads == drop_n_heads: - return paddle.zeros_like(x) - - mask = np.ones([batch_size, num_heads]) - mask[:, :drop_n_heads] = 0 - for subarray in mask: - np.random.shuffle(subarray) - scale = float(num_heads) / (num_heads - drop_n_heads) - mask = scale * np.reshape(mask, [batch_size, num_heads, 1, 1]) - out = x * paddle.to_tensor(mask) - return out - - -def _split_heads(x, num_heads): - batch_size, time_steps, _ = x.shape - x = paddle.reshape(x, [batch_size, time_steps, num_heads, -1]) - x = paddle.transpose(x, [0, 2, 1, 3]) - return x - - -def _concat_heads(x): - batch_size, _, time_steps, _ = x.shape - x = paddle.transpose(x, [0, 2, 1, 3]) - x = paddle.reshape(x, [batch_size, time_steps, -1]) - return x - - -# Standard implementations of Monohead Attention & Multihead Attention -class MonoheadAttention(nn.Layer): - """Monohead Attention module. - - Parameters - ---------- - model_dim : int - Feature size of the query. - dropout : float, optional - Dropout probability of scaled dot product attention and final context - vector. Defaults to 0.0. - k_dim : int, optional - Feature size of the key of each scaled dot product attention. If not - provided, it is set to `model_dim / num_heads`. Defaults to None. - v_dim : int, optional - Feature size of the key of each scaled dot product attention. If not - provided, it is set to `model_dim / num_heads`. Defaults to None. - """ - - def __init__(self, - model_dim: int, - dropout: float=0.0, - k_dim: int=None, - v_dim: int=None): - super(MonoheadAttention, self).__init__() - k_dim = k_dim or model_dim - v_dim = v_dim or model_dim - self.affine_q = nn.Linear(model_dim, k_dim) - self.affine_k = nn.Linear(model_dim, k_dim) - self.affine_v = nn.Linear(model_dim, v_dim) - self.affine_o = nn.Linear(v_dim, model_dim) - - self.model_dim = model_dim - self.dropout = dropout - - def forward(self, q, k, v, mask): - """Compute context vector and attention weights. - - Parameters - ----------- - q : Tensor [shape=(batch_size, time_steps_q, model_dim)] - The queries. - k : Tensor [shape=(batch_size, time_steps_k, model_dim)] - The keys. - v : Tensor [shape=(batch_size, time_steps_k, model_dim)] - The values. - mask : Tensor [shape=(batch_size, times_steps_q, time_steps_k] or broadcastable shape - The mask. - - Returns - ---------- - out : Tensor [shape=(batch_size, time_steps_q, model_dim)] - The context vector. - attention_weights : Tensor [shape=(batch_size, times_steps_q, time_steps_k)] - The attention weights. - """ - q = self.affine_q(q) # (B, T, C) - k = self.affine_k(k) - v = self.affine_v(v) - - context_vectors, attention_weights = scaled_dot_product_attention( - q, k, v, mask, self.dropout, self.training) - - out = self.affine_o(context_vectors) - return out, attention_weights - - -class MultiheadAttention(nn.Layer): - """Multihead Attention module. - - Parameters - ----------- - model_dim: int - The feature size of query. - num_heads : int - The number of attention heads. - dropout : float, optional - Dropout probability of scaled dot product attention and final context - vector. Defaults to 0.0. - k_dim : int, optional - Feature size of the key of each scaled dot product attention. If not - provided, it is set to ``model_dim / num_heads``. Defaults to None. - v_dim : int, optional - Feature size of the key of each scaled dot product attention. If not - provided, it is set to ``model_dim / num_heads``. Defaults to None. - - Raises - --------- - ValueError - If ``model_dim`` is not divisible by ``num_heads``. - """ - - def __init__(self, - model_dim: int, - num_heads: int, - dropout: float=0.0, - k_dim: int=None, - v_dim: int=None): - super(MultiheadAttention, self).__init__() - if model_dim % num_heads != 0: - raise ValueError("model_dim must be divisible by num_heads") - depth = model_dim // num_heads - k_dim = k_dim or depth - v_dim = v_dim or depth - self.affine_q = nn.Linear(model_dim, num_heads * k_dim) - self.affine_k = nn.Linear(model_dim, num_heads * k_dim) - self.affine_v = nn.Linear(model_dim, num_heads * v_dim) - self.affine_o = nn.Linear(num_heads * v_dim, model_dim) - - self.num_heads = num_heads - self.model_dim = model_dim - self.dropout = dropout - - def forward(self, q, k, v, mask): - """Compute context vector and attention weights. - - Parameters - ----------- - q : Tensor [shape=(batch_size, time_steps_q, model_dim)] - The queries. - k : Tensor [shape=(batch_size, time_steps_k, model_dim)] - The keys. - v : Tensor [shape=(batch_size, time_steps_k, model_dim)] - The values. - mask : Tensor [shape=(batch_size, times_steps_q, time_steps_k] or broadcastable shape - The mask. - - Returns - ---------- - out : Tensor [shape=(batch_size, time_steps_q, model_dim)] - The context vector. - attention_weights : Tensor [shape=(batch_size, times_steps_q, time_steps_k)] - The attention weights. - """ - q = _split_heads(self.affine_q(q), self.num_heads) # (B, h, T, C) - k = _split_heads(self.affine_k(k), self.num_heads) - v = _split_heads(self.affine_v(v), self.num_heads) - mask = paddle.unsqueeze(mask, 1) # unsqueeze for the h dim - - context_vectors, attention_weights = scaled_dot_product_attention( - q, k, v, mask, self.dropout, self.training) - # NOTE: there is more sophisticated implementation: Scheduled DropHead - context_vectors = _concat_heads(context_vectors) # (B, T, h*C) - out = self.affine_o(context_vectors) - return out, attention_weights - - -class LocationSensitiveAttention(nn.Layer): - """Location Sensitive Attention module. - - Reference: `Attention-Based Models for Speech Recognition `_ - - Parameters - ----------- - d_query: int - The feature size of query. - d_key : int - The feature size of key. - d_attention : int - The feature size of dimension. - location_filters : int - Filter size of attention convolution. - location_kernel_size : int - Kernel size of attention convolution. - """ - - def __init__(self, - d_query: int, - d_key: int, - d_attention: int, - location_filters: int, - location_kernel_size: int): - super().__init__() - - self.query_layer = nn.Linear(d_query, d_attention, bias_attr=False) - self.key_layer = nn.Linear(d_key, d_attention, bias_attr=False) - self.value = nn.Linear(d_attention, 1, bias_attr=False) - - # Location Layer - self.location_conv = nn.Conv1D( - 2, - location_filters, - kernel_size=location_kernel_size, - padding=int((location_kernel_size - 1) / 2), - bias_attr=False, - data_format='NLC') - self.location_layer = nn.Linear( - location_filters, d_attention, bias_attr=False) - - def forward(self, - query, - processed_key, - value, - attention_weights_cat, - mask=None): - """Compute context vector and attention weights. - - Parameters - ----------- - query : Tensor [shape=(batch_size, d_query)] - The queries. - processed_key : Tensor [shape=(batch_size, time_steps_k, d_attention)] - The keys after linear layer. - value : Tensor [shape=(batch_size, time_steps_k, d_key)] - The values. - attention_weights_cat : Tensor [shape=(batch_size, time_step_k, 2)] - Attention weights concat. - mask : Tensor, optional - The mask. Shape should be (batch_size, times_steps_k, 1). - Defaults to None. - - Returns - ---------- - attention_context : Tensor [shape=(batch_size, d_attention)] - The context vector. - attention_weights : Tensor [shape=(batch_size, time_steps_k)] - The attention weights. - """ - - processed_query = self.query_layer(paddle.unsqueeze(query, axis=[1])) - processed_attention_weights = self.location_layer( - self.location_conv(attention_weights_cat)) - # (B, T_enc, 1) - alignment = self.value( - paddle.tanh(processed_attention_weights + processed_key + - processed_query)) - - if mask is not None: - alignment = alignment + (1.0 - mask) * -1e9 - - attention_weights = F.softmax(alignment, axis=1) - attention_context = paddle.matmul( - attention_weights, value, transpose_x=True) - - attention_weights = paddle.squeeze(attention_weights, axis=-1) - attention_context = paddle.squeeze(attention_context, axis=1) - - return attention_context, attention_weights diff --git a/paddlespeech/t2s/modules/audio.py b/paddlespeech/t2s/modules/audio.py deleted file mode 100644 index 926ce8f24b76d86e2956fc80300988cad27f1757..0000000000000000000000000000000000000000 --- a/paddlespeech/t2s/modules/audio.py +++ /dev/null @@ -1,229 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import librosa -import numpy as np -import paddle -from librosa.util import pad_center -from paddle import nn -from paddle.nn import functional as F -from scipy import signal - -__all__ = ["quantize", "dequantize", "STFT", "MelScale"] - - -def quantize(values, n_bands): - """Linearlly quantize a float Tensor in [-1, 1) to an interger Tensor in - [0, n_bands). - - Parameters - ----------- - values : Tensor [dtype: flaot32 or float64] - The floating point value. - - n_bands : int - The number of bands. The output integer Tensor's value is in the range - [0, n_bans). - - Returns - ---------- - Tensor [dtype: int 64] - The quantized tensor. - """ - quantized = paddle.cast((values + 1.0) / 2.0 * n_bands, "int64") - return quantized - - -def dequantize(quantized, n_bands, dtype=None): - """Linearlly dequantize an integer Tensor into a float Tensor in the range - [-1, 1). - - Parameters - ----------- - quantized : Tensor [dtype: int] - The quantized value in the range [0, n_bands). - - n_bands : int - Number of bands. The input integer Tensor's value is in the range - [0, n_bans). - - dtype : str, optional - Data type of the output. - - Returns - ----------- - Tensor - The dequantized tensor, dtype is specified by `dtype`. If `dtype` is - not specified, the default float data type is used. - """ - dtype = dtype or paddle.get_default_dtype() - value = (paddle.cast(quantized, dtype) + 0.5) * (2.0 / n_bands) - 1.0 - return value - - -class STFT(nn.Layer): - """A module for computing stft transformation in a differentiable way. - - Parameters - ------------ - n_fft : int - Number of samples in a frame. - hop_length : int - Number of samples shifted between adjacent frames. - win_length : int - Length of the window. - window : str, optional - Name of window function, see `scipy.signal.get_window` for more - details. Defaults to "hanning". - center : bool - If True, the signal y is padded so that frame D[:, t] is centered - at y[t * hop_length]. If False, then D[:, t] begins at y[t * hop_length]. - Defaults to True. - pad_mode : string or function - If center=True, this argument is passed to np.pad for padding the edges - of the signal y. By default (pad_mode="reflect"), y is padded on both - sides with its own reflection, mirrored around its first and last - sample respectively. If center=False, this argument is ignored. - - Notes - ----------- - It behaves like ``librosa.core.stft``. See ``librosa.core.stft`` for more - details. - - Given a audio which ``T`` samples, it the STFT transformation outputs a - spectrum with (C, frames) and complex dtype, where ``C = 1 + n_fft / 2`` - and ``frames = 1 + T // hop_lenghth``. - - Ony ``center`` and ``reflect`` padding is supported now. - - """ - - def __init__(self, - n_fft, - hop_length=None, - win_length=None, - window="hanning", - center=True, - pad_mode="reflect"): - super().__init__() - # By default, use the entire frame - if win_length is None: - win_length = n_fft - - # Set the default hop, if it's not already specified - if hop_length is None: - hop_length = int(win_length // 4) - - self.hop_length = hop_length - self.n_bin = 1 + n_fft // 2 - self.n_fft = n_fft - self.center = center - self.pad_mode = pad_mode - - # calculate window - window = signal.get_window(window, win_length, fftbins=True) - - # pad window to n_fft size - if n_fft != win_length: - window = pad_center(window, n_fft, mode="constant") - # lpad = (n_fft - win_length) // 2 - # rpad = n_fft - win_length - lpad - # window = np.pad(window, ((lpad, pad), ), 'constant') - - # calculate weights - # r = np.arange(0, n_fft) - # M = np.expand_dims(r, -1) * np.expand_dims(r, 0) - # w_real = np.reshape(window * - # np.cos(2 * np.pi * M / n_fft)[:self.n_bin], - # (self.n_bin, 1, self.n_fft)) - # w_imag = np.reshape(window * - # np.sin(-2 * np.pi * M / n_fft)[:self.n_bin], - # (self.n_bin, 1, self.n_fft)) - weight = np.fft.fft(np.eye(n_fft))[:self.n_bin] - w_real = weight.real - w_imag = weight.imag - w = np.concatenate([w_real, w_imag], axis=0) - w = w * window - w = np.expand_dims(w, 1) - weight = paddle.cast(paddle.to_tensor(w), paddle.get_default_dtype()) - self.register_buffer("weight", weight) - - def forward(self, x): - """Compute the stft transform. - Parameters - ------------ - x : Tensor [shape=(B, T)] - The input waveform. - Returns - ------------ - real : Tensor [shape=(B, C, frames)] - The real part of the spectrogram. - - imag : Tensor [shape=(B, C, frames)] - The image part of the spectrogram. - """ - x = paddle.unsqueeze(x, axis=1) - if self.center: - x = F.pad( - x, [self.n_fft // 2, self.n_fft // 2], - data_format='NCL', - mode=self.pad_mode) - - # to BCT, C=1 - out = F.conv1d(x, self.weight, stride=self.hop_length) - real, imag = paddle.chunk(out, 2, axis=1) # BCT - return real, imag - - def power(self, x): - """Compute the power spectrum. - Parameters - ------------ - x : Tensor [shape=(B, T)] - The input waveform. - Returns - ------------ - Tensor [shape=(B, C, T)] - The power spectrum. - """ - real, imag = self.forward(x) - power = real**2 + imag**2 - return power - - def magnitude(self, x): - """Compute the magnitude of the spectrum. - Parameters - ------------ - x : Tensor [shape=(B, T)] - The input waveform. - Returns - ------------ - Tensor [shape=(B, C, T)] - The magnitude of the spectrum. - """ - power = self.power(x) - magnitude = paddle.sqrt(power) # TODO(chenfeiyu): maybe clipping - return magnitude - - -class MelScale(nn.Layer): - def __init__(self, sr, n_fft, n_mels, fmin, fmax): - super().__init__() - mel_basis = librosa.filters.mel(sr, n_fft, n_mels, fmin, fmax) - # self.weight = paddle.to_tensor(mel_basis) - weight = paddle.to_tensor(mel_basis, dtype=paddle.get_default_dtype()) - self.register_buffer("weight", weight) - - def forward(self, spec): - # (n_mels, n_freq) * (batch_size, n_freq, n_frames) - mel = paddle.matmul(self.weight, spec) - return mel diff --git a/paddlespeech/t2s/modules/causal_conv.py b/paddlespeech/t2s/modules/causal_conv.py index c0dd5b28c96ce8bfd4edd52a27d687ff5901f27f..c0d4f95598e3889db48f4e7113f4242908ba18e0 100644 --- a/paddlespeech/t2s/modules/causal_conv.py +++ b/paddlespeech/t2s/modules/causal_conv.py @@ -13,9 +13,10 @@ # limitations under the License. """Causal convolusion layer modules.""" import paddle +from paddle import nn -class CausalConv1D(paddle.nn.Layer): +class CausalConv1D(nn.Layer): """CausalConv1D module with customized initialization.""" def __init__( @@ -31,7 +32,7 @@ class CausalConv1D(paddle.nn.Layer): super().__init__() self.pad = getattr(paddle.nn, pad)((kernel_size - 1) * dilation, **pad_params) - self.conv = paddle.nn.Conv1D( + self.conv = nn.Conv1D( in_channels, out_channels, kernel_size, @@ -52,7 +53,7 @@ class CausalConv1D(paddle.nn.Layer): return self.conv(self.pad(x))[:, :, :x.shape[2]] -class CausalConv1DTranspose(paddle.nn.Layer): +class CausalConv1DTranspose(nn.Layer): """CausalConv1DTranspose module with customized initialization.""" def __init__(self, @@ -63,7 +64,7 @@ class CausalConv1DTranspose(paddle.nn.Layer): bias=True): """Initialize CausalConvTranspose1d module.""" super().__init__() - self.deconv = paddle.nn.Conv1DTranspose( + self.deconv = nn.Conv1DTranspose( in_channels, out_channels, kernel_size, stride, bias_attr=bias) self.stride = stride diff --git a/paddlespeech/t2s/modules/conformer/convolution.py b/paddlespeech/t2s/modules/conformer/convolution.py index 25246736b92dfda364cf53a02ed37bb670e99c55..e4a6c8c67e70842feaf84160f273b083f86c012f 100644 --- a/paddlespeech/t2s/modules/conformer/convolution.py +++ b/paddlespeech/t2s/modules/conformer/convolution.py @@ -72,8 +72,10 @@ class ConvolutionModule(nn.Layer): x = x.transpose([0, 2, 1]) # GLU mechanism - x = self.pointwise_conv1(x) # (batch, 2*channel, dim) - x = nn.functional.glu(x, axis=1) # (batch, channel, dim) + # (batch, 2*channel, time) + x = self.pointwise_conv1(x) + # (batch, channel, time) + x = nn.functional.glu(x, axis=1) # 1D Depthwise Conv x = self.depthwise_conv(x) diff --git a/paddlespeech/t2s/modules/conformer/encoder_layer.py b/paddlespeech/t2s/modules/conformer/encoder_layer.py index a7a4936786f9b47f945740d4b45eb7a2b98101ee..2949dc37634beb8332a8c231c41253bb1501ad6a 100644 --- a/paddlespeech/t2s/modules/conformer/encoder_layer.py +++ b/paddlespeech/t2s/modules/conformer/encoder_layer.py @@ -25,19 +25,19 @@ class EncoderLayer(nn.Layer): ---------- size : int Input dimension. - self_attn : paddle.nn.Layer + self_attn : nn.Layer Self-attention module instance. `MultiHeadedAttention` or `RelPositionMultiHeadedAttention` instance can be used as the argument. - feed_forward : paddle.nn.Layer + feed_forward : nn.Layer Feed-forward module instance. `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance can be used as the argument. - feed_forward_macaron : paddle.nn.Layer + feed_forward_macaron : nn.Layer Additional feed-forward module instance. `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance can be used as the argument. - conv_module : paddle.nn.Layer + conv_module : nn.Layer Convolution module instance. `ConvlutionModule` instance can be used as the argument. dropout_rate : float @@ -67,7 +67,7 @@ class EncoderLayer(nn.Layer): concat_after=False, stochastic_depth_rate=0.0, ): """Construct an EncoderLayer object.""" - super(EncoderLayer, self).__init__() + super().__init__() self.self_attn = self_attn self.feed_forward = feed_forward self.feed_forward_macaron = feed_forward_macaron diff --git a/paddlespeech/t2s/modules/conv.py b/paddlespeech/t2s/modules/conv.py index d9bd98df5f6dbed5ce1921225d47f9e8d3ba5eea..68766d5e5af0a52ea9e1243de8bf39531142e3e2 100644 --- a/paddlespeech/t2s/modules/conv.py +++ b/paddlespeech/t2s/modules/conv.py @@ -84,7 +84,7 @@ class Conv1dCell(nn.Conv1D): _kernel_size = kernel_size[0] if isinstance(kernel_size, ( tuple, list)) else kernel_size self._r = 1 + (_kernel_size - 1) * _dilation - super(Conv1dCell, self).__init__( + super().__init__( in_channels, out_channels, kernel_size, @@ -226,7 +226,7 @@ class Conv1dBatchNorm(nn.Layer): data_format="NCL", momentum=0.9, epsilon=1e-05): - super(Conv1dBatchNorm, self).__init__() + super().__init__() self.conv = nn.Conv1D( in_channels, out_channels, diff --git a/paddlespeech/t2s/modules/expansion.py b/paddlespeech/t2s/modules/expansion.py deleted file mode 100644 index e9d4b6fe8645468828179eb4d2420070237de470..0000000000000000000000000000000000000000 --- a/paddlespeech/t2s/modules/expansion.py +++ /dev/null @@ -1,37 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import numpy as np -import paddle -from paddle import Tensor - - -def expand(encodings: Tensor, durations: Tensor) -> Tensor: - """ - encodings: (B, T, C) - durations: (B, T) - """ - batch_size, t_enc = durations.shape - durations = durations.numpy() - slens = np.sum(durations, -1) - t_dec = np.max(slens) - M = np.zeros([batch_size, t_dec, t_enc]) - for i in range(batch_size): - k = 0 - for j in range(t_enc): - d = durations[i, j] - M[i, k:k + d, j] = 1 - k += d - M = paddle.to_tensor(M, dtype=encodings.dtype) - encodings = paddle.matmul(M, encodings) - return encodings diff --git a/paddlespeech/t2s/modules/layer_norm.py b/paddlespeech/t2s/modules/layer_norm.py index a1c775fc8bfa6d1b720e58cf4db223f01c558c5d..4edd22c98302fd10e99184b53a64c6cd7a6ee855 100644 --- a/paddlespeech/t2s/modules/layer_norm.py +++ b/paddlespeech/t2s/modules/layer_norm.py @@ -13,9 +13,10 @@ # limitations under the License. """Layer normalization module.""" import paddle +from paddle import nn -class LayerNorm(paddle.nn.LayerNorm): +class LayerNorm(nn.LayerNorm): """Layer normalization module. Parameters @@ -28,7 +29,7 @@ class LayerNorm(paddle.nn.LayerNorm): def __init__(self, nout, dim=-1): """Construct an LayerNorm object.""" - super(LayerNorm, self).__init__(nout) + super().__init__(nout) self.dim = dim def forward(self, x): diff --git a/paddlespeech/t2s/modules/losses.py b/paddlespeech/t2s/modules/losses.py index ece9e04505e65c8e2b6779c117f8d21557c12f4d..6b0ab6b3342f39e2f9dc8e17d4d3f69ff82460a2 100644 --- a/paddlespeech/t2s/modules/losses.py +++ b/paddlespeech/t2s/modules/losses.py @@ -11,18 +11,16 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. +import math + import paddle +from paddle import nn from paddle.fluid.layers import sequence_mask from paddle.nn import functional as F - -__all__ = [ - "guided_attention_loss", - "weighted_mean", - "masked_l1_loss", - "masked_softmax_with_cross_entropy", -] +from scipy import signal +# Loss for Tacotron2 def attention_guide(dec_lens, enc_lens, N, T, g, dtype=None): """Build that W matrix. shape(B, T_dec, T_enc) W[i, n, t] = 1 - exp(-(n/dec_lens[i] - t/enc_lens[i])**2 / (2g**2)) @@ -57,6 +55,367 @@ def guided_attention_loss(attention_weight, dec_lens, enc_lens, g): return loss +# Losses for GAN Vocoder +def stft(x, + fft_size, + hop_length=None, + win_length=None, + window='hann', + center=True, + pad_mode='reflect'): + """Perform STFT and convert to magnitude spectrogram. + Parameters + ---------- + x : Tensor + Input signal tensor (B, T). + fft_size : int + FFT size. + hop_size : int + Hop size. + win_length : int + window : str, optional + window : str + Name of window function, see `scipy.signal.get_window` for more + details. Defaults to "hann". + center : bool, optional + center (bool, optional): Whether to pad `x` to make that the + :math:`t \times hop\_length` at the center of :math:`t`-th frame. Default: `True`. + pad_mode : str, optional + Choose padding pattern when `center` is `True`. + Returns + ---------- + Tensor: + Magnitude spectrogram (B, #frames, fft_size // 2 + 1). + """ + # calculate window + window = signal.get_window(window, win_length, fftbins=True) + window = paddle.to_tensor(window) + x_stft = paddle.signal.stft( + x, + fft_size, + hop_length, + win_length, + window=window, + center=center, + pad_mode=pad_mode) + + real = x_stft.real() + imag = x_stft.imag() + + return paddle.sqrt(paddle.clip(real**2 + imag**2, min=1e-7)).transpose( + [0, 2, 1]) + + +class SpectralConvergenceLoss(nn.Layer): + """Spectral convergence loss module.""" + + def __init__(self): + """Initilize spectral convergence loss module.""" + super().__init__() + + def forward(self, x_mag, y_mag): + """Calculate forward propagation. + Parameters + ---------- + x_mag : Tensor + Magnitude spectrogram of predicted signal (B, #frames, #freq_bins). + y_mag : Tensor) + Magnitude spectrogram of groundtruth signal (B, #frames, #freq_bins). + Returns + ---------- + Tensor + Spectral convergence loss value. + """ + return paddle.norm( + y_mag - x_mag, p="fro") / paddle.clip( + paddle.norm(y_mag, p="fro"), min=1e-10) + + +class LogSTFTMagnitudeLoss(nn.Layer): + """Log STFT magnitude loss module.""" + + def __init__(self, epsilon=1e-7): + """Initilize los STFT magnitude loss module.""" + super().__init__() + self.epsilon = epsilon + + def forward(self, x_mag, y_mag): + """Calculate forward propagation. + Parameters + ---------- + x_mag : Tensor + Magnitude spectrogram of predicted signal (B, #frames, #freq_bins). + y_mag : Tensor + Magnitude spectrogram of groundtruth signal (B, #frames, #freq_bins). + Returns + ---------- + Tensor + Log STFT magnitude loss value. + """ + return F.l1_loss( + paddle.log(paddle.clip(y_mag, min=self.epsilon)), + paddle.log(paddle.clip(x_mag, min=self.epsilon))) + + +class STFTLoss(nn.Layer): + """STFT loss module.""" + + def __init__(self, + fft_size=1024, + shift_size=120, + win_length=600, + window="hann"): + """Initialize STFT loss module.""" + super().__init__() + self.fft_size = fft_size + self.shift_size = shift_size + self.win_length = win_length + self.window = window + self.spectral_convergence_loss = SpectralConvergenceLoss() + self.log_stft_magnitude_loss = LogSTFTMagnitudeLoss() + + def forward(self, x, y): + """Calculate forward propagation. + Parameters + ---------- + x : Tensor + Predicted signal (B, T). + y : Tensor + Groundtruth signal (B, T). + Returns + ---------- + Tensor + Spectral convergence loss value. + Tensor + Log STFT magnitude loss value. + """ + x_mag = stft(x, self.fft_size, self.shift_size, self.win_length, + self.window) + y_mag = stft(y, self.fft_size, self.shift_size, self.win_length, + self.window) + sc_loss = self.spectral_convergence_loss(x_mag, y_mag) + mag_loss = self.log_stft_magnitude_loss(x_mag, y_mag) + + return sc_loss, mag_loss + + +class MultiResolutionSTFTLoss(nn.Layer): + """Multi resolution STFT loss module.""" + + def __init__( + self, + fft_sizes=[1024, 2048, 512], + hop_sizes=[120, 240, 50], + win_lengths=[600, 1200, 240], + window="hann", ): + """Initialize Multi resolution STFT loss module. + Parameters + ---------- + fft_sizes : list + List of FFT sizes. + hop_sizes : list + List of hop sizes. + win_lengths : list + List of window lengths. + window : str + Window function type. + """ + super().__init__() + assert len(fft_sizes) == len(hop_sizes) == len(win_lengths) + self.stft_losses = nn.LayerList() + for fs, ss, wl in zip(fft_sizes, hop_sizes, win_lengths): + self.stft_losses.append(STFTLoss(fs, ss, wl, window)) + + def forward(self, x, y): + """Calculate forward propagation. + Parameters + ---------- + x : Tensor + Predicted signal (B, T) or (B, #subband, T). + y : Tensor + Groundtruth signal (B, T) or (B, #subband, T). + Returns + ---------- + Tensor + Multi resolution spectral convergence loss value. + Tensor + Multi resolution log STFT magnitude loss value. + """ + if len(x.shape) == 3: + # (B, C, T) -> (B x C, T) + x = x.reshape([-1, x.shape[2]]) + # (B, C, T) -> (B x C, T) + y = y.reshape([-1, y.shape[2]]) + sc_loss = 0.0 + mag_loss = 0.0 + for f in self.stft_losses: + sc_l, mag_l = f(x, y) + sc_loss += sc_l + mag_loss += mag_l + sc_loss /= len(self.stft_losses) + mag_loss /= len(self.stft_losses) + + return sc_loss, mag_loss + + +class GeneratorAdversarialLoss(nn.Layer): + """Generator adversarial loss module.""" + + def __init__( + self, + average_by_discriminators=True, + loss_type="mse", ): + """Initialize GeneratorAversarialLoss module.""" + super().__init__() + self.average_by_discriminators = average_by_discriminators + assert loss_type in ["mse", "hinge"], f"{loss_type} is not supported." + if loss_type == "mse": + self.criterion = self._mse_loss + else: + self.criterion = self._hinge_loss + + def forward(self, outputs): + """Calcualate generator adversarial loss. + Parameters + ---------- + outputs: Tensor or List + Discriminator outputs or list of discriminator outputs. + Returns + ---------- + Tensor + Generator adversarial loss value. + """ + if isinstance(outputs, (tuple, list)): + adv_loss = 0.0 + for i, outputs_ in enumerate(outputs): + if isinstance(outputs_, (tuple, list)): + # case including feature maps + outputs_ = outputs_[-1] + adv_loss += self.criterion(outputs_) + if self.average_by_discriminators: + adv_loss /= i + 1 + else: + adv_loss = self.criterion(outputs) + + return adv_loss + + def _mse_loss(self, x): + return F.mse_loss(x, paddle.ones_like(x)) + + def _hinge_loss(self, x): + return -x.mean() + + +class DiscriminatorAdversarialLoss(nn.Layer): + """Discriminator adversarial loss module.""" + + def __init__( + self, + average_by_discriminators=True, + loss_type="mse", ): + """Initialize DiscriminatorAversarialLoss module.""" + super().__init__() + self.average_by_discriminators = average_by_discriminators + assert loss_type in ["mse"], f"{loss_type} is not supported." + if loss_type == "mse": + self.fake_criterion = self._mse_fake_loss + self.real_criterion = self._mse_real_loss + + def forward(self, outputs_hat, outputs): + """Calcualate discriminator adversarial loss. + Parameters + ---------- + outputs_hat : Tensor or list + Discriminator outputs or list of + discriminator outputs calculated from generator outputs. + outputs : Tensor or list + Discriminator outputs or list of + discriminator outputs calculated from groundtruth. + Returns + ---------- + Tensor + Discriminator real loss value. + Tensor + Discriminator fake loss value. + """ + if isinstance(outputs, (tuple, list)): + real_loss = 0.0 + fake_loss = 0.0 + for i, (outputs_hat_, + outputs_) in enumerate(zip(outputs_hat, outputs)): + if isinstance(outputs_hat_, (tuple, list)): + # case including feature maps + outputs_hat_ = outputs_hat_[-1] + outputs_ = outputs_[-1] + real_loss += self.real_criterion(outputs_) + fake_loss += self.fake_criterion(outputs_hat_) + if self.average_by_discriminators: + fake_loss /= i + 1 + real_loss /= i + 1 + else: + real_loss = self.real_criterion(outputs) + fake_loss = self.fake_criterion(outputs_hat) + + return real_loss, fake_loss + + def _mse_real_loss(self, x): + return F.mse_loss(x, paddle.ones_like(x)) + + def _mse_fake_loss(self, x): + return F.mse_loss(x, paddle.zeros_like(x)) + + +# Losses for SpeedySpeech +# Structural Similarity Index Measure (SSIM) +def gaussian(window_size, sigma): + gauss = paddle.to_tensor([ + math.exp(-(x - window_size // 2)**2 / float(2 * sigma**2)) + for x in range(window_size) + ]) + return gauss / gauss.sum() + + +def create_window(window_size, channel): + _1D_window = gaussian(window_size, 1.5).unsqueeze(1) + _2D_window = paddle.matmul(_1D_window, paddle.transpose( + _1D_window, [1, 0])).unsqueeze([0, 1]) + window = paddle.expand(_2D_window, [channel, 1, window_size, window_size]) + return window + + +def _ssim(img1, img2, window, window_size, channel, size_average=True): + mu1 = F.conv2d(img1, window, padding=window_size // 2, groups=channel) + mu2 = F.conv2d(img2, window, padding=window_size // 2, groups=channel) + + mu1_sq = mu1.pow(2) + mu2_sq = mu2.pow(2) + mu1_mu2 = mu1 * mu2 + + sigma1_sq = F.conv2d( + img1 * img1, window, padding=window_size // 2, groups=channel) - mu1_sq + sigma2_sq = F.conv2d( + img2 * img2, window, padding=window_size // 2, groups=channel) - mu2_sq + sigma12 = F.conv2d( + img1 * img2, window, padding=window_size // 2, groups=channel) - mu1_mu2 + + C1 = 0.01**2 + C2 = 0.03**2 + + ssim_map = ((2 * mu1_mu2 + C1) * (2 * sigma12 + C2)) \ + / ((mu1_sq + mu2_sq + C1) * (sigma1_sq + sigma2_sq + C2)) + + if size_average: + return ssim_map.mean() + else: + return ssim_map.mean(1).mean(1).mean(1) + + +def ssim(img1, img2, window_size=11, size_average=True): + (_, channel, _, _) = img1.shape + window = create_window(window_size, channel) + return _ssim(img1, img2, window, window_size, channel, size_average) + + def weighted_mean(input, weight): """Weighted mean. It can also be used as masked mean. @@ -98,28 +457,3 @@ def masked_l1_loss(prediction, target, mask): abs_error = F.l1_loss(prediction, target, reduction='none') loss = weighted_mean(abs_error, mask) return loss - - -def masked_softmax_with_cross_entropy(logits, label, mask, axis=-1): - """Compute masked softmax with cross entropy loss. - - Parameters - ---------- - logits : Tensor - The logits. The ``axis``-th axis is the class dimension. - label : Tensor [dtype: int] - The label. The size of the ``axis``-th axis should be 1. - mask : Tensor - The mask. The shape should be broadcastable to ``label``. - axis : int, optional - The index of the class dimension in the shape of ``logits``, by default - -1. - - Returns - ------- - Tensor [shape=(1,)] - The masked softmax with cross entropy loss. - """ - ce = F.softmax_with_cross_entropy(logits, label, axis=axis) - loss = weighted_mean(ce, mask) - return loss diff --git a/paddlespeech/t2s/modules/masking.py b/paddlespeech/t2s/modules/masking.py deleted file mode 100644 index 7cf37040a728e9fa8cf4c4ef11d818b444186358..0000000000000000000000000000000000000000 --- a/paddlespeech/t2s/modules/masking.py +++ /dev/null @@ -1,120 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import paddle - -__all__ = [ - "id_mask", - "feature_mask", - "combine_mask", - "future_mask", -] - - -def id_mask(input, padding_index=0, dtype="bool"): - """Generate mask with input ids. - - Those positions where the value equals ``padding_index`` correspond to 0 or - ``False``, otherwise, 1 or ``True``. - - Parameters - ---------- - input : Tensor [dtype: int] - The input tensor. It represents the ids. - padding_index : int, optional - The id which represents padding, by default 0. - dtype : str, optional - Data type of the returned mask, by default "bool". - - Returns - ------- - Tensor - The generate mask. It has the same shape as ``input`` does. - """ - return paddle.cast(input != padding_index, dtype) - - -def feature_mask(input, axis, dtype="bool"): - """Compute mask from input features. - - For a input features, represented as batched feature vectors, those vectors - which all zeros are considerd padding vectors. - - Parameters - ---------- - input : Tensor [dtype: float] - The input tensor which represents featues. - axis : int - The index of the feature dimension in ``input``. Other dimensions are - considered ``spatial`` dimensions. - dtype : str, optional - Data type of the generated mask, by default "bool" - Returns - ------- - Tensor - The geenrated mask with ``spatial`` shape as mentioned above. - - It has one less dimension than ``input`` does. - """ - feature_sum = paddle.sum(paddle.abs(input), axis) - return paddle.cast(feature_sum != 0, dtype) - - -def combine_mask(mask1, mask2): - """Combine two mask with multiplication or logical and. - - Parameters - ----------- - mask1 : Tensor - The first mask. - mask2 : Tensor - The second mask with broadcastable shape with ``mask1``. - Returns - -------- - Tensor - Combined mask. - - Notes - ------ - It is mainly used to combine the padding mask and no future mask for - transformer decoder. - - Padding mask is used to mask padding positions of the decoder inputs and - no future mask is used to prevent the decoder to see future information. - """ - if mask1.dtype == paddle.fluid.core.VarDesc.VarType.BOOL: - return paddle.logical_and(mask1, mask2) - else: - return mask1 * mask2 - - -def future_mask(time_steps, dtype="bool"): - """Generate lower triangular mask. - - It is used at transformer decoder to prevent the decoder to see future - information. - - Parameters - ---------- - time_steps : int - Decoder time steps. - dtype : str, optional - The data type of the generate mask, by default "bool". - - Returns - ------- - Tensor - The generated mask. - """ - mask = paddle.tril(paddle.ones([time_steps, time_steps])) - return paddle.cast(mask, dtype) diff --git a/paddlespeech/t2s/modules/nets_utils.py b/paddlespeech/t2s/modules/nets_utils.py index 879cdba63e87d6c898a1fd417a9f9e3958a9bcb2..3822b33d074ce2343412a7eadd4294a0f2f9accb 100644 --- a/paddlespeech/t2s/modules/nets_utils.py +++ b/paddlespeech/t2s/modules/nets_utils.py @@ -129,7 +129,7 @@ def initialize(model: nn.Layer, init: str): Parameters ---------- - model : paddle.nn.Layer + model : nn.Layer Target. init : str Method of initialization. @@ -150,17 +150,3 @@ def initialize(model: nn.Layer, init: str): nn.initializer.Constant()) else: raise ValueError("Unknown initialization: " + init) - - -def get_activation(act): - """Return activation function.""" - - activation_funcs = { - "hardtanh": paddle.nn.Hardtanh, - "tanh": paddle.nn.Tanh, - "relu": paddle.nn.ReLU, - "selu": paddle.nn.SELU, - "swish": paddle.nn.Swish, - } - - return activation_funcs[act]() diff --git a/paddlespeech/t2s/modules/pqmf.py b/paddlespeech/t2s/modules/pqmf.py index c299fb5772070fa2e5a4fdc5201541b787b7d449..fb850a4d55548f8dfcc5262e919eb0ab1d91c8ec 100644 --- a/paddlespeech/t2s/modules/pqmf.py +++ b/paddlespeech/t2s/modules/pqmf.py @@ -16,6 +16,7 @@ import numpy as np import paddle import paddle.nn.functional as F +from paddle import nn from scipy.signal import kaiser @@ -56,7 +57,7 @@ def design_prototype_filter(taps=62, cutoff_ratio=0.142, beta=9.0): return h -class PQMF(paddle.nn.Layer): +class PQMF(nn.Layer): """PQMF module. This module is based on `Near-perfect-reconstruction pseudo-QMF banks`_. .. _`Near-perfect-reconstruction pseudo-QMF banks`: @@ -105,7 +106,7 @@ class PQMF(paddle.nn.Layer): self.updown_filter = updown_filter self.subbands = subbands # keep padding info - self.pad_fn = paddle.nn.Pad1D(taps // 2, mode='constant', value=0.0) + self.pad_fn = nn.Pad1D(taps // 2, mode='constant', value=0.0) def analysis(self, x): """Analysis with PQMF. diff --git a/paddlespeech/t2s/modules/predictor/duration_predictor.py b/paddlespeech/t2s/modules/predictor/duration_predictor.py index b269b6866452f57296656f30e0b74f472737891c..6d7adf236d285db283b0a3558c1ad086c59a3108 100644 --- a/paddlespeech/t2s/modules/predictor/duration_predictor.py +++ b/paddlespeech/t2s/modules/predictor/duration_predictor.py @@ -65,7 +65,7 @@ class DurationPredictor(nn.Layer): Offset value to avoid nan in log domain. """ - super(DurationPredictor, self).__init__() + super().__init__() self.offset = offset self.conv = nn.LayerList() for idx in range(n_layers): @@ -155,7 +155,7 @@ class DurationPredictorLoss(nn.Layer): reduction : str Reduction type in loss calculation. """ - super(DurationPredictorLoss, self).__init__() + super().__init__() self.criterion = nn.MSELoss(reduction=reduction) self.offset = offset diff --git a/paddlespeech/t2s/modules/ssim.py b/paddlespeech/t2s/modules/ssim.py deleted file mode 100644 index c9899cd6bc60d937b909f0b5cf405f4b16194a04..0000000000000000000000000000000000000000 --- a/paddlespeech/t2s/modules/ssim.py +++ /dev/null @@ -1,80 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -from math import exp - -import paddle -import paddle.nn.functional as F -from paddle import nn - - -def gaussian(window_size, sigma): - gauss = paddle.to_tensor([ - exp(-(x - window_size // 2)**2 / float(2 * sigma**2)) - for x in range(window_size) - ]) - return gauss / gauss.sum() - - -def create_window(window_size, channel): - _1D_window = gaussian(window_size, 1.5).unsqueeze(1) - _2D_window = paddle.matmul(_1D_window, paddle.transpose( - _1D_window, [1, 0])).unsqueeze([0, 1]) - window = paddle.expand(_2D_window, [channel, 1, window_size, window_size]) - return window - - -def _ssim(img1, img2, window, window_size, channel, size_average=True): - mu1 = F.conv2d(img1, window, padding=window_size // 2, groups=channel) - mu2 = F.conv2d(img2, window, padding=window_size // 2, groups=channel) - - mu1_sq = mu1.pow(2) - mu2_sq = mu2.pow(2) - mu1_mu2 = mu1 * mu2 - - sigma1_sq = F.conv2d( - img1 * img1, window, padding=window_size // 2, groups=channel) - mu1_sq - sigma2_sq = F.conv2d( - img2 * img2, window, padding=window_size // 2, groups=channel) - mu2_sq - sigma12 = F.conv2d( - img1 * img2, window, padding=window_size // 2, groups=channel) - mu1_mu2 - - C1 = 0.01**2 - C2 = 0.03**2 - - ssim_map = ((2 * mu1_mu2 + C1) * (2 * sigma12 + C2)) \ - / ((mu1_sq + mu2_sq + C1) * (sigma1_sq + sigma2_sq + C2)) - - if size_average: - return ssim_map.mean() - else: - return ssim_map.mean(1).mean(1).mean(1) - - -class SSIM(nn.Layer): - def __init__(self, window_size=11, size_average=True): - super().__init__() - self.window_size = window_size - self.size_average = size_average - self.channel = 1 - self.window = create_window(window_size, self.channel) - - def forward(self, img1, img2): - return _ssim(img1, img2, self.window, self.window_size, self.channel, - self.size_average) - - -def ssim(img1, img2, window_size=11, size_average=True): - (_, channel, _, _) = img1.shape - window = create_window(window_size, channel) - return _ssim(img1, img2, window, window_size, channel, size_average) diff --git a/paddlespeech/t2s/modules/stft_loss.py b/paddlespeech/t2s/modules/stft_loss.py deleted file mode 100644 index 31963e718de6b617d1067458ee00987ebbe5a1ad..0000000000000000000000000000000000000000 --- a/paddlespeech/t2s/modules/stft_loss.py +++ /dev/null @@ -1,220 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# Modified from espnet(https://github.com/espnet/espnet) -import paddle -from paddle import nn -from paddle.nn import functional as F -from scipy import signal - - -def stft(x, - fft_size, - hop_length=None, - win_length=None, - window='hann', - center=True, - pad_mode='reflect'): - """Perform STFT and convert to magnitude spectrogram. - Parameters - ---------- - x : Tensor - Input signal tensor (B, T). - fft_size : int - FFT size. - hop_size : int - Hop size. - win_length : int - window : str, optional - window : str - Name of window function, see `scipy.signal.get_window` for more - details. Defaults to "hann". - center : bool, optional - center (bool, optional): Whether to pad `x` to make that the - :math:`t \times hop\_length` at the center of :math:`t`-th frame. Default: `True`. - pad_mode : str, optional - Choose padding pattern when `center` is `True`. - Returns - ---------- - Tensor: - Magnitude spectrogram (B, #frames, fft_size // 2 + 1). - """ - # calculate window - window = signal.get_window(window, win_length, fftbins=True) - window = paddle.to_tensor(window) - x_stft = paddle.signal.stft( - x, - fft_size, - hop_length, - win_length, - window=window, - center=center, - pad_mode=pad_mode) - - real = x_stft.real() - imag = x_stft.imag() - - return paddle.sqrt(paddle.clip(real**2 + imag**2, min=1e-7)).transpose( - [0, 2, 1]) - - -class SpectralConvergenceLoss(nn.Layer): - """Spectral convergence loss module.""" - - def __init__(self): - """Initilize spectral convergence loss module.""" - super().__init__() - - def forward(self, x_mag, y_mag): - """Calculate forward propagation. - Parameters - ---------- - x_mag : Tensor - Magnitude spectrogram of predicted signal (B, #frames, #freq_bins). - y_mag : Tensor) - Magnitude spectrogram of groundtruth signal (B, #frames, #freq_bins). - Returns - ---------- - Tensor - Spectral convergence loss value. - """ - return paddle.norm( - y_mag - x_mag, p="fro") / paddle.clip( - paddle.norm(y_mag, p="fro"), min=1e-10) - - -class LogSTFTMagnitudeLoss(nn.Layer): - """Log STFT magnitude loss module.""" - - def __init__(self, epsilon=1e-7): - """Initilize los STFT magnitude loss module.""" - super().__init__() - self.epsilon = epsilon - - def forward(self, x_mag, y_mag): - """Calculate forward propagation. - Parameters - ---------- - x_mag : Tensor - Magnitude spectrogram of predicted signal (B, #frames, #freq_bins). - y_mag : Tensor - Magnitude spectrogram of groundtruth signal (B, #frames, #freq_bins). - Returns - ---------- - Tensor - Log STFT magnitude loss value. - """ - return F.l1_loss( - paddle.log(paddle.clip(y_mag, min=self.epsilon)), - paddle.log(paddle.clip(x_mag, min=self.epsilon))) - - -class STFTLoss(nn.Layer): - """STFT loss module.""" - - def __init__(self, - fft_size=1024, - shift_size=120, - win_length=600, - window="hann"): - """Initialize STFT loss module.""" - super().__init__() - self.fft_size = fft_size - self.shift_size = shift_size - self.win_length = win_length - self.window = window - self.spectral_convergence_loss = SpectralConvergenceLoss() - self.log_stft_magnitude_loss = LogSTFTMagnitudeLoss() - - def forward(self, x, y): - """Calculate forward propagation. - Parameters - ---------- - x : Tensor - Predicted signal (B, T). - y : Tensor - Groundtruth signal (B, T). - Returns - ---------- - Tensor - Spectral convergence loss value. - Tensor - Log STFT magnitude loss value. - """ - x_mag = stft(x, self.fft_size, self.shift_size, self.win_length, - self.window) - y_mag = stft(y, self.fft_size, self.shift_size, self.win_length, - self.window) - sc_loss = self.spectral_convergence_loss(x_mag, y_mag) - mag_loss = self.log_stft_magnitude_loss(x_mag, y_mag) - - return sc_loss, mag_loss - - -class MultiResolutionSTFTLoss(nn.Layer): - """Multi resolution STFT loss module.""" - - def __init__( - self, - fft_sizes=[1024, 2048, 512], - hop_sizes=[120, 240, 50], - win_lengths=[600, 1200, 240], - window="hann", ): - """Initialize Multi resolution STFT loss module. - Parameters - ---------- - fft_sizes : list - List of FFT sizes. - hop_sizes : list - List of hop sizes. - win_lengths : list - List of window lengths. - window : str - Window function type. - """ - super().__init__() - assert len(fft_sizes) == len(hop_sizes) == len(win_lengths) - self.stft_losses = nn.LayerList() - for fs, ss, wl in zip(fft_sizes, hop_sizes, win_lengths): - self.stft_losses.append(STFTLoss(fs, ss, wl, window)) - - def forward(self, x, y): - """Calculate forward propagation. - Parameters - ---------- - x : Tensor - Predicted signal (B, T) or (B, #subband, T). - y : Tensor - Groundtruth signal (B, T) or (B, #subband, T). - Returns - ---------- - Tensor - Multi resolution spectral convergence loss value. - Tensor - Multi resolution log STFT magnitude loss value. - """ - if len(x.shape) == 3: - # (B, C, T) -> (B x C, T) - x = x.reshape([-1, x.shape[2]]) - # (B, C, T) -> (B x C, T) - y = y.reshape([-1, y.shape[2]]) - sc_loss = 0.0 - mag_loss = 0.0 - for f in self.stft_losses: - sc_l, mag_l = f(x, y) - sc_loss += sc_l - mag_loss += mag_l - sc_loss /= len(self.stft_losses) - mag_loss /= len(self.stft_losses) - - return sc_loss, mag_loss diff --git a/paddlespeech/t2s/modules/style_encoder.py b/paddlespeech/t2s/modules/style_encoder.py index 8a23e85c61dd2f2cbbd06e281335317575dfc5ff..e76226f3c587d22006bcf6d513046ec641e91a61 100644 --- a/paddlespeech/t2s/modules/style_encoder.py +++ b/paddlespeech/t2s/modules/style_encoder.py @@ -74,7 +74,7 @@ class StyleEncoder(nn.Layer): gru_units: int=128, ): """Initilize global style encoder module.""" assert check_argument_types() - super(StyleEncoder, self).__init__() + super().__init__() self.ref_enc = ReferenceEncoder( idim=idim, @@ -93,11 +93,15 @@ class StyleEncoder(nn.Layer): def forward(self, speech: paddle.Tensor) -> paddle.Tensor: """Calculate forward propagation. - Args: - speech (Tensor): Batch of padded target features (B, Lmax, odim). + Parameters + ---------- + speech : Tensor + Batch of padded target features (B, Lmax, odim). - Returns: - Tensor: Style token embeddings (B, token_dim). + Returns + ---------- + Tensor: + Style token embeddings (B, token_dim). """ ref_embs = self.ref_enc(speech) @@ -145,7 +149,7 @@ class ReferenceEncoder(nn.Layer): gru_units: int=128, ): """Initilize reference encoder module.""" assert check_argument_types() - super(ReferenceEncoder, self).__init__() + super().__init__() # check hyperparameters are valid assert conv_kernel_size % 2 == 1, "kernel size must be odd." @@ -249,7 +253,7 @@ class StyleTokenLayer(nn.Layer): dropout_rate: float=0.0, ): """Initilize style token layer module.""" assert check_argument_types() - super(StyleTokenLayer, self).__init__() + super().__init__() gst_embs = paddle.randn(shape=[gst_tokens, gst_token_dim // gst_heads]) self.gst_embs = paddle.create_parameter( diff --git a/paddlespeech/t2s/modules/tacotron2/encoder.py b/paddlespeech/t2s/modules/tacotron2/encoder.py index b95e3529ff7eda456863b519e8982580c066cc01..f1889061396b3ee6ce62026a0a0e039d61cfdba0 100644 --- a/paddlespeech/t2s/modules/tacotron2/encoder.py +++ b/paddlespeech/t2s/modules/tacotron2/encoder.py @@ -73,7 +73,7 @@ class Encoder(nn.Layer): Dropout rate. """ - super(Encoder, self).__init__() + super().__init__() # store the hyperparameters self.idim = idim self.use_residual = use_residual diff --git a/paddlespeech/t2s/modules/transformer/decoder.py b/paddlespeech/t2s/modules/transformer/decoder.py index 072fc813737f3963ccfb6536a1e90e033116e7d4..fe2949f47dc9a1b837ae6aa4a8294723a7b270b7 100644 --- a/paddlespeech/t2s/modules/transformer/decoder.py +++ b/paddlespeech/t2s/modules/transformer/decoder.py @@ -67,11 +67,11 @@ class Decoder(nn.Layer): Dropout rate in self-attention. src_attention_dropout_rate : float Dropout rate in source-attention. - input_layer : (Union[str, paddle.nn.Layer]) + input_layer : (Union[str, nn.Layer]) Input layer type. use_output_layer : bool Whether to use output layer. - pos_enc_class : paddle.nn.Layer + pos_enc_class : nn.Layer Positional encoding module class. `PositionalEncoding `or `ScaledPositionalEncoding` normalize_before : bool @@ -122,8 +122,7 @@ class Decoder(nn.Layer): input_layer, pos_enc_class(attention_dim, positional_dropout_rate)) else: - raise NotImplementedError( - "only `embed` or paddle.nn.Layer is supported.") + raise NotImplementedError("only `embed` or nn.Layer is supported.") self.normalize_before = normalize_before # self-attention module definition diff --git a/paddlespeech/t2s/modules/transformer/decoder_layer.py b/paddlespeech/t2s/modules/transformer/decoder_layer.py index 0310d83ea3933c185612d53db316457f24fb8020..44978f1e84c6ddd5d8f927838705c0c59f29a630 100644 --- a/paddlespeech/t2s/modules/transformer/decoder_layer.py +++ b/paddlespeech/t2s/modules/transformer/decoder_layer.py @@ -26,13 +26,13 @@ class DecoderLayer(nn.Layer): ---------- size : int Input dimension. - self_attn : paddle.nn.Layer + self_attn : nn.Layer Self-attention module instance. `MultiHeadedAttention` instance can be used as the argument. - src_attn : paddle.nn.Layer + src_attn : nn.Layer Self-attention module instance. `MultiHeadedAttention` instance can be used as the argument. - feed_forward : paddle.nn.Layer + feed_forward : nn.Layer Feed-forward module instance. `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance can be used as the argument. dropout_rate : float diff --git a/paddlespeech/t2s/modules/transformer/embedding.py b/paddlespeech/t2s/modules/transformer/embedding.py index 3c3f36168f36b05ec85b939ba5ee0c7e3bdc2274..40ab03ee7afa2c5a0eb490bf638884a40ae68c62 100644 --- a/paddlespeech/t2s/modules/transformer/embedding.py +++ b/paddlespeech/t2s/modules/transformer/embedding.py @@ -43,7 +43,7 @@ class PositionalEncoding(nn.Layer): dtype="float32", reverse=False): """Construct an PositionalEncoding object.""" - super(PositionalEncoding, self).__init__() + super().__init__() self.d_model = d_model self.reverse = reverse self.xscale = math.sqrt(self.d_model) @@ -117,7 +117,7 @@ class ScaledPositionalEncoding(PositionalEncoding): self.alpha = paddle.create_parameter( shape=x.shape, dtype=self.dtype, - default_initializer=paddle.nn.initializer.Assign(x)) + default_initializer=nn.initializer.Assign(x)) def reset_parameters(self): """Reset parameters.""" @@ -141,7 +141,7 @@ class ScaledPositionalEncoding(PositionalEncoding): return self.dropout(x) -class RelPositionalEncoding(paddle.nn.Layer): +class RelPositionalEncoding(nn.Layer): """Relative positional encoding module (new implementation). Details can be found in https://github.com/espnet/espnet/pull/2816. See : Appendix B in https://arxiv.org/abs/1901.02860 @@ -157,10 +157,10 @@ class RelPositionalEncoding(paddle.nn.Layer): def __init__(self, d_model, dropout_rate, max_len=5000, dtype="float32"): """Construct an PositionalEncoding object.""" - super(RelPositionalEncoding, self).__init__() + super().__init__() self.d_model = d_model self.xscale = math.sqrt(self.d_model) - self.dropout = paddle.nn.Dropout(p=dropout_rate) + self.dropout = nn.Dropout(p=dropout_rate) self.pe = None self.dtype = dtype self.extend_pe(paddle.expand(paddle.zeros([1]), (1, max_len))) diff --git a/paddlespeech/t2s/modules/transformer/encoder.py b/paddlespeech/t2s/modules/transformer/encoder.py index 2fdf02cfe3d5be8c4ea2fd86916aed39e595a64c..b422f01dcd8e13430ea6944af7bbb6c02075bb9c 100644 --- a/paddlespeech/t2s/modules/transformer/encoder.py +++ b/paddlespeech/t2s/modules/transformer/encoder.py @@ -17,10 +17,10 @@ from typing import Union from paddle import nn +from paddlespeech.t2s.modules.activation import get_activation from paddlespeech.t2s.modules.conformer.convolution import ConvolutionModule from paddlespeech.t2s.modules.conformer.encoder_layer import EncoderLayer as ConformerEncoderLayer from paddlespeech.t2s.modules.layer_norm import LayerNorm -from paddlespeech.t2s.modules.nets_utils import get_activation from paddlespeech.t2s.modules.transformer.attention import MultiHeadedAttention from paddlespeech.t2s.modules.transformer.attention import RelPositionMultiHeadedAttention from paddlespeech.t2s.modules.transformer.embedding import PositionalEncoding @@ -34,8 +34,8 @@ from paddlespeech.t2s.modules.transformer.repeat import repeat from paddlespeech.t2s.modules.transformer.subsampling import Conv2dSubsampling -class Encoder(nn.Layer): - """Transformer encoder module. +class BaseEncoder(nn.Layer): + """Base Encoder module. Parameters ---------- @@ -55,7 +55,7 @@ class Encoder(nn.Layer): Dropout rate after adding positional encoding. attention_dropout_rate : float Dropout rate in attention. - input_layer : Union[str, paddle.nn.Layer] + input_layer : Union[str, nn.Layer] Input layer type. normalize_before : bool Whether to use layer_norm before the first block. @@ -120,7 +120,7 @@ class Encoder(nn.Layer): stochastic_depth_rate: float=0.0, intermediate_layers: Union[List[int], None]=None, encoder_type: str="transformer"): - """Construct an Encoder object.""" + """Construct an Bae Encoder object.""" super().__init__() activation = get_activation(activation_type) pos_enc_class = self.get_pos_enc_class(pos_enc_layer_type, @@ -264,7 +264,6 @@ class Encoder(nn.Layer): nn.Dropout(dropout_rate), nn.ReLU(), pos_enc_class(attention_dim, positional_dropout_rate), ) - elif input_layer == "conv2d": embed = Conv2dSubsampling( idim, @@ -305,46 +304,118 @@ class Encoder(nn.Layer): paddle.Tensor Mask tensor (#batch, 1, time). """ - if self.encoder_type == "transformer": - xs = self.embed(xs) - xs, masks = self.encoders(xs, masks) - if self.normalize_before: - xs = self.after_norm(xs) - return xs, masks - elif self.encoder_type == "conformer": - if isinstance(self.embed, (Conv2dSubsampling)): - xs, masks = self.embed(xs, masks) - else: - xs = self.embed(xs) - - if self.intermediate_layers is None: - xs, masks = self.encoders(xs, masks) - else: - intermediate_outputs = [] - for layer_idx, encoder_layer in enumerate(self.encoders): - xs, masks = encoder_layer(xs, masks) - - if (self.intermediate_layers is not None and - layer_idx + 1 in self.intermediate_layers): - # intermediate branches also require normalization. - encoder_output = xs - if isinstance(encoder_output, tuple): - encoder_output = encoder_output[0] - if self.normalize_before: - encoder_output = self.after_norm(encoder_output) - intermediate_outputs.append(encoder_output) - - if isinstance(xs, tuple): - xs = xs[0] - - if self.normalize_before: - xs = self.after_norm(xs) - - if self.intermediate_layers is not None: - return xs, masks, intermediate_outputs - return xs, masks - else: - raise ValueError(f"{self.encoder_type} is not supported.") + xs = self.embed(xs) + xs, masks = self.encoders(xs, masks) + if self.normalize_before: + xs = self.after_norm(xs) + return xs, masks + + +class TransformerEncoder(BaseEncoder): + """Transformer encoder module. + Parameters + ---------- + idim : int + Input dimension. + attention_dim : int + Dimention of attention. + attention_heads : int + The number of heads of multi head attention. + linear_units : int + The number of units of position-wise feed forward. + num_blocks : int + The number of decoder blocks. + dropout_rate : float + Dropout rate. + positional_dropout_rate : float + Dropout rate after adding positional encoding. + attention_dropout_rate : float + Dropout rate in attention. + input_layer : Union[str, paddle.nn.Layer] + Input layer type. + pos_enc_layer_type : str + Encoder positional encoding layer type. + normalize_before : bool + Whether to use layer_norm before the first block. + concat_after : bool + Whether to concat attention layer's input and output. + if True, additional linear will be applied. + i.e. x -> x + linear(concat(x, att(x))) + if False, no additional linear will be applied. i.e. x -> x + att(x) + positionwise_layer_type : str + "linear", "conv1d", or "conv1d-linear". + positionwise_conv_kernel_size : int + Kernel size of positionwise conv1d layer. + selfattention_layer_type : str + Encoder attention layer type. + activation_type : str + Encoder activation function type. + padding_idx : int + Padding idx for input_layer=embed. + """ + + def __init__( + self, + idim, + attention_dim: int=256, + attention_heads: int=4, + linear_units: int=2048, + num_blocks: int=6, + dropout_rate: float=0.1, + positional_dropout_rate: float=0.1, + attention_dropout_rate: float=0.0, + input_layer: str="conv2d", + pos_enc_layer_type: str="abs_pos", + normalize_before: bool=True, + concat_after: bool=False, + positionwise_layer_type: str="linear", + positionwise_conv_kernel_size: int=1, + selfattention_layer_type: str="selfattn", + activation_type: str="relu", + padding_idx: int=-1, ): + """Construct an Transformer Encoder object.""" + super().__init__( + idim, + attention_dim=attention_dim, + attention_heads=attention_heads, + linear_units=linear_units, + num_blocks=num_blocks, + dropout_rate=dropout_rate, + positional_dropout_rate=positional_dropout_rate, + attention_dropout_rate=attention_dropout_rate, + input_layer=input_layer, + pos_enc_layer_type=pos_enc_layer_type, + normalize_before=normalize_before, + concat_after=concat_after, + positionwise_layer_type=positionwise_layer_type, + positionwise_conv_kernel_size=positionwise_conv_kernel_size, + selfattention_layer_type=selfattention_layer_type, + activation_type=activation_type, + padding_idx=padding_idx, + encoder_type="transformer") + + def forward(self, xs, masks): + """Encode input sequence. + + Parameters + ---------- + xs : paddle.Tensor + Input tensor (#batch, time, idim). + masks : paddle.Tensor + Mask tensor (#batch, 1, time). + + Returns + ---------- + paddle.Tensor + Output tensor (#batch, time, attention_dim). + paddle.Tensor + Mask tensor (#batch, 1, time). + """ + xs = self.embed(xs) + xs, masks = self.encoders(xs, masks) + if self.normalize_before: + xs = self.after_norm(xs) + return xs, masks def forward_one_step(self, xs, masks, cache=None): """Encode input frame. @@ -378,3 +449,161 @@ class Encoder(nn.Layer): if self.normalize_before: xs = self.after_norm(xs) return xs, masks, new_cache + + +class ConformerEncoder(BaseEncoder): + """Conformer encoder module. + Parameters + ---------- + idim : int + Input dimension. + attention_dim : int + Dimention of attention. + attention_heads : int + The number of heads of multi head attention. + linear_units : int + The number of units of position-wise feed forward. + num_blocks : int + The number of decoder blocks. + dropout_rate : float + Dropout rate. + positional_dropout_rate : float + Dropout rate after adding positional encoding. + attention_dropout_rate : float + Dropout rate in attention. + input_layer : Union[str, nn.Layer] + Input layer type. + normalize_before : bool + Whether to use layer_norm before the first block. + concat_after : bool + Whether to concat attention layer's input and output. + if True, additional linear will be applied. + i.e. x -> x + linear(concat(x, att(x))) + if False, no additional linear will be applied. i.e. x -> x + att(x) + positionwise_layer_type : str + "linear", "conv1d", or "conv1d-linear". + positionwise_conv_kernel_size : int + Kernel size of positionwise conv1d layer. + macaron_style : bool + Whether to use macaron style for positionwise layer. + pos_enc_layer_type : str + Encoder positional encoding layer type. + selfattention_layer_type : str + Encoder attention layer type. + activation_type : str + Encoder activation function type. + use_cnn_module : bool + Whether to use convolution module. + zero_triu : bool + Whether to zero the upper triangular part of attention matrix. + cnn_module_kernel : int + Kernerl size of convolution module. + padding_idx : int + Padding idx for input_layer=embed. + stochastic_depth_rate : float + Maximum probability to skip the encoder layer. + intermediate_layers : Union[List[int], None] + indices of intermediate CTC layer. + indices start from 1. + if not None, intermediate outputs are returned (which changes return type + signature.) + """ + + def __init__( + self, + idim: int, + attention_dim: int=256, + attention_heads: int=4, + linear_units: int=2048, + num_blocks: int=6, + dropout_rate: float=0.1, + positional_dropout_rate: float=0.1, + attention_dropout_rate: float=0.0, + input_layer: str="conv2d", + normalize_before: bool=True, + concat_after: bool=False, + positionwise_layer_type: str="linear", + positionwise_conv_kernel_size: int=1, + macaron_style: bool=False, + pos_enc_layer_type: str="rel_pos", + selfattention_layer_type: str="rel_selfattn", + activation_type: str="swish", + use_cnn_module: bool=False, + zero_triu: bool=False, + cnn_module_kernel: int=31, + padding_idx: int=-1, + stochastic_depth_rate: float=0.0, + intermediate_layers: Union[List[int], None]=None, ): + """Construct an Conformer Encoder object.""" + super().__init__( + idim=idim, + attention_dim=attention_dim, + attention_heads=attention_heads, + linear_units=linear_units, + num_blocks=num_blocks, + dropout_rate=dropout_rate, + positional_dropout_rate=positional_dropout_rate, + attention_dropout_rate=attention_dropout_rate, + input_layer=input_layer, + normalize_before=normalize_before, + concat_after=concat_after, + positionwise_layer_type=positionwise_layer_type, + positionwise_conv_kernel_size=positionwise_conv_kernel_size, + macaron_style=macaron_style, + pos_enc_layer_type=pos_enc_layer_type, + selfattention_layer_type=selfattention_layer_type, + activation_type=activation_type, + use_cnn_module=use_cnn_module, + zero_triu=zero_triu, + cnn_module_kernel=cnn_module_kernel, + padding_idx=padding_idx, + stochastic_depth_rate=stochastic_depth_rate, + intermediate_layers=intermediate_layers, + encoder_type="conformer") + + def forward(self, xs, masks): + """Encode input sequence. + Parameters + ---------- + xs : paddle.Tensor + Input tensor (#batch, time, idim). + masks : paddle.Tensor + Mask tensor (#batch, 1, time). + Returns + ---------- + paddle.Tensor + Output tensor (#batch, time, attention_dim). + paddle.Tensor + Mask tensor (#batch, 1, time). + """ + if isinstance(self.embed, (Conv2dSubsampling)): + xs, masks = self.embed(xs, masks) + else: + xs = self.embed(xs) + + if self.intermediate_layers is None: + xs, masks = self.encoders(xs, masks) + else: + intermediate_outputs = [] + for layer_idx, encoder_layer in enumerate(self.encoders): + xs, masks = encoder_layer(xs, masks) + + if (self.intermediate_layers is not None and + layer_idx + 1 in self.intermediate_layers): + # intermediate branches also require normalization. + encoder_output = xs + if isinstance(encoder_output, tuple): + encoder_output = encoder_output[0] + if self.normalize_before: + encoder_output = self.after_norm(encoder_output) + intermediate_outputs.append(encoder_output) + + if isinstance(xs, tuple): + xs = xs[0] + + if self.normalize_before: + xs = self.after_norm(xs) + + if self.intermediate_layers is not None: + return xs, masks, intermediate_outputs + return xs, masks diff --git a/paddlespeech/t2s/modules/transformer/encoder_layer.py b/paddlespeech/t2s/modules/transformer/encoder_layer.py index fb2c2e823a327ab0a531076df00d19f4ae9faa7b..f55ded3de65bdc644b5eab63354ccd3a120edd51 100644 --- a/paddlespeech/t2s/modules/transformer/encoder_layer.py +++ b/paddlespeech/t2s/modules/transformer/encoder_layer.py @@ -24,10 +24,10 @@ class EncoderLayer(nn.Layer): ---------- size : int Input dimension. - self_attn : paddle.nn.Layer + self_attn : nn.Layer Self-attention module instance. `MultiHeadedAttention` instance can be used as the argument. - feed_forward : paddle.nn.Layer + feed_forward : nn.Layer Feed-forward module instance. `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance can be used as the argument. dropout_rate : float @@ -50,7 +50,7 @@ class EncoderLayer(nn.Layer): normalize_before=True, concat_after=False, ): """Construct an EncoderLayer object.""" - super(EncoderLayer, self).__init__() + super().__init__() self.self_attn = self_attn self.feed_forward = feed_forward self.norm1 = nn.LayerNorm(size) diff --git a/paddlespeech/t2s/modules/transformer/lightconv.py b/paddlespeech/t2s/modules/transformer/lightconv.py index 1aeb6d6e19a8f7b0e0820829a3ed7f496693fd28..ccf84c8a3364ea872fd589091c6a7ef28f4ac928 100644 --- a/paddlespeech/t2s/modules/transformer/lightconv.py +++ b/paddlespeech/t2s/modules/transformer/lightconv.py @@ -18,7 +18,7 @@ import paddle import paddle.nn.functional as F from paddle import nn -from paddlespeech.t2s.modules.glu import GLU +from paddlespeech.t2s.modules.activation import get_activation from paddlespeech.t2s.modules.masked_fill import masked_fill MIN_VALUE = float(numpy.finfo(numpy.float32).min) @@ -56,7 +56,7 @@ class LightweightConvolution(nn.Layer): use_kernel_mask=False, use_bias=False, ): """Construct Lightweight Convolution layer.""" - super(LightweightConvolution, self).__init__() + super().__init__() assert n_feat % wshare == 0 self.wshare = wshare @@ -68,7 +68,7 @@ class LightweightConvolution(nn.Layer): # linear -> GLU -> lightconv -> linear self.linear1 = nn.Linear(n_feat, n_feat * 2) self.linear2 = nn.Linear(n_feat, n_feat) - self.act = GLU() + self.act = get_activation("glu") # lightconv related self.uniform_ = nn.initializer.Uniform() diff --git a/paddlespeech/t2s/modules/transformer/multi_layer_conv.py b/paddlespeech/t2s/modules/transformer/multi_layer_conv.py index 8845b2a2b07f161a85b2fec254e9c4957e4614b4..df8929e306b0743899ec712ea470c792838f355c 100644 --- a/paddlespeech/t2s/modules/transformer/multi_layer_conv.py +++ b/paddlespeech/t2s/modules/transformer/multi_layer_conv.py @@ -12,10 +12,10 @@ # See the License for the specific language governing permissions and # limitations under the License. """Layer modules for FFT block in FastSpeech (Feed-forward Transformer).""" -import paddle +from paddle import nn -class MultiLayeredConv1d(paddle.nn.Layer): +class MultiLayeredConv1d(nn.Layer): """Multi-layered conv1d for Transformer block. This is a module of multi-leyered conv1d designed @@ -43,21 +43,21 @@ class MultiLayeredConv1d(paddle.nn.Layer): Dropout rate. """ - super(MultiLayeredConv1d, self).__init__() - self.w_1 = paddle.nn.Conv1D( + super().__init__() + self.w_1 = nn.Conv1D( in_chans, hidden_chans, kernel_size, stride=1, padding=(kernel_size - 1) // 2, ) - self.w_2 = paddle.nn.Conv1D( + self.w_2 = nn.Conv1D( hidden_chans, in_chans, kernel_size, stride=1, padding=(kernel_size - 1) // 2, ) - self.dropout = paddle.nn.Dropout(dropout_rate) - self.relu = paddle.nn.ReLU() + self.dropout = nn.Dropout(dropout_rate) + self.relu = nn.ReLU() def forward(self, x): """Calculate forward propagation. @@ -77,7 +77,7 @@ class MultiLayeredConv1d(paddle.nn.Layer): [0, 2, 1]) -class Conv1dLinear(paddle.nn.Layer): +class Conv1dLinear(nn.Layer): """Conv1D + Linear for Transformer block. A variant of MultiLayeredConv1d, which replaces second conv-layer to linear. @@ -98,16 +98,16 @@ class Conv1dLinear(paddle.nn.Layer): dropout_rate : float Dropout rate. """ - super(Conv1dLinear, self).__init__() - self.w_1 = paddle.nn.Conv1D( + super().__init__() + self.w_1 = nn.Conv1D( in_chans, hidden_chans, kernel_size, stride=1, padding=(kernel_size - 1) // 2, ) - self.w_2 = paddle.nn.Linear(hidden_chans, in_chans, bias_attr=True) - self.dropout = paddle.nn.Dropout(dropout_rate) - self.relu = paddle.nn.ReLU() + self.w_2 = nn.Linear(hidden_chans, in_chans, bias_attr=True) + self.dropout = nn.Dropout(dropout_rate) + self.relu = nn.ReLU() def forward(self, x): """Calculate forward propagation. diff --git a/paddlespeech/t2s/modules/transformer/positionwise_feed_forward.py b/paddlespeech/t2s/modules/transformer/positionwise_feed_forward.py index 297a3b4fbb78bc121e7102f85a95272d4f91c8ef..28ed1c31b10fa0c499baf11ec55baf6bf93fff26 100644 --- a/paddlespeech/t2s/modules/transformer/positionwise_feed_forward.py +++ b/paddlespeech/t2s/modules/transformer/positionwise_feed_forward.py @@ -14,9 +14,10 @@ # Modified from espnet(https://github.com/espnet/espnet) """Positionwise feed forward layer definition.""" import paddle +from paddle import nn -class PositionwiseFeedForward(paddle.nn.Layer): +class PositionwiseFeedForward(nn.Layer): """Positionwise feed forward layer. Parameters @@ -35,7 +36,7 @@ class PositionwiseFeedForward(paddle.nn.Layer): dropout_rate, activation=paddle.nn.ReLU()): """Construct an PositionwiseFeedForward object.""" - super(PositionwiseFeedForward, self).__init__() + super().__init__() self.w_1 = paddle.nn.Linear(idim, hidden_units, bias_attr=True) self.w_2 = paddle.nn.Linear(hidden_units, idim, bias_attr=True) self.dropout = paddle.nn.Dropout(dropout_rate) diff --git a/paddlespeech/t2s/modules/transformer/subsampling.py b/paddlespeech/t2s/modules/transformer/subsampling.py index e1bd75bb51d6ae2f18501da8dcde95805fa1c0b0..cf0fca8a1b0115f60445f52d164d822533873359 100644 --- a/paddlespeech/t2s/modules/transformer/subsampling.py +++ b/paddlespeech/t2s/modules/transformer/subsampling.py @@ -14,11 +14,12 @@ # Modified from espnet(https://github.com/espnet/espnet) """Subsampling layer definition.""" import paddle +from paddle import nn from paddlespeech.t2s.modules.transformer.embedding import PositionalEncoding -class Conv2dSubsampling(paddle.nn.Layer): +class Conv2dSubsampling(nn.Layer): """Convolutional 2D subsampling (to 1/4 length). Parameters ---------- @@ -28,20 +29,20 @@ class Conv2dSubsampling(paddle.nn.Layer): Output dimension. dropout_rate : float Dropout rate. - pos_enc : paddle.nn.Layer + pos_enc : nn.Layer Custom position encoding layer. """ def __init__(self, idim, odim, dropout_rate, pos_enc=None): """Construct an Conv2dSubsampling object.""" - super(Conv2dSubsampling, self).__init__() - self.conv = paddle.nn.Sequential( - paddle.nn.Conv2D(1, odim, 3, 2), - paddle.nn.ReLU(), - paddle.nn.Conv2D(odim, odim, 3, 2), - paddle.nn.ReLU(), ) - self.out = paddle.nn.Sequential( - paddle.nn.Linear(odim * (((idim - 1) // 2 - 1) // 2), odim), + super().__init__() + self.conv = nn.Sequential( + nn.Conv2D(1, odim, 3, 2), + nn.ReLU(), + nn.Conv2D(odim, odim, 3, 2), + nn.ReLU(), ) + self.out = nn.Sequential( + nn.Linear(odim * (((idim - 1) // 2 - 1) // 2), odim), pos_enc if pos_enc is not None else PositionalEncoding(odim, dropout_rate), ) diff --git a/paddlespeech/t2s/training/optimizer.py b/paddlespeech/t2s/training/optimizer.py index c6a6944d10758e2c1d0e6fae79ae1db995b0ac32..907e3dafafb2ea2d78919f9d5042d7b8d402872c 100644 --- a/paddlespeech/t2s/training/optimizer.py +++ b/paddlespeech/t2s/training/optimizer.py @@ -12,6 +12,7 @@ # See the License for the specific language governing permissions and # limitations under the License. import paddle +from paddle import nn optim_classes = dict( adadelta=paddle.optimizer.Adadelta, @@ -25,7 +26,7 @@ optim_classes = dict( sgd=paddle.optimizer.SGD, ) -def build_optimizers(model: paddle.nn.Layer, +def build_optimizers(model: nn.Layer, optim='adadelta', max_grad_norm=None, learning_rate=0.01) -> paddle.optimizer: diff --git a/tests/unit/tts/test_stft.py b/tests/unit/tts/test_stft.py index d2d56dca400ffd8da055509583944d0b660a5cb4..624226e941c07778af1716728e2a62256124a958 100644 --- a/tests/unit/tts/test_stft.py +++ b/tests/unit/tts/test_stft.py @@ -11,52 +11,11 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -import librosa -import numpy as np import paddle import torch from parallel_wavegan.losses import stft_loss as sl -from scipy import signal -from paddlespeech.t2s.modules.stft_loss import MultiResolutionSTFTLoss -from paddlespeech.t2s.modules.stft_loss import STFT - - -def test_stft(): - stft = STFT(n_fft=1024, hop_length=256, win_length=1024) - x = paddle.uniform([4, 46080]) - S = stft.magnitude(x) - window = signal.get_window('hann', 1024, fftbins=True) - D2 = torch.stft( - torch.as_tensor(x.numpy()), - n_fft=1024, - hop_length=256, - win_length=1024, - window=torch.as_tensor(window)) - S2 = (D2**2).sum(-1).sqrt() - S3 = np.abs( - librosa.stft(x.numpy()[0], n_fft=1024, hop_length=256, win_length=1024)) - print(S2.shape) - print(S.numpy()[0]) - print(S2.data.cpu().numpy()[0]) - print(S3) - - -def test_torch_stft(): - # NOTE: torch.stft use no window by default - x = np.random.uniform(-1.0, 1.0, size=(46080, )) - window = signal.get_window('hann', 1024, fftbins=True) - D2 = torch.stft( - torch.as_tensor(x), - n_fft=1024, - hop_length=256, - win_length=1024, - window=torch.as_tensor(window)) - D3 = librosa.stft( - x, n_fft=1024, hop_length=256, win_length=1024, window='hann') - print(D2[:, :, 0].data.cpu().numpy()[:, 30:60]) - print(D3.real[:, 30:60]) - # print(D3.imag[:, 30:60]) +from paddlespeech.t2s.modules.losses import MultiResolutionSTFTLoss def test_multi_resolution_stft_loss():