Unverified · Commit e2ca0d8a · Authored by Feiyu Chan · Committed by GitHub

cherry-pick: update installation manuals for paddle 1.6 (#3678)

* update to 1.6 APIs and fix bugs with tensorboard. (#3663)

* initial commit for deepvoice, add tensorboard to requirements

* fix urls for code we adapted from

* fix makedirs for python2, fix README

* fix open with encoding for python2 compatibility

* fix python2's str(), use encode for unicode, and str() for int

* fix python2 encoding issue, add model architecture and project structure for README

* add model structure, add explanation for hyperparameter priority order.

* fix repetition in README_cn, and reorder it in README

* README update; fix integer division issues for python2 compatibility

* fix data type for input data, specify the int type as np.int64 to be platform agnostic

* fix README for preprocess.py, use io.open instead of open for python2 compatibility.

* update command line options, use new save/load API

* fix IO conflict bug for data parallel training

* only construct summary writer in process 0 to further avoid conflict

* update details about checkpoint in README (#3673)

* fix typos and update details for checkpoints

* update installation guide
Parent 8005424b
......@@ -8,11 +8,11 @@ We implement Deepvoice3 model in paddle fluid with dynamic graph, which is conve
### Install paddlepaddle
For faster training speed and better support, it is recommended that you install the lasted develop version of paddlepaddle. You can either download the lasted dev wheel or build paddle from source.
This implementation requires paddlepaddle 1.6. You can either download the compiled package or build paddle from source.
1. Download lasted wheel. See [**Multi-version whl package list - dev**](https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/install/Tables_en.html#multi-version-whl-package-list-dev) for more details.
1. Install the compiled package via pip, conda, or docker. See [**Installation Manuals**](https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/install/index_en.html) for more details.
2. Build paddlepaddle from source. See [**Compile From Source Code**](https://www.paddlepaddle.org.cn/documentation/docs/en/1.5/beginners_guide/install/compile/fromsource_en.html) for more details. Note that if you want to enable data parallel training for multiple GPUs, you should set `-DWITH_DISTRIBUTE=ON` with cmake.
2. Build paddlepaddle from source. See [**Compile From Source Code**](https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/install/compile/fromsource_en.html) for more details. Note that if you want to enable data parallel training for multiple GPUs, you should set `-DWITH_DISTRIBUTE=ON` with cmake.
### Other Requirements
......@@ -62,10 +62,11 @@ There are many hyperparameters to be tuned depending on the specification of mod
Note that `preprocess.py`, `train.py` and `synthesis.py` all accept a `--preset` parameter. To ensure consistency, you should use the same preset for preprocessing, training and synthesizing.
Note that you can overwrite preset hyperparameters with command line argument `--hparams`, just pass several key-value pair in `${key}=${value}` format seperated by comma (`,`). For example `--hparams="batch_size=8, nepochs=500"` can overwrite default values in the preset json file. For more details about hyperparameters, see `hparams.py`, which contains the definition of `hparams`. Priority order of hyperparameters is command line option `--hparams` > `--preset` json configuration file > definition of hparams in `hparams.py`.
Note that you can overwrite preset hyperparameters with the command line argument `--hparams`: just pass several key-value pairs in `${key}=${value}` format, separated by commas (`,`). For example, `--hparams="batch_size=8, nepochs=500"` overwrites the default values in the preset json file.
Some hyperparameters are only related to training, like `batch_size` and `checkpoint_interval`, so you can safely use different values for preprocessing and training. But hyperparameters related to data preprocessing, like `num_mels` and `ref_level_db`, should be kept the same for preprocessing and training.
For more details about hyperparameters, see `hparams.py`, which contains the definition of `hparams`. The priority order of hyperparameters is: the command line option `--hparams` > the `--preset` json configuration file > the definitions in `hparams.py`.
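The minimal sketch below illustrates this priority order; it assumes the `hparams` object defined in `hparams.py` and mirrors the `parse_json`/`parse` calls that `train.py` makes (the preset path and the override string are placeholders for what `--preset` and `--hparams` would carry):

```python
import io

from hparams import hparams  # step 1: defaults from hparams.py (lowest priority)

preset_path = "presets/deepvoice3_ljspeech.json"  # what --preset would point to
overrides = "batch_size=8,nepochs=500"            # what --hparams would carry

# step 2: the preset json overrides the defaults defined in hparams.py
with io.open(preset_path) as f:
    hparams.parse_json(f.read())

# step 3: --hparams overrides both the defaults and the preset (highest priority)
hparams.parse(overrides)
```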
### Dataset
......@@ -85,14 +86,14 @@ python preprocess.py \
${name} ${in_dir} ${out_dir}
```
Now `${dataset_name}$` only supports `ljspeech`. Support for other datasets is pending.
Now `${name}` only supports `ljspeech`. Support for other datasets is pending.
Assuming that you use `presers/deepvoice3_ljspeech.json` for LJSpeech and the path of the unziped dataset is `~/data/LJSpeech-1.1`, then you can preprocess data with the following command.
Assuming that you use `presets/deepvoice3_ljspeech.json` for LJSpeech and the path of the unzipped dataset is `./data/LJSpeech-1.1`, you can preprocess the data with the following command.
```bash
python preprocess.py \
--preset=presets/deepvoice3_ljspeech.json \
ljspeech ~/data/LJSpeech-1.1/ ./data/ljspeech
ljspeech ./data/LJSpeech-1.1/ ./data/ljspeech
```
When this is done, you will see extracted features in `./data/ljspeech` including:
......@@ -120,9 +121,9 @@ You can load saved checkpoint and resume training with `--checkpoint`, if you wa
#### Train a part of the model
You can also train parts of the model while freezing other parts, by passing `--train-seq2seq-only` or `--train-postnet-only`. When training only parts of the model, other parts should be loaded from saved checkpoints.
You can also train parts of the model while freezing other parts, by passing `--train-seq2seq-only` or `--train-postnet-only`. When training only parts of the model, the other parts should be loaded from a saved checkpoint.
To train only the `seq2seq` or `postnet`, you should load from a whole model with `--checkpoint`and keep the same configurations. Note that when training only the `postnet`, you should set `use_decoder_state_for_postnet_input=false`, because when train only the postnet, the postnet takes the ground truth mel-spectrogram as input.
To train only the `seq2seq` or the `postnet`, you should load a whole model with `--checkpoint` and keep the same configuration with which the checkpoint was trained. Note that when training only the `postnet`, you should set `use_decoder_state_for_postnet_input=false`, because in this case the postnet takes the ground truth mel-spectrogram as input. Note that the default value of `use_decoder_state_for_postnet_input` is `True`.
example:
......@@ -132,7 +133,7 @@ python train.py --data-root=${data-root} --use-gpu \
--preset=${preset_json_path} \
--hparams="parameters you may want to override" \
--train-seq2seq-only \
--checkpoint=${path_of_the_saved_model}
--output=${directory_to_save_results}
```
### Training on multiple GPUs
......@@ -147,7 +148,7 @@ python -m paddle.distributed.launch \
training_script ...
```
`paddle.distributed.launch` parallelizes training in multiprocessing mode.`--selected_gpus` means the logical ids of the selected GPUs, and `started_port` means the port used by the first worker. Outputs of each worker are saved in `--log_dir.` Then follows the command for training on a single GPU, except that you should pass `--use-data-paralle` in addition.
`paddle.distributed.launch` parallelizes training in multiprocessing mode. `--selected_gpus` means the logical ids of the selected GPUs, and `started_port` means the port used by the first worker. Outputs of each process are saved in `--log_dir`. Then follow it with the command for training on a single GPU, except that you should additionally pass `--use-data-parallel`.
```bash
export CUDA_VISIBLE_DEVICES=2,3,4,5 # The IDs of visible physical devices
......@@ -159,16 +160,16 @@ python -m paddle.distributed.launch \
--hparams="parameters you may want to override"
```
In the example above, we set only GPU `2, 3, 4, 5` to be visible. Then `--selected_gpus="0, 1, 2, 3"` means the logical ids of the selected gpus, which correpond to GPU `2, 3, 4, 5`.
In the example above, we set only GPUs `2, 3, 4, 5` to be visible. Then `--selected_gpus="0, 1, 2, 3"` means the logical ids of the selected GPUs, which correspond to GPUs `2, 3, 4, 5`.
Model checkpoints (directory ending with `.model`) are saved in `./checkpoints` per 10000 steps by default. Layer-wise averaged attention alignments (.png) are saved in `.checkpointys/alignment_ave`. And alignments for each attention layer are saved in `.checkpointys/alignment_layer{attention_layer_num}` per 10000 steps for inspection.
Model checkpoints (`*.pdparams` for the model and `*.pdopt` for the optimizer) are saved in `${directory_to_save_results}/checkpoints` every 10000 steps by default. Layer-wise averaged attention alignments (.png) are saved in `${directory_to_save_results}/checkpoints/alignment_ave`, and alignments for each attention layer are saved in `${directory_to_save_results}/checkpoints/alignment_layer{attention_layer_num}` every 10000 steps for inspection.
Synthesis results of 6 sentences (hardcoded in `eval_model.py`) are saved in `checkpoints/eval`, including `step{step_num}_text{text_id}_single_alignment.png` for averaged alignments and `step{step_num}_text{text_id}_single_predicted.wav` for the predicted waveforms.
Synthesis results of 6 sentences (hardcoded in `eval_model.py`) are saved in `${directory_to_save_results}/checkpoints/eval`, including `step{step_num}_text{text_id}_single_alignment.png` for averaged alignments and `step{step_num}_text{text_id}_single_predicted.wav` for the predicted waveforms.
### Monitor with Tensorboard
Logs with tensorboard are saved in `./log/${datetime}` directory by default. You can monitor logs by tensorboard.
Tensorboard logs are saved in the `${directory_to_save_results}/log/` directory by default. You can monitor them with tensorboard.
```bash
tensorboard --logdir=${log_dir} --host=$HOSTNAME --port=8888
......@@ -179,9 +180,9 @@ tensorboard --logdir=${log_dir} --host=$HOSTNAME --port=8888
Given a list of texts, `synthesis.py` synthesizes audio signals from a trained model.
```bash
python infer.py --use-gpu --preset=${preset_json_path} \
python synthesis.py --use-gpu --preset=${preset_json_path} \
--hparams="parameters you may want to override" \
${checkpoint} ${text_list_file} ${dst_dir}}
${checkpoint} ${text_list_file} ${dst_dir}
```
Example test_list.txt:
......@@ -198,7 +199,7 @@ generated waveform files and alignment files are saved in `${dst_dir}`.
According to [Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning](https://arxiv.org/abs/1710.07654), the position rate is different for different datasets. There are 2 position rates, one for the query and the other for the key, which are referred to as $\omega_1$ and $\omega_2$ in the paper, and the corresponding names in the preset json are `query_position_rate` and `key_position_rate`.
For example, the `query_position_rate` and `key_position_rate` for LJSpeech are `1.0` and `1.385`, respectively. These values are computed with `compute_timestamp_ratio.py`. Run the command below.
For example, the `query_position_rate` and `key_position_rate` for LJSpeech are `1.0` and `1.385`, respectively. With `query_position_rate` fixed at 1.0, `key_position_rate` is computed with `compute_timestamp_ratio.py`. Run the command below, where `${data_root}` is the path of the preprocessed dataset.
```bash
python compute_timestamp_ratio.py --preset=${preset_json_path} \
......@@ -216,4 +217,4 @@ Then set the `key_position_rate=1.385` and `query_position_rate=1.0` in the pres
## Acknowledgement
We thankfully included and adapted some files r9y9's from [deepvoice3_pytorch](https://github.com/r9y9/deepvoice3_pytorch).
We thankfully included and adapted some files from r9y9's [deepvoice3_pytorch](https://github.com/r9y9/deepvoice3_pytorch).
......@@ -9,9 +9,9 @@ Paddle 实现的 Deepvoice3,一个基于卷积神经网络的语音合成 (Tex
### Install the paddlepaddle framework
For faster training speed and better support, we recommend using the latest develop version of paddle. You can install the latest pre-built dev whl package, or build Paddle from source.
This implementation requires paddlepaddle 1.6. You can install the pre-built package, or build Paddle from source.
1. Download the latest pre-built dev whl package. You can pick a suitable version from the [**Multi-version whl package list - dev**](https://www.paddlepaddle.org.cn/documentation/docs/zh/beginners_guide/install/Tables.html#whl-dev) page.
1. Install the pre-built package via pip, conda or docker. See the [**Installation Manuals**](https://www.paddlepaddle.org.cn/documentation/docs/zh/beginners_guide/install/index_cn.html).
2. Build Paddle from source. See the [**Compile From Source Code**](https://www.paddlepaddle.org.cn/documentation/docs/zh/beginners_guide/install/compile/fromsource.html) page. Note that if you want to use multi-GPU training, you should set `-DWITH_DISTRIBUTE=ON` before compiling.
......@@ -62,9 +62,9 @@ nltk.download("cmudict")
`preprocess.py`, `train.py` and `synthesis.py` all accept a `--preset` parameter. For consistency, it is best to use the same preset configuration for data preprocessing, model training and speech synthesis.
You can override preset hyperparameters with the `--hparams` argument; the format is comma-separated key-value pairs `${key}=${value}`, for example `--hparams="batch_size=8, nepochs=500"`. For more details about hyperparameters, see `hparams.py`, which defines `hparams`. The priority order of hyperparameters is: arguments passed via the command line option `--hparams` override the json configuration file passed via `--preset`, which in turn overrides the definitions in `hparams.py`.
You can override preset hyperparameters with the `--hparams` argument; the format is comma-separated key-value pairs `${key}=${value}`, for example `--hparams="batch_size=8, nepochs=500"`.
Some hyperparameters may be related only to training, such as `batch_size` and `checkpoint_interval`, and you can use different values for them at preprocessing and training time. But some hyperparameters are related to data preprocessing, such as `num_mels` and `ref_level_db`, and they should be kept the same for preprocessing and training.
Some hyperparameters are only related to training, such as `batch_size` and `checkpoint_interval`, so you can use different values for preprocessing and training. But some hyperparameters are related to data preprocessing, such as `num_mels` and `ref_level_db`, and they should be kept the same for preprocessing and training.
For more details about hyperparameters, see `hparams.py`, which defines `hparams`. The priority order of hyperparameters is: the command line option `--hparams` > the json configuration file passed via `--preset` > the definitions in `hparams.py`.
......@@ -86,14 +86,14 @@ python preprocess.py \
${name} ${in_dir} ${out_dir}
```
Currently `${dataset_name}` only supports `ljspeech`. More datasets will be supported in the future.
Currently `${name}` only supports `ljspeech`. More datasets will be supported in the future.
Assuming you use `presets/deepvoice3_ljspeech.json` as the preset configuration file for LJSpeech and the unzipped dataset is located at `~/data/LJSpeech-1.1`, preprocess the data with the following command.
Assuming you use `presets/deepvoice3_ljspeech.json` as the preset configuration file for LJSpeech and the unzipped dataset is located at `./data/LJSpeech-1.1`, preprocess the data with the following command.
```bash
python preprocess.py \
--preset=presets/deepvoice3_ljspeech.json \
ljspeech ~/data/LJSpeech-1.1/ ./data/ljspeech
ljspeech ./data/LJSpeech-1.1/ ./data/ljspeech
```
When preprocessing is done, you will see the extracted features in `./data/ljspeech`, including the following files.
......@@ -123,7 +123,7 @@ python train.py --data-root=${data-root} --use-gpu \
You can pass `--train-seq2seq-only` or `--train-postnet-only` to freeze the other parts of the model and train only the part you need. When training only a part of the model, the other parts should be loaded from a saved model.
When training only the `seq2seq` part or the `postnet` part of the model, you need to load the whole model with `--checkpoint` and keep the same configuration. Note that when training only the `postnet`, make sure `use_decoder_state_for_postnet_input=false` in the configuration, because in this case the postnet takes the ground truth mel-spectrogram as input.
When training only the `seq2seq` part or the `postnet` part of the model, you need to load the whole model with `--checkpoint` and keep the same configuration. Note that when training only the `postnet`, make sure `use_decoder_state_for_postnet_input=false` in the configuration, because in this case the postnet takes the ground truth mel-spectrogram as input. Note that the default value of `use_decoder_state_for_postnet_input` is `True`.
Example:
......@@ -133,18 +133,18 @@ python train.py --data-root=${data-root} --use-gpu \
--preset=${preset_json_path} \
--hparams="parameters you may want to override" \
--train-seq2seq-only \
--checkpoint=${path_of_the_saved_model}
--output=${directory_to_save_results}
```
### Training on multiple GPUs

This model supports training on multiple GPUs via data parallelism, by launching `train.py` with the `paddle.distributed.launch` module.
```bash
python -m paddle.distributed.launch \
--started_port ${port_of_the_first_worker} \
--selected_gpus ${logical_gpu_ids_to_choose} \
--log_dir ${path_of_write_log} \
--log_dir ${path_to_write_log} \
training_script ...
```
......@@ -157,19 +157,20 @@ python -m paddle.distributed.launch \
train.py --data-root=${data-root} \
--use-gpu --use-data-parallel \
--preset=${preset_json_path} \
--hparams="parameters you may want to override"
--hparams="parameters you may want to override" \
--output=${directory_to_save_results}
```
In the example above, GPUs `2, 3, 4, 5` are set to be visible. Then `--selected_gpus=0,1,2,3` selects the logical ids of the GPUs, which correspond to GPUs `2, 3, 4, 5` respectively.
By default, the model is saved as a directory ending with `.model` in the `./checkpoints` folder. Layer-wise averaged attention alignments are saved as `.png` images, by default in `.checkpointys/alignment_ave`. Alignments for each attention layer are saved in the `.checkpointys/alignment_layer{attention_layer_num}` folder by default. They are saved every 10000 steps for inspection.
The model (parameters saved as `*.pdparams` files, the optimizer saved as `*.pdopt` files) is saved in the `${directory_to_save_results}/checkpoints` folder. Layer-wise averaged attention alignments are saved as `.png` images, by default in `${directory_to_save_results}/checkpoints/alignment_ave`. Alignments for each attention layer are saved in the `${directory_to_save_results}/checkpoints/alignment_layer{attention_layer_num}` folder by default. They are saved every 10000 steps for inspection.
Synthesis results for 6 given sentences are saved in `checkpoints/eval`, including the layer-wise averaged attention alignments, saved as images named `step{step_num}_text{text_id}_single_alignment.png`, and the synthesized audio files, saved as `step{step_num}_text{text_id}_single_predicted.wav`.
Synthesis results for 6 given sentences are saved in `${directory_to_save_results}/checkpoints/eval`, including the layer-wise averaged attention alignments, saved as images named `step{step_num}_text{text_id}_single_alignment.png`, and the synthesized audio files, saved as `step{step_num}_text{text_id}_single_predicted.wav`.
### Monitor training with Tensorboard

Tensorboard training logs are saved in the `./log/${datetime}` folder by default and can be viewed with tensorboard. Usage is as follows.
Tensorboard training logs are saved in the `${directory_to_save_results}/log/` folder and can be viewed with tensorboard. Usage is as follows.
```bash
tensorboard --logdir=${log_dir} --host=$HOSTNAME --port=8888
......@@ -180,9 +181,9 @@ tensorboard --logdir=${log_dir} --host=$HOSTNAME --port=8888
Given a list of texts, use `synthesis.py` to synthesize speech from a trained model as follows.
```bash
python infer.py --use-gpu --preset=${preset_json_path} \
python synthesis.py --use-gpu --preset=${preset_json_path} \
--hparams="parameters you may want to override" \
${checkpoint} ${text_list_file} ${dst_dir}}
${checkpoint} ${text_list_file} ${dst_dir}
```
An example text file:
......@@ -199,7 +200,7 @@ A text-to-speech synthesis system typically consists of multiple stages, such as
According to [Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning](https://arxiv.org/abs/1710.07654), the position rate differs for different datasets. There are two position rates, one for the query and one for the key, which are referred to as $\omega_1$ and $\omega_2$ in the paper; their names in the preset configuration file are `query_position_rate` and `key_position_rate` respectively.
For example, the `query_position_rate` and `key_position_rate` of the LJSpeech dataset are `1.0` and `1.385` respectively. These values can be computed with `compute_timestamp_ratio.py`, using the following command.
For example, the `query_position_rate` and `key_position_rate` of the LJSpeech dataset are `1.0` and `1.385` respectively. With `query_position_rate` fixed at 1.0, `key_position_rate` can be computed with `compute_timestamp_ratio.py`. The command is as follows, where `${data_root}` is the path of the preprocessed dataset.
```bash
python compute_timestamp_ratio.py --preset=${preset_json_path} \
......
# Part of code was adapted from https://github.com/r9y9/deepvoice3_pytorch/tree/master/compute_timestamp_ratio.py
# Copyright (c) 2017: Ryuichi Yamamoto.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import sys
import io
import numpy as np
from hparams import hparams, hparams_debug_string
from deepvoice3_paddle.data import TextDataSource, MelSpecDataSource
......@@ -34,7 +39,7 @@ if __name__ == "__main__":
# Load preset if specified
if preset is not None:
with open(preset) as f:
with io.open(preset) as f:
hparams.parse_json(f.read())
# Override hyper parameters
hparams.parse(args.hparams)
......
# Parameter conversion
## generate name map
To convert a model trained with `https://github.com/r9y9/deepvoice3_pytorch`, we provide a script that generates a name map between the pytorch model and the paddle model for `deepvoice3`. You can provide `--preset` and `--hparams` to specify the model's configuration.
```bash
python generate_name_map.py --preset=${preset_to_use} --hparams="hyper parameters to overwrite"
```
It prints a name map. The format of the name map file looks like this: each line consists of 3 fields; the first is the name of a parameter in the saved state dict of the pytorch model, and the second and third are the name and shape of the corresponding parameter in the saved state dict of the paddle model.
```
seq2seq.encoder.embed_tokens.weight encoder/Encoder_0/Embedding_0.w_0 [149, 256]
seq2seq.encoder.convolutions.0.bias encoder/Encoder_0/ConvProj1D_1/Conv2D_0.b_0 [512]
seq2seq.encoder.convolutions.0.weight_g encoder/Encoder_0/ConvProj1D_1/Conv2D_0.w_1 [512]
```
Redirect the output to a file to save it.
```bash
python generate_name_map.py --preset=${preset_to_use} --hparams="hyper parameters to overwrite" > name_map.txt
```
## convert saved pytorch model to paddle model
Given a name map and a saved pytorch model, you can convert it to a paddle model.
```bash
python convert.py \
--pytorch-model ${pytorch_model.pth} \
--paddle-model ${path_to_save_paddle_model} \
--name-map ${name_map_path}
```
Note that the user should provide the name map file, and ensure that the models are equivalent to each other and that corresponding parameters have the right shapes.
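Before running `convert.py`, a quick sanity check along the following lines can catch missing keys or shape mismatches early. This is a hypothetical helper, not part of the repository; it only assumes the tab-separated, 3-field name map format described above and the `state_dict` layout that `convert.py` itself expects.

```python
import io

import numpy as np
import torch

def check_name_map(pytorch_model_path, name_map_path):
    """Check that every mapped pytorch parameter exists and that the target
    paddle shape holds the same number of elements."""
    state_dict = torch.load(pytorch_model_path, map_location="cpu")["state_dict"]
    with io.open(name_map_path, "rt") as f:
        for line in f:
            src_key, tgt_key, tgt_shape = line.strip().split("\t")
            tgt_shape = eval(tgt_shape)  # e.g. "[149, 256]" -> [149, 256]
            if src_key not in state_dict:
                print("missing key in pytorch state dict:", src_key)
            elif state_dict[src_key].numel() != int(np.prod(tgt_shape)):
                print("shape mismatch:", src_key, "->", tgt_key, tgt_shape)

# check_name_map("checkpoint.pth", "name_map.txt")
```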
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import torch
import paddle
from paddle import fluid
import argparse
parser = argparse.ArgumentParser()
parser.add_argument(
    "--pytorch-model",
    dest='pytorch_model',
    type=str,
    help="The source pytorch model.")
parser.add_argument(
    "--paddle-model",
    dest='paddle_model',
    type=str,
    help="The directory to save the paddle model; the model is saved as a folder.")
parser.add_argument(
    "--name-map",
    dest="name_map",
    type=str,
    help="Name mapping for the source model and the target model.")


def read_name_map(fname):
    """
    Read a 3-column, tab-separated name map file.
    The first column is the name of a parameter in the pytorch model's state dict;
    the second column is the name of the parameter in the paddle model's state dict;
    the third column is the shape of the respective parameter in the paddle model.
    """
    name_map = {}
    with open(fname, 'rt') as f:
        for line in f:
            src_key, tgt_key, tgt_shape = line.strip().split('\t')
            tgt_shape = eval(tgt_shape)
            name_map[src_key] = (tgt_key, tgt_shape)
    return name_map


def torch2paddle(state_dict, name_map, dirname):
    """
    state_dict: the pytorch model's state dict.
    name_map: the name mapping (dict) from pytorch parameters to paddle parameters.
    dirname: path to save the paddle model.
    """
    # create a placeholder parameter for every mapped name; the real values
    # and shapes are filled in below when the tensors are set
    program = fluid.Program()
    global_block = program.global_block()
    for k in state_dict.keys():
        global_block.create_parameter(
            name=name_map[k][0],
            shape=[1],
            dtype='float32',
            initializer=fluid.initializer.Constant(value=0.0))
    place = fluid.core.CPUPlace()
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())
    exe.run(program)

    # NOTE: transpose the pytorch model's parameters if necessary.
    # We do not transpose here because we used conv instead of an FC layer to
    # replace Linear in pytorch, which does not require transposing the
    # parameters. But when you use an FC layer corresponding to a torch Linear
    # module, be sure to transpose the weight.
    # Other transformations are not handled here, so users should check the
    # data shapes to ensure that the transformations are what's expected.
    for k, v in state_dict.items():
        fluid.global_scope().find_var(name_map[k][0]).get_tensor().set(
            v.cpu().numpy().reshape(name_map[k][1]), place)
    fluid.io.save_params(exe, dirname, main_program=program)


if __name__ == "__main__":
    args, _ = parser.parse_known_args()
    result = torch.load(args.pytorch_model)
    state_dict = result["state_dict"]
    name_map = read_name_map(args.name_map)
    torch2paddle(state_dict, name_map, args.paddle_model)
......@@ -15,6 +15,7 @@
import numpy as np
import random
import io
import platform
from os.path import dirname, join
......@@ -52,15 +53,14 @@ class TextDataSource(FileDataSource):
def collect_files(self):
meta = join(self.data_root, "train.txt")
with open(meta, "rb") as f:
with io.open(meta, "rt", encoding="utf-8") as f:
lines = f.readlines()
l = lines[0].decode("utf-8").split("|")
l = lines[0].split("|")
assert len(l) == 4 or len(l) == 5
self.multi_speaker = len(l) == 5
texts = list(map(lambda l: l.decode("utf-8").split("|")[3], lines))
texts = list(map(lambda l: l.split("|")[3], lines))
if self.multi_speaker:
speaker_ids = list(
map(lambda l: int(l.decode("utf-8").split("|")[-1]), lines))
speaker_ids = list(map(lambda l: int(l.split("|")[-1]), lines))
# Filter by speaker_id
# using multi-speaker dataset as a single speaker dataset
if self.speaker_id is not None:
......@@ -106,21 +106,18 @@ class _NPYDataSource(FileDataSource):
def collect_files(self):
meta = join(self.data_root, "train.txt")
with open(meta, "rb") as f:
with io.open(meta, "rt", encoding="utf-8") as f:
lines = f.readlines()
l = lines[0].decode("utf-8").split("|")
l = lines[0].split("|")
assert len(l) == 4 or len(l) == 5
multi_speaker = len(l) == 5
self.frame_lengths = list(
map(lambda l: int(l.decode("utf-8").split("|")[2]), lines))
self.frame_lengths = list(map(lambda l: int(l.split("|")[2]), lines))
paths = list(
map(lambda l: l.decode("utf-8").split("|")[self.col], lines))
paths = list(map(lambda l: l.split("|")[self.col], lines))
paths = list(map(lambda f: join(self.data_root, f), paths))
if multi_speaker and self.speaker_id is not None:
speaker_ids = list(
map(lambda l: int(l.decode("utf-8").split("|")[-1]), lines))
speaker_ids = list(map(lambda l: int(l.split("|")[-1]), lines))
# Filter by speaker_id
# using multi-speaker dataset as a single speaker dataset
indices = np.array(speaker_ids) == self.speaker_id
......@@ -297,7 +294,7 @@ def create_batch(batch):
# text positions
text_positions = np.array(
[_pad(np.arange(1, len(x[0]) + 1), max_input_len) for x in batch],
dtype=np.int)
dtype=np.int64)
text_positions = np.expand_dims(text_positions, axis=-1)
max_decoder_target_len = max_target_len // r // downsample_step
......@@ -306,7 +303,8 @@ def create_batch(batch):
s, e = 1, max_decoder_target_len + 1
frame_positions = np.tile(
np.expand_dims(
np.arange(s, e), axis=0), (len(batch), 1))
np.arange(
s, e, dtype=np.int64), axis=0), (len(batch), 1))
frame_positions = np.expand_dims(frame_positions, axis=-1)
# done flags
......
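The explicit `np.int64` in the batch-building code above matters because numpy's default integer type follows the platform's C `long`, so positions built with a bare `np.arange` can come out as `int32` on Windows and `int64` on most Linux builds. A minimal standalone sketch of the difference (not code from the repo):

```python
import numpy as np

# without an explicit dtype, the integer type is platform dependent
positions = np.arange(1, 8)
print(positions.dtype)  # int32 on Windows, int64 on most 64-bit Linux builds

# pinning the dtype keeps the batch layout platform agnostic,
# matching the int64 inputs the model expects
positions = np.arange(1, 8, dtype=np.int64)
print(positions.dtype)  # always int64
```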
......@@ -219,7 +219,6 @@ class Encoder(dg.Layer):
values (Variable), Shape(B, C_embed, 1, T_enc), the encoded
representation for values.
"""
x = self.embed(x)
x = fluid.layers.dropout(
......
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
from os.path import dirname, join
import paddle
from paddle import fluid
import paddle.fluid.dygraph as dg

def _load(checkpoint_path):
    """
    Load the saved state dict and the optimizer state (optional).
    """
    state_dict, optimizer_state = dg.load_persistables(dirname=checkpoint_path)
    return state_dict, optimizer_state


def load_checkpoint(path, model, optimizer=None, reset_optimizer=True):
    """
    For layers like FC, Conv*, ..., the Layer does not initialize its
    parameters before the first run.

    1. If you want to load only a part of a saved whole model into part of an
    existing model, just pass that part as the target model, and the path of
    the saved whole model as the source path.
    2. If you want to load exactly what was saved, just pass the model and
    path as expected.

    The rule of thumb is:
    1. Loading to a model works by name, a unique global name.
    2. Loading from a directory works by file structure; each parameter is
    saved in a file. Loading a file from directory A/ would `create` a
    corresponding Variable for each saved parameter, whose name is the file's
    relative path from directory A/.
    """
    print("Load checkpoint from: {}".format(path))
    state_dict, optimizer_state = _load(path)
    model.load_dict(state_dict)
    if not reset_optimizer and optimizer is not None:
        if optimizer_state is not None:
            print("[loading] Load optimizer state from {}".format(path))
            optimizer.load(optimizer_state)
    return model


def _load_embedding(path, model):
    print("[loading] Loading embedding from {}".format(path))
    state_dict, optimizer_state = _load(path)
    key = os.path.join(model.full_name(),
                       "ConvS2S_0/Encoder_0/Embedding_0.w_0")
    tensor = model.state_dict()[key]._ivar.value().get_tensor()
    tensor.set(state_dict[key], fluid.framework._current_expected_place())


def save_checkpoint(model, optimizer, checkpoint_dir, global_step):
    checkpoint_path = join(checkpoint_dir,
                           "checkpoint_step{:09d}.model".format(global_step))
    dg.save_persistables(
        model.state_dict(), dirname=checkpoint_path, optimizers=optimizer)
    print("[checkpoint] Saved checkpoint:", checkpoint_path)
......@@ -23,7 +23,7 @@ from warnings import warn
from datetime import datetime
import matplotlib
# Force matplotlib to not use any Xwindows backend.
# Force matplotlib not to use any Xwindows backend.
matplotlib.use("Agg")
from matplotlib import pyplot as plt
from matplotlib import cm
......@@ -109,7 +109,7 @@ def prepare_spec_image(spectrogram):
where T means the time steps of the spectrogram. It is treated
as the width of the image. And C means the channels of the
spectrogram, which is treated as the height of the image. And 4
means it is a 'ARGP' format.
means it is a 'ARGB' format.
"""
# [0, 1]
......
# coding: utf-8
# Part of code was adapted from https://github.com/r9y9/deepvoice3_pytorch/tree/master/preprocess.py
# Copyright (c) 2017: Ryuichi Yamamoto.
......@@ -78,7 +77,7 @@ if __name__ == "__main__":
# Load preset if specified
if preset is not None:
with open(preset) as f:
with io.open(preset) as f:
hparams.parse_json(f.read())
# Override hyper parameters
hparams.parse(args.hparams)
......
......@@ -9,6 +9,6 @@ tqdm==4.35.0
tensorboardX==1.8
matplotlib
requests==2.22.0
lws
lws==1.2.4
nnmnkwii
tensorboard
......@@ -19,6 +19,7 @@ from __future__ import print_function
import argparse
import sys
import os
import io
from os.path import dirname, join, basename, splitext, exists
from tqdm import tqdm
import numpy as np
......@@ -34,7 +35,6 @@ from deepvoice3_paddle.dry_run import dry_run
from hparams import hparams
from train import make_deepvoice3_from_hparams
from eval_model import tts, plot_alignment
from deepvoice3_paddle.save_load import load_checkpoint
def build_parser():
......@@ -51,14 +51,6 @@ def build_parser():
"--use-gpu",
action="store_true",
help="Whether to use gpu for generation.")
parser.add_argument(
"--checkpoint-seq2seq",
type=str,
help="Load seq2seq model from checkpoint path.")
parser.add_argument(
"--checkpoint-postnet",
type=str,
help="Load postnet model from checkpoint path.")
parser.add_argument(
"--file-name-suffix", type=str, default="", help="File name suffix.")
parser.add_argument(
......@@ -91,8 +83,6 @@ if __name__ == "__main__":
text_list_file_path = args.text_list_file
dst_dir = args.dst_dir
use_gpu = args.use_gpu
checkpoint_seq2seq_path = args.checkpoint_seq2seq
checkpoint_postnet_path = args.checkpoint_postnet
max_decoder_steps = args.max_decoder_steps
file_name_suffix = args.file_name_suffix
......@@ -107,7 +97,7 @@ if __name__ == "__main__":
# Load preset if specified
if preset is not None:
with open(preset) as f:
with io.open(preset) as f:
hparams.parse_json(f.read())
# Override hyper parameters
hparams.parse(args.hparams)
......@@ -118,18 +108,19 @@ if __name__ == "__main__":
# Model
model = make_deepvoice3_from_hparams(hparams)
dry_run(model)
load_checkpoint(checkpoint_path, model)
model_dict, _ = dg.load_dygraph(args.checkpoint)
model.set_dict(model_dict)
checkpoint_name = splitext(basename(checkpoint_path))[0]
model.seq2seq.decoder.max_decoder_steps = max_decoder_steps
if not os.path.exists(dst_dir):
os.mkdir(dst_dir)
with open(text_list_file_path, "rb") as f:
os.makedirs(dst_dir)
with io.open(text_list_file_path, "rt", encoding="utf-8") as f:
lines = f.readlines()
for idx, line in enumerate(lines):
text = line.decode("utf-8")[:-1]
text = line[:-1]
words = nltk.word_tokenize(text)
waveform, alignment, _, _ = tts(model,
text,
......
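For reference, the "new save/load API" mentioned in the commit message boils down to the `dg.save_dygraph`/`dg.load_dygraph` pair used above and in `train_model.py`. A minimal sketch of the round trip, assuming `model` and `optimizer` are the dygraph objects built by the training script (the path prefix is a placeholder):

```python
import paddle.fluid.dygraph as dg

step_path = "result/checkpoints/checkpoint_000010000"  # prefix, no extension

# saving: parameters and optimizer state go to separate files;
# paddle appends the .pdparams / .pdopt suffixes to the prefix
dg.save_dygraph(model.state_dict(), step_path)
dg.save_dygraph(optimizer.state_dict(), step_path)

# loading: load_dygraph returns (parameter dict, optimizer state dict)
model_dict, optimizer_dict = dg.load_dygraph(step_path)
model.set_dict(model_dict)
```

Splitting parameters (`*.pdparams`) and optimizer state (`*.pdopt`) into separate files is what the updated checkpoint layout described in the README refers to.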
......@@ -16,6 +16,9 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import io
from paddle import fluid
import paddle.fluid.dygraph as dg
......@@ -27,18 +30,11 @@ from deepvoice3_paddle.data import (TextDataSource, MelSpecDataSource,
LinearSpecDataSource,
PartialyRandomizedSimilarTimeLengthSampler,
Dataset, make_loader, create_batch)
from deepvoice3_paddle import frontend
from deepvoice3_paddle.builder import deepvoice3, WindowRange
from deepvoice3_paddle.dry_run import dry_run
from deepvoice3_paddle.save_load import load_checkpoint, _load_embedding, save_checkpoint
from train_model import train_model
from deepvoice3_paddle.loss import TTSLoss
import platform
from datetime import datetime
from tensorboardX import SummaryWriter
......@@ -57,10 +53,10 @@ def build_arg_parser():
parser.add_argument(
"--use-gpu", action="store_true", help="Whether to use gpu training.")
parser.add_argument(
"--checkpoint-dir",
"--output",
type=str,
default="checkpoints",
help="Directory where to save model checkpoints")
default="result",
help="Directory to save results")
parser.add_argument(
"--preset",
type=str,
......@@ -87,9 +83,6 @@ def build_arg_parser():
"--train-postnet-only",
action="store_true",
help="Train only postnet model.")
parser.add_argument("--log-event-path", type=str, help="Log event path.")
parser.add_argument(
"--load-embedding", type=str, help="Load embedding from checkpoint.")
parser.add_argument(
"--speaker-id",
type=int,
......@@ -152,21 +145,6 @@ def make_optimizer_from_hparams(hparams):
return optim, clipper
def make_writer_from_args(args):
if args.log_event_path is None:
if platform.system() == "Windows":
log_event_path = "log/run-test" + str(datetime.now()).replace(
" ", "_").replace(":", "_")
else:
log_event_path = "log/run-test" + str(datetime.now()).replace(" ",
"_")
else:
log_event_path = args.log_event_path
print("Log event path: {}".format(log_event_path))
writer = SummaryWriter(log_event_path)
return writer
def make_loss_from_hparams(hparams):
criterion = TTSLoss(
hparams.masked_loss_weight, hparams.priority_freq_weight,
......@@ -202,12 +180,19 @@ if __name__ == "__main__":
# Load preset if specified
if args.preset is not None:
with open(args.preset) as f:
with io.open(args.preset) as f:
hparams.parse_json(f.read())
# Override hyper parameters
hparams.parse(args.hparams)
print(hparams_debug_string())
checkpoint_dir = os.path.join(args.output, "checkpoints")
tensorboard_dir = os.path.join(args.output, "log")
if not os.path.exists(checkpoint_dir):
os.makedirs(checkpoint_dir)
if not os.path.exists(tensorboard_dir):
os.makedirs(tensorboard_dir)
data_root = args.data_root
speaker_id = args.speaker_id
X = FileSourceDataset(TextDataSource(data_root, speaker_id))
......@@ -239,7 +224,8 @@ if __name__ == "__main__":
model = make_deepvoice3_from_hparams(hparams)
optimizer, clipper = make_optimizer_from_hparams(hparams)
writer = make_writer_from_args(args)
print("Log event path: {}".format(tensorboard_dir))
writer = SummaryWriter(tensorboard_dir) if local_rank == 0 else None
criterion = make_loss_from_hparams(hparams)
# loading saved model
......@@ -250,17 +236,8 @@ if __name__ == "__main__":
assert hparams.use_decoder_state_for_postnet_input is False, \
"when training only the postnet, there is no decoder states"
dry_run(model)
if args.checkpoint is not None:
load_checkpoint(
args.checkpoint,
model,
optimizer,
reset_optimizer=args.reset_optimizer)
if args.load_embedding is not None:
_load_embedding(args.load_embedding, model)
model_dict, optimizer_dict = dg.load_dygraph(args.checkpoint)
if args.use_data_parallel:
strategy = dg.parallel.prepare_context()
......
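For context, `prepare_context()` above is the first step of the fluid dygraph data-parallel pattern. A rough sketch of the remaining steps as we understand the paddle 1.6 dygraph API (here `model`, `optimizer`, and `loss` stand in for the objects built in `train.py`/`train_model.py`, so this is not a verbatim excerpt from the repo):

```python
import paddle.fluid.dygraph as dg

strategy = dg.parallel.prepare_context()           # set up multi-process communication
model = dg.parallel.DataParallel(model, strategy)  # wrap the dygraph model

# inside the training loop:
loss = model.scale_loss(loss)     # rescale the loss for gradient averaging
loss.backward()
model.apply_collective_grads()    # all-reduce gradients across processes
optimizer.minimize(loss)
model.clear_gradients()
```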
......@@ -23,20 +23,16 @@ from paddle import fluid
import paddle.fluid.dygraph as dg
from tqdm import tqdm
from deepvoice3_paddle.save_load import save_checkpoint
from eval_model import eval_model, save_states
def train_model(model, loader, criterion, optimizer, clipper, writer, args,
hparams):
assert fluid.framework.in_dygraph_mode(
), "this function must be run with dygraph guard"
), "this function must be run within dygraph guard"
local_rank = dg.parallel.Env().local_rank
if not os.path.exists(args.checkpoint_dir):
os.mkdir(args.checkpoint_dir)
# amount of shifting when compute losses
linear_shift = hparams.outputs_per_step
mel_shift = hparams.outputs_per_step
......@@ -44,6 +40,8 @@ def train_model(model, loader, criterion, optimizer, clipper, writer, args,
global_step = 0
global_epoch = 0
ismultispeaker = model.n_speakers > 1
checkpoint_dir = os.path.join(args.output, "checkpoints")
tensorboard_dir = os.path.join(args.output, "log")
for epoch in range(hparams.nepochs):
epoch_loss = 0.
......@@ -113,7 +111,7 @@ def train_model(model, loader, criterion, optimizer, clipper, writer, args,
linear_predicted, linear_target, linear_mask)
lin_loss = criterion.binary_divergence_weight * lin_div \
+ (1 - criterion.binary_divergence_weight) * lin_l1_loss
if writer is not None:
if writer is not None and local_rank == 0:
writer.add_scalar("linear_loss",
float(lin_loss.numpy()), global_step)
writer.add_scalar("linear_l1_loss",
......@@ -134,7 +132,7 @@ def train_model(model, loader, criterion, optimizer, clipper, writer, args,
mel_mask)
mel_loss = criterion.binary_divergence_weight * mel_div \
+ (1 - criterion.binary_divergence_weight) * mel_l1_loss
if writer is not None:
if writer is not None and local_rank == 0:
writer.add_scalar("mel_loss",
float(mel_loss.numpy()), global_step)
writer.add_scalar("mel_l1_loss",
......@@ -143,7 +141,7 @@ def train_model(model, loader, criterion, optimizer, clipper, writer, args,
float(mel_div.numpy()), global_step)
done_loss = criterion.done_loss(done_hat, done)
if writer is not None:
if writer is not None and local_rank == 0:
writer.add_scalar("done_loss",
float(done_loss.numpy()), global_step)
......@@ -153,7 +151,7 @@ def train_model(model, loader, criterion, optimizer, clipper, writer, args,
attn_loss = criterion.attention_loss(alignments,
input_lengths.numpy(),
decoder_length)
if writer is not None:
if writer is not None and local_rank == 0:
writer.add_scalar("attention_loss",
float(attn_loss.numpy()), global_step)
......@@ -169,7 +167,8 @@ def train_model(model, loader, criterion, optimizer, clipper, writer, args,
loss = mel_loss + done_loss
else:
loss = lin_loss
if writer is not None:
if writer is not None and local_rank == 0:
writer.add_scalar("loss", float(loss.numpy()), global_step)
if isinstance(optimizer._learning_rate,
......@@ -177,7 +176,8 @@ def train_model(model, loader, criterion, optimizer, clipper, writer, args,
current_lr = optimizer._learning_rate.step().numpy()
else:
current_lr = optimizer._learning_rate
writer.add_scalar("learning_rate", current_lr, global_step)
if writer is not None and local_rank == 0:
writer.add_scalar("learning_rate", current_lr, global_step)
epoch_loss += loss.numpy()[0]
......@@ -185,13 +185,15 @@ def train_model(model, loader, criterion, optimizer, clipper, writer, args,
global_step % hparams.checkpoint_interval == 0):
save_states(global_step, writer, mel_outputs, linear_outputs,
alignments, mel, linear,
input_lengths.numpy(), args.checkpoint_dir)
save_checkpoint(model, optimizer, args.checkpoint_dir,
global_step)
input_lengths.numpy(), checkpoint_dir)
step_path = os.path.join(
checkpoint_dir, "checkpoint_{:09d}".format(global_step))
dg.save_dygraph(model.state_dict(), step_path)
dg.save_dygraph(optimizer.state_dict(), step_path)
if (local_rank == 0 and global_step > 0 and
global_step % hparams.eval_interval == 0):
eval_model(global_step, writer, model, args.checkpoint_dir,
eval_model(global_step, writer, model, checkpoint_dir,
ismultispeaker)
if args.use_data_parallel:
......@@ -232,5 +234,9 @@ def train_model(model, loader, criterion, optimizer, clipper, writer, args,
global_step += 1
average_loss_in_epoch = epoch_loss / (step + 1)
print("Epoch loss: {}".format(average_loss_in_epoch))
if writer is not None and local_rank == 0:
writer.add_scalar("average_loss_in_epoch", average_loss_in_epoch,
global_epoch)
global_epoch += 1
print("Epoch loss: {}".format(epoch_loss / (step + 1)))