Unverified commit cae023eb, authored by Feiyu Chan, committed by GitHub

update details about checkpoint in README (#3673)

* initial commit for deepvoice, add tensorboard to requirements

* fix urls for code we adapted from

* fix makedirs for python2, fix README

* fix open with encoding for python2 compatibility

* fix python2's str(), use encode for unicode, and str() for int

* fix python2 encoding issue, add model architecture and project structure for README

* add model structure, add explanation for hyperparameter priority order.

* fix repetition in README_cn, and reorder that in README

* README update; fix integer division issues for python2 compatibility

* fix data type for input data, specify the int type as np.int64 to be platform agnostic

* fix README for preprocess.py, use io.open instead of open for python2 compatibility.

* update command line options, use new save/load API

* fix IO conflict bug for data parallel training

* only construct summary writer in process 0 to further avoid conflict

* fix typos and update details for checkpoints
Parent: 044f19e7
@@ -88,12 +88,12 @@ python preprocess.py \
Now `${name}` only supports `ljspeech`. Support for other datasets is pending.
Assuming that you use `presets/deepvoice3_ljspeech.json` for LJSpeech and the path of the unzipped dataset is `./data/LJSpeech-1.1`, you can preprocess the data with the following command.
```bash
python preprocess.py \
    --preset=presets/deepvoice3_ljspeech.json \
    ljspeech ./data/LJSpeech-1.1/ ./data/ljspeech
```
When this is done, you will see extracted features in `./data/ljspeech` including:
@@ -123,7 +123,7 @@ You can load saved checkpoint and resume training with `--checkpoint`, if you wa
You can also train parts of the model while freezing other parts, by passing `--train-seq2seq-only` or `--train-postnet-only`. When training only parts of the model, the other parts should be loaded from a saved checkpoint.
To train only the `seq2seq` or `postnet`, you should load from a whole model with `--checkpoint` and keep the same configurations with which the checkpoint was trained. Note that when training only the `postnet`, you should set `use_decoder_state_for_postnet_input=false`, because in this case the postnet takes the ground truth mel-spectrogram as input; the default value for `use_decoder_state_for_postnet_input` is `True`.
Example:
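The following is a minimal sketch of such a command (the checkpoint path and the exact set of required arguments are assumptions, not the repository's canonical invocation):

```bash
# Sketch: load a full checkpoint but update only the postnet.
# ${path_to_saved_checkpoint} is a placeholder for a checkpoint saved earlier.
python train.py --data-root=${data_root} --use-gpu \
    --preset=presets/deepvoice3_ljspeech.json \
    --checkpoint=${path_to_saved_checkpoint} \
    --train-postnet-only \
    --hparams="use_decoder_state_for_postnet_input=false"
```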
@@ -148,7 +148,7 @@ python -m paddle.distributed.launch \
    training_script ...
```
`paddle.distributed.launch` parallelizes training in multiprocessing mode. `--selected_gpus` means the logical ids of the selected GPUs, and `started_port` means the port used by the first worker. Outputs of each process are saved in `--log_dir`. The command below is the same as the one for training on a single GPU, except that you should also pass `--use-data-parallel`.
```bash
export CUDA_VISIBLE_DEVICES=2,3,4,5   # The IDs of visible physical devices
@@ -160,11 +160,11 @@ python -m paddle.distributed.launch \
    --hparams="parameters you may want to override"
```
In the example above, we set only GPUs `2, 3, 4, 5` to be visible. Then `--selected_gpus="0, 1, 2, 3"` means the logical ids of the selected GPUs, which correspond to GPUs `2, 3, 4, 5`.
Model checkpoints (`*.pdparams` for the model and `*.pdopt` for the optimizer) are saved in `${directory_to_save_results}/checkpoints` every 10000 steps by default. Layer-wise averaged attention alignments (`.png`) are saved in `${directory_to_save_results}/checkpoints/alignment_ave`, and alignments for each attention layer are saved in `${directory_to_save_results}/checkpoints/alignment_layer{attention_layer_num}` every 10000 steps for inspection.
Synthesis results of 6 sentences (hardcoded in `eval_model.py`) are saved in `${directory_to_save_results}/checkpoints/eval`, including `step{step_num}_text{text_id}_single_alignment.png` for averaged alignments and `step{step_num}_text{text_id}_single_predicted.wav` for the predicted waveforms.
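For orientation, a hypothetical layout of the results directory (the checkpoint file names are placeholders; only the directory names and the eval file patterns come from the description above):

```
${directory_to_save_results}/checkpoints/
├── <checkpoint_name>.pdparams              # model parameters (placeholder name)
├── <checkpoint_name>.pdopt                 # optimizer state (placeholder name)
├── alignment_ave/                          # layer-wise averaged alignments (.png)
├── alignment_layer{attention_layer_num}/   # per-layer alignments (.png)
└── eval/
    ├── step{step_num}_text{text_id}_single_alignment.png
    └── step{step_num}_text{text_id}_single_predicted.wav
```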
### Monitor with Tensorboard
@@ -199,7 +199,7 @@ generated waveform files and alignment files are saved in `${dst_dir}`.
According to [Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning](https://arxiv.org/abs/1710.07654), the position rate is different for different datasets. There are 2 position rates, one for the query and the other for the key, which are referred to as $\omega_1$ and $\omega_2$ in the paper, and the corresponding names in the preset json are `query_position_rate` and `key_position_rate`.
For example, the `query_position_rate` and `key_position_rate` for LJSpeech are `1.0` and `1.385`, respectively. With `query_position_rate` fixed at `1.0`, the `key_position_rate` can be computed with `compute_timestamp_ratio.py`. Run the command below, where `${data_root}` means the path of the preprocessed dataset.
```bash
python compute_timestamp_ratio.py --preset=${preset_json_path} \
```
@@ -217,4 +217,4 @@ Then set the `key_position_rate=1.385` and `query_position_rate=1.0` in the pres
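For reference, a sketch of how these two keys might appear in a preset json (the flat key-value structure is an assumption about the preset format, not an excerpt from the shipped presets):

```json
{
    "query_position_rate": 1.0,
    "key_position_rate": 1.385
}
```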
## Acknowledgement
We thankfully included and adapted some files from r9y9's [deepvoice3_pytorch](https://github.com/r9y9/deepvoice3_pytorch).
@@ -64,7 +64,7 @@ nltk.download("cmudict")
You can override the preset hyperparameters with the `--hparams` option, whose value is a comma-separated list of `${key}=${value}` pairs, for example `--hparams="batch_size=8, nepochs=500"`.
Some hyperparameters only affect training, such as `batch_size` and `checkpoint_interval`, and you may use different values for them at training time. But some hyperparameters are related to data preprocessing, such as `num_mels` and `ref_level_db`; these should be kept consistent between preprocessing and training.
For more details about the hyperparameters, see `hparams.py`, where the hparams are defined. The priority order is: values passed via the command-line option `--hparams` override those in the json preset passed via `--preset`, which in turn override the defaults defined in `hparams.py`.
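For instance, a sketch of combining `--preset` with an `--hparams` override (the script and flags are those used elsewhere in this README; the overridden values are illustrative):

```bash
# Values from --hparams take priority over the json preset,
# which in turn takes priority over the defaults in hparams.py.
python train.py --data-root=${data_root} --use-gpu \
    --preset=presets/deepvoice3_ljspeech.json \
    --hparams="batch_size=8, nepochs=500"
```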
@@ -88,12 +88,12 @@ python preprocess.py \
Currently `${name}` only supports `ljspeech`. Support for more datasets will be added in the future.
Assuming that you use `presets/deepvoice3_ljspeech.json` as the preset for LJSpeech and the unzipped dataset is located at `./data/LJSpeech-1.1`, preprocess the data with the following command.
```bash
python preprocess.py \
    --preset=presets/deepvoice3_ljspeech.json \
    ljspeech ./data/LJSpeech-1.1/ ./data/ljspeech
```
When preprocessing is done, you will see the extracted features in `./data/ljspeech`, including the following files.
@@ -138,7 +138,7 @@ python train.py --data-root=${data-root} --use-gpu \
### Training with Multiple GPUs
This model supports data-parallel training with multiple GPUs, by launching `train.py` with the `paddle.distributed.launch` module.
```bash
python -m paddle.distributed.launch \
```
@@ -163,9 +163,9 @@ python -m paddle.distributed.launch \
In the example above, GPUs `2, 3, 4, 5` are set as visible. Then `--selected_gpus=0,1,2,3` selects the logical ids of the GPUs, which correspond to GPUs `2, 3, 4, 5`, respectively.
Model checkpoints (model parameters saved as `*.pdparams` files and optimizer states saved as `*.pdopt` files) are saved in `${directory_to_save_results}/checkpoints`. Layer-wise averaged attention alignments are saved as `.png` images, by default in `${directory_to_save_results}/checkpoints/alignment_ave`, and alignments for each attention layer are saved in `${directory_to_save_results}/checkpoints/alignment_layer{attention_layer_num}`. By default they are saved every 10000 steps for inspection.
Synthesis results for 6 given sentences are saved in `${directory_to_save_results}/checkpoints/eval`, including layer-wise averaged attention alignments saved as images named `step{step_num}_text{text_id}_single_alignment.png`, and the synthesized audio saved as files named `step{step_num}_text{text_id}_single_predicted.wav`.
### Monitoring Training with Tensorboard
@@ -200,7 +200,7 @@ A text-to-speech synthesis system typically consists of multiple stages, such as
According to [Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning](https://arxiv.org/abs/1710.07654), the position rate differs across datasets. There are two position rates, one for the query and one for the key, referred to as $\omega_1$ and $\omega_2$ in the paper; their names in the preset json are `query_position_rate` and `key_position_rate`, respectively.
For example, the `query_position_rate` and `key_position_rate` for the LJSpeech dataset are `1.0` and `1.385`, respectively. With `query_position_rate` fixed at `1.0`, the `key_position_rate` can be computed with `compute_timestamp_ratio.py` using the command below, where `${data_root}` is the path of the preprocessed dataset.
```bash
python compute_timestamp_ratio.py --preset=${preset_json_path} \
```