Commit 69b2a2b5, authored by lifuchen

modified README of transformer_tts and fastspeech, remove dygraph.guard()

Parent d1ba42ea
# Fastspeech
PaddlePaddle dynamic graph implementation of Fastspeech, a feed-forward network based on Transformer. The implementation is based on [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263).
## Dataset
......@@ -20,6 +21,7 @@ mel-spectrogram sequence for parallel mel-spectrogram generation. We use the Tra
The model consists of three parts: encoder, decoder and length regulator.
## Project Structure
```text
├── config # yaml configuration files
├── synthesis.py # script to synthesize waveform from text
......@@ -27,21 +29,26 @@ The model consists of encoder, decoder and length regulator three parts.
```
## Saving & Loading
`train.py` has 3 arguments in common, `--checkpoint`, `iteration` and `output`.
1. `output` is the directory for saving results.
During training, checkpoints are saved in `checkpoints/` in `output` and the tensorboard log is saved in `log/` in `output`.
During synthesis, results are saved in `samples/` in `output` and the tensorboard log is saved in `log/` in `output`.
`train_transformer.py` and `train_vocoder.py` have 3 arguments in common, `--checkpoint`, `--iteration` and `--output`.
1. `--output` is the directory for saving results.
During training, checkpoints are saved in `${output}/checkpoints` and tensorboard logs are saved in `${output}/log`.
During synthesis, results are saved in `${output}/samples` and tensorboard logs are saved in `${output}/log`.
2. `--checkpoint` is the path of a checkpoint and `--iteration` is the target step. They are used to load checkpoints in the following way.
- If `--checkpoint` is provided, the checkpoint specified by `--checkpoint` is loaded.
2. `--checkpoint` and `--iteration` are used for loading from an existing checkpoint. Loading an existing checkpoint follows this rule:
If `--checkpoint` is provided, the checkpoint specified by `--checkpoint` is loaded.
If `--checkpoint` is not provided, we try to load the model specified by `--iteration` from the checkpoint directory. If `--iteration` is not provided, we try to load the latest checkpoint from the checkpoint directory.
- If `--checkpoint` is not provided, we try to load the checkpoint of the target step specified by `--iteration` from the `${output}/checkpoints/` directory, e.g. if given `--iteration 120000`, the checkpoint `${output}/checkpoints/step-120000.*` will be loaded.
- If neither `--checkpoint` nor `--iteration` is provided, we try to load the latest checkpoint from the `${output}/checkpoints/` directory.
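The loading rule above can be sketched roughly as follows. This is only an illustration, not the actual `io.load_parameters` implementation: the helper name `resolve_checkpoint` is hypothetical, and the `step-<N>` naming is inferred from the `step-120000.*` example.

```python
import glob
import os
import re


def resolve_checkpoint(output, checkpoint=None, iteration=None):
    """Return the checkpoint prefix to load, or None if nothing is found.

    Hypothetical helper mirroring the rule in this README; it is not the
    project's own loader.
    """
    checkpoint_dir = os.path.join(output, "checkpoints")
    # Rule 1: an explicit --checkpoint always wins.
    if checkpoint is not None:
        return checkpoint
    # Rule 2: otherwise use the step given by --iteration.
    if iteration is not None:
        return os.path.join(checkpoint_dir, "step-{}".format(iteration))
    # Rule 3: otherwise pick the largest step found in the directory.
    steps = set()
    for path in glob.glob(os.path.join(checkpoint_dir, "step-*")):
        match = re.match(r"step-(\d+)", os.path.basename(path))
        if match:
            steps.add(int(match.group(1)))
    if not steps:
        return None
    return os.path.join(checkpoint_dir, "step-{}".format(max(steps)))
```

Comparing the step numbers as integers (rather than sorting file names as strings) keeps `step-2000` ahead of `step-100`.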
## Compute Phoneme Duration
A ground truth duration of each phoneme (number of frames in the spectrogram that correspond to that phoneme) should be provided when training a FastSpeech model.
We compute the ground truth duration of each phoneme in this way:
We compute the ground truth duration of each phoneme in the following way.
We extract the encoder-decoder attention alignment from a trained Transformer TTS model;
Each frame is considered to correspond to the phoneme that receives the most attention;
......@@ -56,12 +63,15 @@ python get_alignments.py \
--config=${CONFIG} \
--checkpoint_transformer=${CHECKPOINT} \
```
where `${DATAPATH}` is the path where the LJSpeech data is saved, `${CHECKPOINT}` is the path of the pre-trained TransformerTTS model, and `${CONFIG}` is the yaml config file of the TransformerTTS checkpoint. You need to prepare a pre-trained TransformerTTS checkpoint.
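As an illustration of the duration rule described earlier in this section (each frame is assigned to the phoneme that receives the most attention), here is a minimal pure-Python sketch. The function name `durations_from_alignment` and the `[frames][phonemes]` layout of the attention matrix are assumptions made for this example; `get_alignments.py` may organize the data differently.

```python
def durations_from_alignment(attention):
    """Count, for each phoneme, the frames that attend to it most.

    `attention` is a list of rows, one per mel frame; each row holds the
    attention weights over the phonemes (assumed layout).
    """
    num_phonemes = len(attention[0])
    durations = [0] * num_phonemes
    for frame_weights in attention:
        # Index of the phoneme with the largest attention weight.
        best = max(range(num_phonemes), key=lambda k: frame_weights[k])
        durations[best] += 1
    return durations


attention = [
    [0.9, 0.1, 0.0],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
]
print(durations_from_alignment(attention))  # [2, 1, 1]
```

Because every frame is counted exactly once, the durations sum to the number of mel frames.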
For more help on arguments:
For more help on arguments
``python get_alignments.py --help``.
Alternatively, you can use your own phoneme durations; you just need to process the data into the following format:
Alternatively, you can use your own phoneme durations; you just need to process the data into the following format.
```bash
{'fname1': alignment1,
'fname2': alignment2,
......@@ -70,7 +80,8 @@ Or you can use your own phoneme duration, you just need to process the data into
## Train FastSpeech
FastSpeech model can be trained with ``train.py``.
FastSpeech model can be trained by running ``train.py``.
```bash
python train.py \
--use_gpu=1 \
......@@ -79,11 +90,14 @@ python train.py \
--output='./experiment' \
--config='configs/ljspeech.yaml' \
```
Or you can run the script file directly.
```bash
sh train.sh
```
If you want to train on multiple GPUs, start training as follows:
If you want to train on multiple GPUs, start training in the following way.
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3
......@@ -94,13 +108,17 @@ python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog tr
--output='./experiment' \
--config='configs/ljspeech.yaml' \
```
If you wish to resume from an existing model, see [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading.
For more help on arguments:
For more help on arguments
``python train.py --help``.
## Synthesis
After training the FastSpeech, audio can be synthesized with ``synthesis.py``.
After training the FastSpeech, audio can be synthesized by running ``synthesis.py``.
```bash
python synthesis.py \
--use_gpu=1 \
......@@ -111,12 +129,15 @@ python synthesis.py \
--checkpoint_clarinet='../clarinet/checkpoint/step-500000' \
--output='./synthesis' \
```
We use Clarinet to synthesize the waveform, so you need to prepare a pre-trained [Clarinet checkpoint](https://paddlespeech.bj.bcebos.com/Parakeet/clarinet_ljspeech_ckpt_1.0.zip).
Or you can run the script file directly.
```bash
sh synthesis.sh
```
For more help on arguments:
For more help on arguments
``python synthesis.py --help``.
......@@ -61,6 +61,7 @@ def add_config_options_to_parser(parser):
def synthesis(text_input, args):
local_rank = dg.parallel.Env().local_rank
place = (fluid.CUDAPlace(local_rank) if args.use_gpu else fluid.CPUPlace())
fluid.enable_dygraph(place)
with open(args.config) as f:
cfg = yaml.load(f, Loader=yaml.Loader)
......@@ -71,56 +72,53 @@ def synthesis(text_input, args):
writer = SummaryWriter(os.path.join(args.output, 'log'))
with dg.guard(place):
model = FastSpeech(cfg['network'], num_mels=cfg['audio']['num_mels'])
# Load parameters.
global_step = io.load_parameters(
model=model, checkpoint_path=args.checkpoint)
model.eval()
text = np.asarray(text_to_sequence(text_input))
text = np.expand_dims(text, axis=0)
pos_text = np.arange(1, text.shape[1] + 1)
pos_text = np.expand_dims(pos_text, axis=0)
text = dg.to_variable(text)
pos_text = dg.to_variable(pos_text)
_, mel_output_postnet = model(text, pos_text, alpha=args.alpha)
result = np.exp(mel_output_postnet.numpy())
mel_output_postnet = fluid.layers.transpose(
fluid.layers.squeeze(mel_output_postnet, [0]), [1, 0])
mel_output_postnet = np.exp(mel_output_postnet.numpy())
basis = librosa.filters.mel(cfg['audio']['sr'], cfg['audio']['n_fft'],
cfg['audio']['num_mels'])
inv_basis = np.linalg.pinv(basis)
spec = np.maximum(1e-10, np.dot(inv_basis, mel_output_postnet))
# synthesis use clarinet
wav_clarinet = synthesis_with_clarinet(
args.config_clarinet, args.checkpoint_clarinet, result, place)
writer.add_audio(text_input + '(clarinet)', wav_clarinet, 0,
cfg['audio']['sr'])
if not os.path.exists(os.path.join(args.output, 'samples')):
os.mkdir(os.path.join(args.output, 'samples'))
write(
os.path.join(
os.path.join(args.output, 'samples'), 'clarinet.wav'),
cfg['audio']['sr'], wav_clarinet)
#synthesis use griffin-lim
wav = librosa.core.griffinlim(
spec**cfg['audio']['power'],
hop_length=cfg['audio']['hop_length'],
win_length=cfg['audio']['win_length'])
writer.add_audio(text_input + '(griffin-lim)', wav, 0,
cfg['audio']['sr'])
write(
os.path.join(
os.path.join(args.output, 'samples'), 'grinffin-lim.wav'),
cfg['audio']['sr'], wav)
print("Synthesis completed !!!")
model = FastSpeech(cfg['network'], num_mels=cfg['audio']['num_mels'])
# Load parameters.
global_step = io.load_parameters(
model=model, checkpoint_path=args.checkpoint)
model.eval()
text = np.asarray(text_to_sequence(text_input))
text = np.expand_dims(text, axis=0)
pos_text = np.arange(1, text.shape[1] + 1)
pos_text = np.expand_dims(pos_text, axis=0)
text = dg.to_variable(text)
pos_text = dg.to_variable(pos_text)
_, mel_output_postnet = model(text, pos_text, alpha=args.alpha)
result = np.exp(mel_output_postnet.numpy())
mel_output_postnet = fluid.layers.transpose(
fluid.layers.squeeze(mel_output_postnet, [0]), [1, 0])
mel_output_postnet = np.exp(mel_output_postnet.numpy())
basis = librosa.filters.mel(cfg['audio']['sr'], cfg['audio']['n_fft'],
cfg['audio']['num_mels'])
inv_basis = np.linalg.pinv(basis)
spec = np.maximum(1e-10, np.dot(inv_basis, mel_output_postnet))
# synthesis use clarinet
wav_clarinet = synthesis_with_clarinet(
args.config_clarinet, args.checkpoint_clarinet, result, place)
writer.add_audio(text_input + '(clarinet)', wav_clarinet, 0,
cfg['audio']['sr'])
if not os.path.exists(os.path.join(args.output, 'samples')):
os.mkdir(os.path.join(args.output, 'samples'))
write(
os.path.join(os.path.join(args.output, 'samples'), 'clarinet.wav'),
cfg['audio']['sr'], wav_clarinet)
#synthesis use griffin-lim
wav = librosa.core.griffinlim(
spec**cfg['audio']['power'],
hop_length=cfg['audio']['hop_length'],
win_length=cfg['audio']['win_length'])
writer.add_audio(text_input + '(griffin-lim)', wav, 0, cfg['audio']['sr'])
write(
os.path.join(
os.path.join(args.output, 'samples'), 'grinffin-lim.wav'),
cfg['audio']['sr'], wav)
print("Synthesis completed !!!")
writer.close()
......
......@@ -63,6 +63,7 @@ def main(args):
global_step = 0
place = fluid.CUDAPlace(local_rank) if args.use_gpu else fluid.CPUPlace()
fluid.enable_dygraph(place)
if not os.path.exists(args.output):
os.mkdir(args.output)
......@@ -70,94 +71,87 @@ def main(args):
writer = SummaryWriter(os.path.join(args.output,
'log')) if local_rank == 0 else None
with dg.guard(place):
model = FastSpeech(cfg['network'], num_mels=cfg['audio']['num_mels'])
model.train()
optimizer = fluid.optimizer.AdamOptimizer(
learning_rate=dg.NoamDecay(1 /
(cfg['train']['warm_up_step'] *
model = FastSpeech(cfg['network'], num_mels=cfg['audio']['num_mels'])
model.train()
optimizer = fluid.optimizer.AdamOptimizer(
learning_rate=dg.NoamDecay(1 / (cfg['train']['warm_up_step'] *
(cfg['train']['learning_rate']**2)),
cfg['train']['warm_up_step']),
parameter_list=model.parameters(),
grad_clip=fluid.clip.GradientClipByGlobalNorm(cfg['train'][
'grad_clip_thresh']))
reader = LJSpeechLoader(
cfg['audio'],
place,
args.data,
args.alignments_path,
cfg['train']['batch_size'],
nranks,
local_rank,
shuffle=True).reader()
# Load parameters.
global_step = io.load_parameters(
model=model,
optimizer=optimizer,
checkpoint_dir=os.path.join(args.output, 'checkpoints'),
iteration=args.iteration,
checkpoint_path=args.checkpoint)
print("Rank {}: checkpoint loaded.".format(local_rank))
if parallel:
strategy = dg.parallel.prepare_context()
model = fluid.dygraph.parallel.DataParallel(model, strategy)
for epoch in range(cfg['train']['max_epochs']):
pbar = tqdm(reader)
for i, data in enumerate(pbar):
pbar.set_description('Processing at epoch %d' % epoch)
(character, mel, pos_text, pos_mel, alignment) = data
global_step += 1
#Forward
result = model(
character,
pos_text,
mel_pos=pos_mel,
length_target=alignment)
mel_output, mel_output_postnet, duration_predictor_output, _, _ = result
mel_loss = layers.mse_loss(mel_output, mel)
mel_postnet_loss = layers.mse_loss(mel_output_postnet, mel)
duration_loss = layers.mean(
layers.abs(
layers.elementwise_sub(duration_predictor_output,
alignment)))
total_loss = mel_loss + mel_postnet_loss + duration_loss
if local_rank == 0:
writer.add_scalar('mel_loss',
mel_loss.numpy(), global_step)
writer.add_scalar('post_mel_loss',
mel_postnet_loss.numpy(), global_step)
writer.add_scalar('duration_loss',
duration_loss.numpy(), global_step)
writer.add_scalar('learning_rate',
optimizer._learning_rate.step().numpy(),
global_step)
if parallel:
total_loss = model.scale_loss(total_loss)
total_loss.backward()
model.apply_collective_grads()
else:
total_loss.backward()
optimizer.minimize(total_loss)
model.clear_gradients()
# save checkpoint
if local_rank == 0 and global_step % cfg['train'][
'checkpoint_interval'] == 0:
io.save_parameters(
os.path.join(args.output, 'checkpoints'), global_step,
model, optimizer)
if local_rank == 0:
writer.close()
cfg['train']['warm_up_step']),
parameter_list=model.parameters(),
grad_clip=fluid.clip.GradientClipByGlobalNorm(cfg['train'][
'grad_clip_thresh']))
reader = LJSpeechLoader(
cfg['audio'],
place,
args.data,
args.alignments_path,
cfg['train']['batch_size'],
nranks,
local_rank,
shuffle=True).reader()
# Load parameters.
global_step = io.load_parameters(
model=model,
optimizer=optimizer,
checkpoint_dir=os.path.join(args.output, 'checkpoints'),
iteration=args.iteration,
checkpoint_path=args.checkpoint)
print("Rank {}: checkpoint loaded.".format(local_rank))
if parallel:
strategy = dg.parallel.prepare_context()
model = fluid.dygraph.parallel.DataParallel(model, strategy)
for epoch in range(cfg['train']['max_epochs']):
pbar = tqdm(reader)
for i, data in enumerate(pbar):
pbar.set_description('Processing at epoch %d' % epoch)
(character, mel, pos_text, pos_mel, alignment) = data
global_step += 1
#Forward
result = model(
character, pos_text, mel_pos=pos_mel, length_target=alignment)
mel_output, mel_output_postnet, duration_predictor_output, _, _ = result
mel_loss = layers.mse_loss(mel_output, mel)
mel_postnet_loss = layers.mse_loss(mel_output_postnet, mel)
duration_loss = layers.mean(
layers.abs(
layers.elementwise_sub(duration_predictor_output,
alignment)))
total_loss = mel_loss + mel_postnet_loss + duration_loss
if local_rank == 0:
writer.add_scalar('mel_loss', mel_loss.numpy(), global_step)
writer.add_scalar('post_mel_loss',
mel_postnet_loss.numpy(), global_step)
writer.add_scalar('duration_loss',
duration_loss.numpy(), global_step)
writer.add_scalar('learning_rate',
optimizer._learning_rate.step().numpy(),
global_step)
if parallel:
total_loss = model.scale_loss(total_loss)
total_loss.backward()
model.apply_collective_grads()
else:
total_loss.backward()
optimizer.minimize(total_loss)
model.clear_gradients()
# save checkpoint
if local_rank == 0 and global_step % cfg['train'][
'checkpoint_interval'] == 0:
io.save_parameters(
os.path.join(args.output, 'checkpoints'), global_step,
model, optimizer)
if local_rank == 0:
writer.close()
if __name__ == '__main__':
......
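The optimizer above passes `1 / (warm_up_step * learning_rate**2)` as the first argument of `dg.NoamDecay`. Assuming the usual Noam schedule, `lr = d_model**-0.5 * min(step**-0.5, step * warmup**-1.5)`, this choice makes the schedule peak at exactly `learning_rate` when `step == warm_up_step`. The sketch below checks that arithmetic in plain Python; `noam_lr` is an illustrative helper rather than Paddle's implementation, and the two config values are made up.

```python
def noam_lr(step, d_model, warmup_steps):
    """Noam schedule: linear warm-up followed by inverse-sqrt decay."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# Made-up values standing in for cfg['train']['warm_up_step'] and
# cfg['train']['learning_rate'].
warm_up_step = 4000
learning_rate = 0.001

# Same expression the training script passes as the first NoamDecay argument.
d_model = 1 / (warm_up_step * learning_rate ** 2)

for step in (1, 1000, 4000, 16000):
    print(step, noam_lr(step, d_model, warm_up_step))
```

With these values the rate rises linearly up to step 4000, where it reaches about 0.001, and then decays with the inverse square root of the step.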
# TransformerTTS
PaddlePaddle dynamic graph implementation of TransformerTTS, a neural TTS with Transformer. The implementation is based on [Neural Speech Synthesis with Transformer Network](https://arxiv.org/abs/1809.08895).
## Dataset
......@@ -9,7 +10,9 @@ We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://k
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```
## Model Architecture
<div align="center" name="TransformerTTS model architecture">
<img src="./images/model_architecture.jpg" width=400 height=600 /> <br>
</div>
......@@ -20,6 +23,7 @@ TransformerTTS model architecture
The model adopts the multi-head attention mechanism to replace the RNN structures and also the original attention mechanism in [Tacotron2](https://arxiv.org/abs/1712.05884). The model consists of two main parts, encoder and decoder. We also implement the CBHG model of Tacotron as the vocoder part and convert the spectrogram into raw wave using Griffin-Lim algorithm.
## Project Structure
```text
├── config # yaml configuration files
├── data.py # dataset and dataloader settings for LJSpeech
......@@ -27,20 +31,27 @@ The model adopts the multi-head attention mechanism to replace the RNN structure
├── train_transformer.py # script for transformer model training
├── train_vocoder.py # script for vocoder model training
```
## Saving & Loading
`train_transformer.py` and `train_vocoder.py` have 3 arguments in common, `--checkpoint`, `iteration` and `output`.
1. `output` is the directory for saving results.
During training, checkpoints are saved in `checkpoints/` in `output` and the tensorboard log is saved in `log/` in `output`.
During synthesis, results are saved in `samples/` in `output` and the tensorboard log is saved in `log/` in `output`.
`train_transformer.py` and `train_vocoder.py` have 3 arguments in common, `--checkpoint`, `--iteration` and `--output`.
1. `--output` is the directory for saving results.
During training, checkpoints are saved in `${output}/checkpoints` and tensorboard logs are saved in `${output}/log`.
During synthesis, results are saved in `${output}/samples` and tensorboard logs are saved in `${output}/log`.
2. `--checkpoint` is the path of a checkpoint and `--iteration` is the target step. They are used to load checkpoints in the following way.
2. `--checkpoint` and `--iteration` are used for loading from an existing checkpoint. Loading an existing checkpoint follows this rule:
If `--checkpoint` is provided, the checkpoint specified by `--checkpoint` is loaded.
If `--checkpoint` is not provided, we try to load the model specified by `--iteration` from the checkpoint directory. If `--iteration` is not provided, we try to load the latest checkpoint from the checkpoint directory.
- If `--checkpoint` is provided, the checkpoint specified by `--checkpoint` is loaded.
- If `--checkpoint` is not provided, we try to load the checkpoint of the target step specified by `--iteration` from the `${output}/checkpoints/` directory, e.g. if given `--iteration 120000`, the checkpoint `${output}/checkpoints/step-120000.*` will be loaded.
- If neither `--checkpoint` nor `--iteration` is provided, we try to load the latest checkpoint from the `${output}/checkpoints/` directory.
## Train Transformer
TransformerTTS model can be trained with ``train_transformer.py``.
TransformerTTS model can be trained by running ``train_transformer.py``.
```bash
python train_transformer.py \
--use_gpu=1 \
......@@ -48,11 +59,14 @@ python train_trasformer.py \
--output='./experiment' \
--config='configs/ljspeech.yaml' \
```
Or you can run the script file directly.
```bash
sh train_transformer.sh
```
If you want to train on multiple GPUs, you must start training as follows:
If you want to train on multiple GPUs, you must start training in the following way.
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3
......@@ -67,11 +81,14 @@ If you wish to resume from an existing model, See [Saving-&-Loading](#Saving-&-L
**Note: To ensure training quality, we recommend multi-GPU training to enlarge the batch size, with at least 16 samples per batch on each GPU.**
For more help on arguments:
For more help on arguments
``python train_transformer.py --help``.
## Train Vocoder
Vocoder model can be trained with ``train_vocoder.py``.
Vocoder model can be trained by running ``train_vocoder.py``.
```bash
python train_vocoder.py \
--use_gpu=1 \
......@@ -79,11 +96,14 @@ python train_vocoder.py \
--output='./vocoder' \
--config='configs/ljspeech.yaml' \
```
Or you can run the script file directly.
```bash
sh train_vocoder.sh
```
If you want to train on multiple GPUs, you must start training as follows:
If you want to train on multiple GPUs, you must start training in the following way.
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3
......@@ -93,13 +113,17 @@ python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog tr
--output='./vocoder' \
--config='configs/ljspeech.yaml' \
```
If you wish to resume from an existing model, see [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading.
For more help on arguments:
For more help on arguments
``python train_vocoder.py --help``.
## Synthesis
After training the TransformerTTS and vocoder model, audio can be synthesized with ``synthesis.py``.
After training the TransformerTTS and vocoder model, audio can be synthesized by running ``synthesis.py``.
```bash
python synthesis.py \
--max_len=300 \
......@@ -111,9 +135,11 @@ python synthesis.py \
```
Or you can run the script file directly.
```bash
sh synthesis.sh
```
For more help on arguments:
For more help on arguments
``python synthesis.py --help``.
......@@ -69,99 +69,97 @@ def synthesis(text_input, args):
writer = SummaryWriter(os.path.join(args.output, 'log'))
with dg.guard(place):
with fluid.unique_name.guard():
network_cfg = cfg['network']
model = TransformerTTS(
network_cfg['embedding_size'], network_cfg['hidden_size'],
network_cfg['encoder_num_head'],
network_cfg['encoder_n_layers'], cfg['audio']['num_mels'],
network_cfg['outputs_per_step'],
network_cfg['decoder_num_head'],
network_cfg['decoder_n_layers'])
# Load parameters.
global_step = io.load_parameters(
model=model, checkpoint_path=args.checkpoint_transformer)
model.eval()
with fluid.unique_name.guard():
model_vocoder = Vocoder(
cfg['train']['batch_size'], cfg['vocoder']['hidden_size'],
cfg['audio']['num_mels'], cfg['audio']['n_fft'])
# Load parameters.
global_step = io.load_parameters(
model=model_vocoder, checkpoint_path=args.checkpoint_vocoder)
model_vocoder.eval()
# init input
text = np.asarray(text_to_sequence(text_input))
text = fluid.layers.unsqueeze(dg.to_variable(text), [0])
mel_input = dg.to_variable(np.zeros([1, 1, 80])).astype(np.float32)
pos_text = np.arange(1, text.shape[1] + 1)
pos_text = fluid.layers.unsqueeze(dg.to_variable(pos_text), [0])
pbar = tqdm(range(args.max_len))
for i in pbar:
pos_mel = np.arange(1, mel_input.shape[1] + 1)
pos_mel = fluid.layers.unsqueeze(dg.to_variable(pos_mel), [0])
mel_pred, postnet_pred, attn_probs, stop_preds, attn_enc, attn_dec = model(
text, mel_input, pos_text, pos_mel)
mel_input = fluid.layers.concat(
[mel_input, postnet_pred[:, -1:, :]], axis=1)
mag_pred = model_vocoder(postnet_pred)
_ljspeech_processor = audio.AudioProcessor(
sample_rate=cfg['audio']['sr'],
num_mels=cfg['audio']['num_mels'],
min_level_db=cfg['audio']['min_level_db'],
ref_level_db=cfg['audio']['ref_level_db'],
n_fft=cfg['audio']['n_fft'],
win_length=cfg['audio']['win_length'],
hop_length=cfg['audio']['hop_length'],
power=cfg['audio']['power'],
preemphasis=cfg['audio']['preemphasis'],
signal_norm=True,
symmetric_norm=False,
max_norm=1.,
mel_fmin=0,
mel_fmax=None,
clip_norm=True,
griffin_lim_iters=60,
do_trim_silence=False,
sound_norm=False)
# synthesis with cbhg
wav = _ljspeech_processor.inv_spectrogram(
fluid.layers.transpose(
fluid.layers.squeeze(mag_pred, [0]), [1, 0]).numpy())
global_step = 0
for i, prob in enumerate(attn_probs):
for j in range(4):
x = np.uint8(cm.viridis(prob.numpy()[j]) * 255)
writer.add_image(
'Attention_%d_0' % global_step,
x,
i * 4 + j,
dataformats="HWC")
writer.add_audio(text_input + '(cbhg)', wav, 0, cfg['audio']['sr'])
if not os.path.exists(os.path.join(args.output, 'samples')):
os.mkdir(os.path.join(args.output, 'samples'))
write(
os.path.join(os.path.join(args.output, 'samples'), 'cbhg.wav'),
cfg['audio']['sr'], wav)
# synthesis with griffin-lim
wav = _ljspeech_processor.inv_melspectrogram(
fluid.layers.transpose(
fluid.layers.squeeze(postnet_pred, [0]), [1, 0]).numpy())
writer.add_audio(text_input + '(griffin)', wav, 0, cfg['audio']['sr'])
write(
os.path.join(os.path.join(args.output, 'samples'), 'griffin.wav'),
cfg['audio']['sr'], wav)
print("Synthesis completed !!!")
fluid.enable_dygraph(place)
with fluid.unique_name.guard():
network_cfg = cfg['network']
model = TransformerTTS(
network_cfg['embedding_size'], network_cfg['hidden_size'],
network_cfg['encoder_num_head'], network_cfg['encoder_n_layers'],
cfg['audio']['num_mels'], network_cfg['outputs_per_step'],
network_cfg['decoder_num_head'], network_cfg['decoder_n_layers'])
# Load parameters.
global_step = io.load_parameters(
model=model, checkpoint_path=args.checkpoint_transformer)
model.eval()
with fluid.unique_name.guard():
model_vocoder = Vocoder(
cfg['train']['batch_size'], cfg['vocoder']['hidden_size'],
cfg['audio']['num_mels'], cfg['audio']['n_fft'])
# Load parameters.
global_step = io.load_parameters(
model=model_vocoder, checkpoint_path=args.checkpoint_vocoder)
model_vocoder.eval()
# init input
text = np.asarray(text_to_sequence(text_input))
text = fluid.layers.unsqueeze(dg.to_variable(text), [0])
mel_input = dg.to_variable(np.zeros([1, 1, 80])).astype(np.float32)
pos_text = np.arange(1, text.shape[1] + 1)
pos_text = fluid.layers.unsqueeze(dg.to_variable(pos_text), [0])
pbar = tqdm(range(args.max_len))
for i in pbar:
pos_mel = np.arange(1, mel_input.shape[1] + 1)
pos_mel = fluid.layers.unsqueeze(dg.to_variable(pos_mel), [0])
mel_pred, postnet_pred, attn_probs, stop_preds, attn_enc, attn_dec = model(
text, mel_input, pos_text, pos_mel)
mel_input = fluid.layers.concat(
[mel_input, postnet_pred[:, -1:, :]], axis=1)
mag_pred = model_vocoder(postnet_pred)
_ljspeech_processor = audio.AudioProcessor(
sample_rate=cfg['audio']['sr'],
num_mels=cfg['audio']['num_mels'],
min_level_db=cfg['audio']['min_level_db'],
ref_level_db=cfg['audio']['ref_level_db'],
n_fft=cfg['audio']['n_fft'],
win_length=cfg['audio']['win_length'],
hop_length=cfg['audio']['hop_length'],
power=cfg['audio']['power'],
preemphasis=cfg['audio']['preemphasis'],
signal_norm=True,
symmetric_norm=False,
max_norm=1.,
mel_fmin=0,
mel_fmax=None,
clip_norm=True,
griffin_lim_iters=60,
do_trim_silence=False,
sound_norm=False)
# synthesis with cbhg
wav = _ljspeech_processor.inv_spectrogram(
fluid.layers.transpose(fluid.layers.squeeze(mag_pred, [0]), [1, 0])
.numpy())
global_step = 0
for i, prob in enumerate(attn_probs):
for j in range(4):
x = np.uint8(cm.viridis(prob.numpy()[j]) * 255)
writer.add_image(
'Attention_%d_0' % global_step,
x,
i * 4 + j,
dataformats="HWC")
writer.add_audio(text_input + '(cbhg)', wav, 0, cfg['audio']['sr'])
if not os.path.exists(os.path.join(args.output, 'samples')):
os.mkdir(os.path.join(args.output, 'samples'))
write(
os.path.join(os.path.join(args.output, 'samples'), 'cbhg.wav'),
cfg['audio']['sr'], wav)
# synthesis with griffin-lim
wav = _ljspeech_processor.inv_melspectrogram(
fluid.layers.transpose(
fluid.layers.squeeze(postnet_pred, [0]), [1, 0]).numpy())
writer.add_audio(text_input + '(griffin)', wav, 0, cfg['audio']['sr'])
write(
os.path.join(os.path.join(args.output, 'samples'), 'griffin.wav'),
cfg['audio']['sr'], wav)
print("Synthesis completed !!!")
writer.close()
......
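The synthesis loop above is autoregressive: it starts from a single all-zero mel frame, runs the model on everything generated so far, and appends only the last predicted frame before the next step. A minimal pure-Python sketch of that pattern follows; `toy_model` is a made-up stand-in for `TransformerTTS`, and frames are plain lists instead of tensors.

```python
def toy_model(mel_input):
    """Stand-in model: predicts one output frame per input frame."""
    # Each predicted frame is the input frame with 1.0 added to every bin.
    return [[value + 1.0 for value in frame] for frame in mel_input]


def autoregressive_decode(model, num_mels, max_len):
    # Start from one all-zero frame, like np.zeros([1, 1, 80]) in the script.
    mel_input = [[0.0] * num_mels]
    prediction = []
    for _ in range(max_len):
        prediction = model(mel_input)
        # Feed back only the newest frame, as postnet_pred[:, -1:, :] does.
        mel_input = mel_input + [prediction[-1]]
    return prediction


frames = autoregressive_decode(toy_model, num_mels=2, max_len=3)
print(len(frames))  # 3
```

As in the script shown here, the loop runs for a fixed number of steps (`--max_len`) and does not consult the stop token predictions to end early.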
......@@ -65,148 +65,145 @@ def main(args):
writer = SummaryWriter(os.path.join(args.output,
'log')) if local_rank == 0 else None
with dg.guard(place):
network_cfg = cfg['network']
model = TransformerTTS(
network_cfg['embedding_size'], network_cfg['hidden_size'],
network_cfg['encoder_num_head'], network_cfg['encoder_n_layers'],
cfg['audio']['num_mels'], network_cfg['outputs_per_step'],
network_cfg['decoder_num_head'], network_cfg['decoder_n_layers'])
model.train()
optimizer = fluid.optimizer.AdamOptimizer(
learning_rate=dg.NoamDecay(1 /
(cfg['train']['warm_up_step'] *
fluid.enable_dygraph(place)
network_cfg = cfg['network']
model = TransformerTTS(
network_cfg['embedding_size'], network_cfg['hidden_size'],
network_cfg['encoder_num_head'], network_cfg['encoder_n_layers'],
cfg['audio']['num_mels'], network_cfg['outputs_per_step'],
network_cfg['decoder_num_head'], network_cfg['decoder_n_layers'])
model.train()
optimizer = fluid.optimizer.AdamOptimizer(
learning_rate=dg.NoamDecay(1 / (cfg['train']['warm_up_step'] *
(cfg['train']['learning_rate']**2)),
cfg['train']['warm_up_step']),
parameter_list=model.parameters(),
grad_clip=fluid.clip.GradientClipByGlobalNorm(cfg['train'][
'grad_clip_thresh']))
# Load parameters.
global_step = io.load_parameters(
model=model,
optimizer=optimizer,
checkpoint_dir=os.path.join(args.output, 'checkpoints'),
iteration=args.iteration,
checkpoint_path=args.checkpoint)
print("Rank {}: checkpoint loaded.".format(local_rank))
if parallel:
strategy = dg.parallel.prepare_context()
model = fluid.dygraph.parallel.DataParallel(model, strategy)
reader = LJSpeechLoader(
cfg['audio'],
place,
args.data,
cfg['train']['batch_size'],
nranks,
local_rank,
shuffle=True).reader()
for epoch in range(cfg['train']['max_epochs']):
pbar = tqdm(reader)
for i, data in enumerate(pbar):
pbar.set_description('Processing at epoch %d' % epoch)
character, mel, mel_input, pos_text, pos_mel = data
global_step += 1
mel_pred, postnet_pred, attn_probs, stop_preds, attn_enc, attn_dec = model(
character, mel_input, pos_text, pos_mel)
mel_loss = layers.mean(
layers.abs(layers.elementwise_sub(mel_pred, mel)))
post_mel_loss = layers.mean(
layers.abs(layers.elementwise_sub(postnet_pred, mel)))
loss = mel_loss + post_mel_loss
# Note: when the stop token loss was used, training did not work.
if cfg['network']['stop_token']:
label = (pos_mel == 0).astype(np.float32)
stop_loss = cross_entropy(stop_preds, label)
loss = loss + stop_loss
if local_rank == 0:
writer.add_scalars('training_loss', {
'mel_loss': mel_loss.numpy(),
'post_mel_loss': post_mel_loss.numpy()
}, global_step)
cfg['train']['warm_up_step']),
parameter_list=model.parameters(),
grad_clip=fluid.clip.GradientClipByGlobalNorm(cfg['train'][
'grad_clip_thresh']))
# Load parameters.
global_step = io.load_parameters(
model=model,
optimizer=optimizer,
checkpoint_dir=os.path.join(args.output, 'checkpoints'),
iteration=args.iteration,
checkpoint_path=args.checkpoint)
print("Rank {}: checkpoint loaded.".format(local_rank))
if parallel:
strategy = dg.parallel.prepare_context()
model = fluid.dygraph.parallel.DataParallel(model, strategy)
reader = LJSpeechLoader(
cfg['audio'],
place,
args.data,
cfg['train']['batch_size'],
nranks,
local_rank,
shuffle=True).reader()
for epoch in range(cfg['train']['max_epochs']):
pbar = tqdm(reader)
for i, data in enumerate(pbar):
pbar.set_description('Processing at epoch %d' % epoch)
character, mel, mel_input, pos_text, pos_mel = data
global_step += 1
mel_pred, postnet_pred, attn_probs, stop_preds, attn_enc, attn_dec = model(
character, mel_input, pos_text, pos_mel)
mel_loss = layers.mean(
layers.abs(layers.elementwise_sub(mel_pred, mel)))
post_mel_loss = layers.mean(
layers.abs(layers.elementwise_sub(postnet_pred, mel)))
loss = mel_loss + post_mel_loss
# Note: when the stop token loss was used, training did not work.
if cfg['network']['stop_token']:
label = (pos_mel == 0).astype(np.float32)
stop_loss = cross_entropy(stop_preds, label)
loss = loss + stop_loss
if local_rank == 0:
writer.add_scalars('training_loss', {
'mel_loss': mel_loss.numpy(),
'post_mel_loss': post_mel_loss.numpy()
}, global_step)
if cfg['network']['stop_token']:
writer.add_scalar('stop_loss',
stop_loss.numpy(), global_step)
if parallel:
writer.add_scalars('alphas', {
'encoder_alpha':
model._layers.encoder.alpha.numpy(),
'decoder_alpha':
model._layers.decoder.alpha.numpy(),
}, global_step)
else:
writer.add_scalars('alphas', {
'encoder_alpha': model.encoder.alpha.numpy(),
'decoder_alpha': model.decoder.alpha.numpy(),
}, global_step)
writer.add_scalar('learning_rate',
optimizer._learning_rate.step().numpy(),
global_step)
if global_step % cfg['train']['image_interval'] == 1:
for i, prob in enumerate(attn_probs):
for j in range(cfg['network']['decoder_num_head']):
x = np.uint8(
cm.viridis(prob.numpy()[j * cfg['train'][
'batch_size'] // 2]) * 255)
writer.add_image(
'Attention_%d_0' % global_step,
x,
i * 4 + j,
dataformats="HWC")
for i, prob in enumerate(attn_enc):
for j in range(cfg['network']['encoder_num_head']):
x = np.uint8(
cm.viridis(prob.numpy()[j * cfg['train'][
'batch_size'] // 2]) * 255)
writer.add_image(
'Attention_enc_%d_0' % global_step,
x,
i * 4 + j,
dataformats="HWC")
for i, prob in enumerate(attn_dec):
for j in range(cfg['network']['decoder_num_head']):
x = np.uint8(
cm.viridis(prob.numpy()[j * cfg['train'][
'batch_size'] // 2]) * 255)
writer.add_image(
'Attention_dec_%d_0' % global_step,
x,
i * 4 + j,
dataformats="HWC")
if cfg['network']['stop_token']:
writer.add_scalar('stop_loss',
stop_loss.numpy(), global_step)
if parallel:
loss = model.scale_loss(loss)
loss.backward()
model.apply_collective_grads()
writer.add_scalars('alphas', {
'encoder_alpha': model._layers.encoder.alpha.numpy(),
'decoder_alpha': model._layers.decoder.alpha.numpy(),
}, global_step)
else:
loss.backward()
optimizer.minimize(loss)
model.clear_gradients()
# save checkpoint
if local_rank == 0 and global_step % cfg['train'][
'checkpoint_interval'] == 0:
io.save_parameters(
os.path.join(args.output, 'checkpoints'), global_step,
model, optimizer)
if local_rank == 0:
writer.close()
writer.add_scalars('alphas', {
'encoder_alpha': model.encoder.alpha.numpy(),
'decoder_alpha': model.decoder.alpha.numpy(),
}, global_step)
writer.add_scalar('learning_rate',
optimizer._learning_rate.step().numpy(),
global_step)
if global_step % cfg['train']['image_interval'] == 1:
for i, prob in enumerate(attn_probs):
for j in range(cfg['network']['decoder_num_head']):
x = np.uint8(
cm.viridis(prob.numpy()[j * cfg['train'][
'batch_size'] // 2]) * 255)
writer.add_image(
'Attention_%d_0' % global_step,
x,
i * 4 + j,
dataformats="HWC")
for i, prob in enumerate(attn_enc):
for j in range(cfg['network']['encoder_num_head']):
x = np.uint8(
cm.viridis(prob.numpy()[j * cfg['train'][
'batch_size'] // 2]) * 255)
writer.add_image(
'Attention_enc_%d_0' % global_step,
x,
i * 4 + j,
dataformats="HWC")
for i, prob in enumerate(attn_dec):
for j in range(cfg['network']['decoder_num_head']):
x = np.uint8(
cm.viridis(prob.numpy()[j * cfg['train'][
'batch_size'] // 2]) * 255)
writer.add_image(
'Attention_dec_%d_0' % global_step,
x,
i * 4 + j,
dataformats="HWC")
if parallel:
loss = model.scale_loss(loss)
loss.backward()
model.apply_collective_grads()
else:
loss.backward()
optimizer.minimize(loss)
model.clear_gradients()
# save checkpoint
if local_rank == 0 and global_step % cfg['train'][
'checkpoint_interval'] == 0:
io.save_parameters(
os.path.join(args.output, 'checkpoints'), global_step,
model, optimizer)
if local_rank == 0:
writer.close()
if __name__ == '__main__':
......
......@@ -63,79 +63,76 @@ def main(args):
writer = SummaryWriter(os.path.join(args.output,
'log')) if local_rank == 0 else None
with dg.guard(place):
model = Vocoder(cfg['train']['batch_size'],
cfg['vocoder']['hidden_size'],
cfg['audio']['num_mels'], cfg['audio']['n_fft'])
model.train()
optimizer = fluid.optimizer.AdamOptimizer(
learning_rate=dg.NoamDecay(1 /
(cfg['train']['warm_up_step'] *
fluid.enable_dygraph(place)
model = Vocoder(cfg['train']['batch_size'], cfg['vocoder']['hidden_size'],
cfg['audio']['num_mels'], cfg['audio']['n_fft'])
model.train()
optimizer = fluid.optimizer.AdamOptimizer(
learning_rate=dg.NoamDecay(1 / (cfg['train']['warm_up_step'] *
(cfg['train']['learning_rate']**2)),
cfg['train']['warm_up_step']),
parameter_list=model.parameters(),
grad_clip=fluid.clip.GradientClipByGlobalNorm(cfg['train'][
'grad_clip_thresh']))
# Load parameters.
global_step = io.load_parameters(
model=model,
optimizer=optimizer,
checkpoint_dir=os.path.join(args.output, 'checkpoints'),
iteration=args.iteration,
checkpoint_path=args.checkpoint)
print("Rank {}: checkpoint loaded.".format(local_rank))
if parallel:
strategy = dg.parallel.prepare_context()
model = fluid.dygraph.parallel.DataParallel(model, strategy)
reader = LJSpeechLoader(
cfg['audio'],
place,
args.data,
cfg['train']['batch_size'],
nranks,
local_rank,
is_vocoder=True).reader()
for epoch in range(cfg['train']['max_epochs']):
pbar = tqdm(reader)
for i, data in enumerate(pbar):
pbar.set_description('Processing at epoch %d' % epoch)
mel, mag = data
mag = dg.to_variable(mag.numpy())
mel = dg.to_variable(mel.numpy())
global_step += 1
mag_pred = model(mel)
loss = layers.mean(
layers.abs(layers.elementwise_sub(mag_pred, mag)))
if parallel:
loss = model.scale_loss(loss)
loss.backward()
model.apply_collective_grads()
else:
loss.backward()
optimizer.minimize(loss)
model.clear_gradients()
if local_rank == 0:
writer.add_scalars('training_loss', {
'loss': loss.numpy(),
}, global_step)
# save checkpoint
if local_rank == 0 and global_step % cfg['train'][
'checkpoint_interval'] == 0:
io.save_parameters(
os.path.join(args.output, 'checkpoints'), global_step,
model, optimizer)
if local_rank == 0:
writer.close()
cfg['train']['warm_up_step']),
parameter_list=model.parameters(),
grad_clip=fluid.clip.GradientClipByGlobalNorm(cfg['train'][
'grad_clip_thresh']))
# Load parameters.
global_step = io.load_parameters(
model=model,
optimizer=optimizer,
checkpoint_dir=os.path.join(args.output, 'checkpoints'),
iteration=args.iteration,
checkpoint_path=args.checkpoint)
print("Rank {}: checkpoint loaded.".format(local_rank))
if parallel:
strategy = dg.parallel.prepare_context()
model = fluid.dygraph.parallel.DataParallel(model, strategy)
reader = LJSpeechLoader(
cfg['audio'],
place,
args.data,
cfg['train']['batch_size'],
nranks,
local_rank,
is_vocoder=True).reader()
for epoch in range(cfg['train']['max_epochs']):
pbar = tqdm(reader)
for i, data in enumerate(pbar):
pbar.set_description('Processing at epoch %d' % epoch)
mel, mag = data
mag = dg.to_variable(mag.numpy())
mel = dg.to_variable(mel.numpy())
global_step += 1
mag_pred = model(mel)
loss = layers.mean(
layers.abs(layers.elementwise_sub(mag_pred, mag)))
if parallel:
loss = model.scale_loss(loss)
loss.backward()
model.apply_collective_grads()
else:
loss.backward()
optimizer.minimize(loss)
model.clear_gradients()
if local_rank == 0:
writer.add_scalars('training_loss', {'loss': loss.numpy(), },
global_step)
# save checkpoint
if local_rank == 0 and global_step % cfg['train'][
'checkpoint_interval'] == 0:
io.save_parameters(
os.path.join(args.output, 'checkpoints'), global_step,
model, optimizer)
if local_rank == 0:
writer.close()
if __name__ == '__main__':
......
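The optimizer setup in the vocoder hunk above passes `1 / (warm_up_step * learning_rate**2)` as the first argument of `dg.NoamDecay`. The following is a minimal sketch of why that choice makes the schedule peak at the configured `learning_rate`. It assumes the standard Noam formula, `d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)`. The function `noam_lr` and the numeric values are hypothetical stand-ins, not code from the repository.

```python
def noam_lr(step, d_model, warmup_steps):
    """Noam schedule (assumed formula): linear warm-up, then inverse-sqrt decay."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# Hypothetical config values, standing in for cfg['train'][...].
warm_up_step = 4000
learning_rate = 0.001

# Same expression as the first argument to dg.NoamDecay in the diff.
d_model = 1 / (warm_up_step * learning_rate ** 2)

# At step == warm_up_step the two branches of min() meet, and the
# d_model ** -0.5 factor cancels the warm-up term, leaving learning_rate.
peak = noam_lr(warm_up_step, d_model, warm_up_step)
before = noam_lr(1000, d_model, warm_up_step)
after = noam_lr(16000, d_model, warm_up_step)
print(peak, before, after)
```

Under this assumption, `learning_rate` in the config acts as the peak rate reached at the end of warm-up rather than as a constant rate.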
......@@ -125,20 +125,15 @@ def load_parameters(model,
model_dict, optimizer_dict = dg.load_dygraph(checkpoint_path)
state_dict = model.state_dict()
dict_new = {}
# cast to desired data type, for mixed-precision training/inference.
for k, v in model_dict.items():
if k in state_dict and convert_np_dtype(v.dtype) != state_dict[
k].dtype:
model_dict[k] = v.astype(state_dict[k].numpy().dtype)
if k.startswith('_layers.'):
k = k[8:]
model.set_dict(model_dict)
if k in state_dict:
if convert_np_dtype(v.dtype) != state_dict[k].dtype:
v = v.astype(state_dict[k].numpy().dtype)
dict_new[k] = v
model.set_dict(dict_new)
print("[checkpoint] Rank {}: loaded model from {}.pdparams".format(
local_rank, checkpoint_path))
......
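The `load_parameters` hunk above strips the `_layers.` prefix that `DataParallel` adds to parameter names and casts each array to the dtype the current model expects. Below is a minimal sketch of that remapping on plain NumPy arrays. The helper `remap_checkpoint` and the sample keys are hypothetical. Because the flattened diff loses indentation, dropping keys that are absent from the current model is an assumption here.

```python
import numpy as np


def remap_checkpoint(model_dict, state_dict):
    """Strip the '_layers.' prefix and match dtypes against state_dict.

    Keys that the current model does not have are dropped (an assumption).
    """
    prefix = '_layers.'
    remapped = {}
    for key, value in model_dict.items():
        # Checkpoints saved from a DataParallel wrapper carry this prefix.
        if key.startswith(prefix):
            key = key[len(prefix):]
        if key in state_dict:
            target_dtype = state_dict[key].dtype
            # Cast only when the stored dtype differs from the model's dtype.
            if value.dtype != target_dtype:
                value = value.astype(target_dtype)
            remapped[key] = value
    return remapped


# Hypothetical checkpoint saved from a parallel run, in float64.
checkpoint = {
    '_layers.encoder.alpha': np.ones(1, dtype=np.float64),
    '_layers.unused.weight': np.zeros(2, dtype=np.float32),
}
# Hypothetical single-card model state, in float32.
current = {'encoder.alpha': np.zeros(1, dtype=np.float32)}

result = remap_checkpoint(checkpoint, current)
print(sorted(result), result['encoder.alpha'].dtype)
```

Building a new dictionary instead of mutating `model_dict` during iteration keeps renamed keys from colliding with the originals, which appears to be the reason the diff introduces `dict_new`.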