Unverified commit 047092de, authored by zxcd, committed by GitHub

add wav2vev2_zh aishell recipe, and speechbrain dataloader. (#2916)

Parent 66a9cf8e
# Wav2vec2ASR with Aishell
This example contains code used to fine-tune the [wav2vec2.0](https://arxiv.org/pdf/2006.11477.pdf) model on the [Aishell dataset](http://www.openslr.org/resources/33).
## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function |
|:---- |:----------------------------------------------------------- |
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Calculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development and test dataset<br> (5) Download the pretrained wav2vec2 model |
| 1 | Train the model |
| 2 | Get the final model by averaging the top-k models; setting k = 1 selects the single best model |
| 3 | Test the final model performance |
| 4 | Infer the single audio file |
You can choose to run a range of stages by setting `stage` and `stop_stage`.
For example, if you want to execute the code in stage 2 and stage 3, you can run this script:
```bash
bash run.sh --stage 2 --stop_stage 3
```
Or you can set `stage` equal to `stop_stage` to run only one stage.
For example, if you only want to run `stage 0`, you can use the script below:
```bash
bash run.sh --stage 0 --stop_stage 0
```
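The stage selection can be pictured as a simple range check. The sketch below is only an illustration in Python (the recipe itself does this in shell); the function name `stages_to_run` is hypothetical, and the stage numbers 0 to 4 are taken from the table above.

```python
def stages_to_run(stage, stop_stage, all_stages=range(0, 5)):
    """Return the stage numbers whose guard would pass.

    Mirrors the shell condition used for each stage N:
        [ ${stage} -le N ] && [ ${stop_stage} -ge N ]
    """
    return [n for n in all_stages if stage <= n <= stop_stage]


print(stages_to_run(2, 3))  # stages selected by --stage 2 --stop_stage 3
print(stages_to_run(0, 0))  # stages selected by --stage 0 --stop_stage 0
```

If `stage` is larger than `stop_stage`, no guard passes and nothing runs.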
The document below will describe the scripts in `run.sh` in detail.
## The Environment Variables
`path.sh` contains the environment variables.
```bash
. ./path.sh
. ./cmd.sh
```
These scripts need to be sourced first. Another script is also needed:
```bash
source ${MAIN_ROOT}/utils/parse_options.sh
```
It enables passing options in the form `--variable value` to the shell scripts.
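As a rough mental model of the `--variable value` convention, the sketch below overrides a table of defaults from an argument list. It is a simplified Python illustration, not the logic of `parse_options.sh` itself, which may handle more cases; the function name `parse_options` and the default values shown are assumptions for the example.

```python
def parse_options(defaults, argv):
    """Override entries of `defaults` from a list like ["--gpus", "0,1"].

    Values stay strings, as they would in a shell script.
    """
    options = dict(defaults)
    i = 0
    while i < len(argv):
        arg = argv[i]
        if not arg.startswith("--"):
            break  # the first positional argument ends option parsing
        name = arg[2:]
        if name not in options:
            raise ValueError(f"unknown option: {arg}")
        if i + 1 >= len(argv):
            raise ValueError(f"missing value for {arg}")
        options[name] = argv[i + 1]
        i += 2
    return options


# Hypothetical defaults for illustration only.
defaults = {"gpus": "0,1,2,3", "stage": "0", "stop_stage": "4", "avg_num": "1"}
print(parse_options(defaults, ["--stage", "2", "--stop_stage", "3"]))
```

Only variables that already have a default can be overridden in this sketch, which is why an unknown name raises an error instead of being silently accepted.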
## The Local Variables
Some local variables are set in `run.sh`.
`gpus` denotes the IDs of the GPUs you want to use. If you set `gpus=`, only the CPU is used.
`stage` denotes the number of the stage you want to start from in the experiments.
`stop_stage` denotes the number of the stage you want to end at in the experiments.
`conf_path` denotes the config path of the model.
`avg_num` denotes the number K of top-K models you want to average to get the final model.
`audio_file` denotes the file path of the single audio file you want to infer in stage 4.
`ckpt` denotes the checkpoint prefix of the model, e.g. "wav2vec2ASR".
You can set the local variables (except `ckpt`) when you use `run.sh`.
For example, you can set the `gpus` and `avg_num` when you use the command line:
```bash
bash run.sh --gpus 0,1 --avg_num 20
```
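The scripts in this recipe derive the number of GPUs by counting the comma-separated fields of `CUDA_VISIBLE_DEVICES` with `awk`. The Python sketch below reproduces that counting rule for illustration; the function name `count_gpus` is hypothetical.

```python
def count_gpus(cuda_visible_devices):
    """Count comma-separated device IDs; an empty string means CPU only."""
    if not cuda_visible_devices:
        return 0
    return len(cuda_visible_devices.split(","))


for value in ["0,1", "0,1,2,3", "0", ""]:
    print(repr(value), "->", count_gpus(value))
```

This is why setting `gpus=` (an empty value) results in a CPU-only run: the count becomes zero.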
## Stage 0: Data Processing
To use this example, you first need to process the data. You can use stage 0 in `run.sh` to do this. The code is shown below:
```bash
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
bash ./local/data.sh || exit -1
fi
```
Stage 0 is for processing the data.
If you only want to process the data, you can run:
```bash
bash run.sh --stage 0 --stop_stage 0
```
You can also just run these scripts in your command line.
```bash
. ./path.sh
. ./cmd.sh
bash ./local/data.sh
```
After processing the data, the `data` directory will look like this:
```bash
data/
|-- dev.meta
|-- lang_char
| `-- vocab.txt
|-- manifest.dev
|-- manifest.dev.raw
|-- manifest.test
|-- manifest.test.raw
|-- manifest.train
|-- manifest.train.raw
|-- mean_std.json
|-- test.meta
|-- train.meta
|-- train.csv
|-- dev.csv
|-- test.csv
```
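The `train.csv`, `dev.csv`, and `test.csv` files are written by `local/aishell_prepare.py` with the columns `ID`, `duration`, `wav`, and `transcript`. As a quick sanity check, such a file can be read with Python's standard `csv` module. The sketch below uses an in-memory sample with made-up rows, so the paths and transcripts are placeholders, and `load_manifest` is a hypothetical helper name.

```python
import csv
import io

# Made-up sample content; a real file would be opened with open("data/train.csv").
sample = io.StringIO(
    "ID,duration,wav,transcript\n"
    "0,4.2,/path/to/utt_0001.wav,hypothetical transcript one\n"
    "1,3.1,/path/to/utt_0002.wav,hypothetical transcript two\n"
)


def load_manifest(fileobj):
    """Read the CSV rows into dictionaries, converting duration to float."""
    rows = []
    for row in csv.DictReader(fileobj):
        row["duration"] = float(row["duration"])
        rows.append(row)
    return rows


rows = load_manifest(sample)
total_seconds = sum(r["duration"] for r in rows)
print(len(rows), "utterances,", round(total_seconds, 2), "seconds")
```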
Stage 0 also downloads the Chinese pre-trained [wav2vec2](https://paddlespeech.bj.bcebos.com/wav2vec/chinese-wav2vec2-large.pdparams) model.
```bash
mkdir -p exp/wav2vec2
wget -P exp/wav2vec2 https://paddlespeech.bj.bcebos.com/wav2vec/chinese-wav2vec2-large.pdparams
```
## Stage 1: Model Training
If you want to train the model, you can use stage 1 in `run.sh`. The code is shown below.
```bash
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `exp` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${ckpt}
fi
```
If you want to train the model, you can use the script below to execute stage 0 and stage 1:
```bash
bash run.sh --stage 0 --stop_stage 1
```
Or you can run these scripts directly in the command line (CPU only).
```bash
. ./path.sh
. ./cmd.sh
bash ./local/data.sh
CUDA_VISIBLE_DEVICES= ./local/train.sh conf/wav2vec2ASR.yaml wav2vec2ASR
```
## Stage 2: Top-k Models Averaging
After training the model, we need to get the final model for testing and inference. A checkpoint is saved every epoch, so we can either choose the best checkpoint based on the validation loss, or sort the checkpoints and average the parameters of the top-k models to get the final model. Stage 2 does this, and the code is shown below. Note: We only train one epoch for wav2vec2ASR, thus `avg_num` is set to 1.
```bash
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# avg n best model
avg.sh best exp/${ckpt}/checkpoints ${avg_num}
fi
```
`avg.sh` is located in `../../../utils/`, which is made available on `PATH` by `path.sh`.
If you want to get the final model, you can use the script below to execute stage 0, stage 1, and stage 2:
```bash
bash run.sh --stage 0 --stop_stage 2
```
Or you can run these scripts directly in the command line (CPU only).
```bash
. ./path.sh
. ./cmd.sh
bash ./local/data.sh
CUDA_VISIBLE_DEVICES= ./local/train.sh conf/wav2vec2ASR.yaml wav2vec2ASR
avg.sh best exp/wav2vec2ASR/checkpoints 1
```
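To make the idea of top-k averaging concrete, here is a minimal sketch in plain Python. It is an illustration of the technique under simplifying assumptions, not the implementation behind `avg.sh`: each checkpoint is represented as a dictionary holding a validation loss and flat lists of floats, and the name `average_top_k` is hypothetical.

```python
def average_top_k(checkpoints, k):
    """Average the parameters of the k checkpoints with the lowest validation loss.

    checkpoints: list of dicts with "val_loss" (float) and
                 "params" (parameter name -> list of floats).
    """
    if not 1 <= k <= len(checkpoints):
        raise ValueError("k must be between 1 and the number of checkpoints")
    best = sorted(checkpoints, key=lambda c: c["val_loss"])[:k]
    averaged = {}
    for name in best[0]["params"]:
        # Group the i-th value of this parameter across the selected checkpoints.
        columns = zip(*(c["params"][name] for c in best))
        averaged[name] = [sum(col) / k for col in columns]
    return averaged


checkpoints = [
    {"val_loss": 0.9, "params": {"w": [1.0, 2.0]}},
    {"val_loss": 0.5, "params": {"w": [3.0, 4.0]}},
    {"val_loss": 0.7, "params": {"w": [5.0, 6.0]}},
]
print(average_top_k(checkpoints, 2))
print(average_top_k(checkpoints, 1))
```

With `k = 1` the result is simply the best checkpoint's parameters, which matches the behavior described for `avg_num=1`.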
## Stage 3: Model Testing
The test stage evaluates the model performance. The code for the test stage is shown below:
```bash
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# test ckpt avg_n
CUDA_VISIBLE_DEVICES=0 ./local/test.sh ${conf_path} ${decode_conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} || exit -1
fi
```
If you want to train a model and test it, you can use the script below to execute stage 0, stage 1, stage 2, and stage 3:
```bash
bash run.sh --stage 0 --stop_stage 3
```
Or you can run these scripts directly in the command line (CPU only).
```bash
. ./path.sh
. ./cmd.sh
bash ./local/data.sh
CUDA_VISIBLE_DEVICES= ./local/train.sh conf/wav2vec2ASR.yaml wav2vec2ASR
avg.sh best exp/wav2vec2ASR/checkpoints 1
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/wav2vec2ASR.yaml conf/tuning/decode.yaml exp/wav2vec2ASR/checkpoints/avg_1
```
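For Chinese ASR, performance is usually reported as character error rate (CER): the edit distance between the reference and the hypothesis, divided by the number of reference characters. The sketch below is a small self-contained illustration of that metric, assuming spaces are removed before scoring; it is not the scoring tool the test script calls, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,              # deletion
                cur[j - 1] + 1,           # insertion
                prev[j - 1] + (r != h),   # substitution or match
            ))
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate, ignoring spaces in both strings."""
    ref = ref.replace(" ", "")
    hyp = hyp.replace(" ", "")
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("abcd", "abed"))
print(cer("祝 可爱 的 你", "祝可爱的你"))
```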
## Pretrained Model
You can get the pretrained wav2vec2ASR model from [the released models page](../../../docs/source/released_model.md).
Use `tar` to unpack the model, and then use the test script to evaluate it.
For example:
```bash
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr3/wav2vec2ASR-large-aishell1_ckpt_1.3.0.model.tar.gz
tar xzvf wav2vec2ASR-large-aishell1_ckpt_1.3.0.model.tar.gz
source path.sh
# If you have already processed the data and generated the manifest files, you can skip the following 2 steps
bash local/data.sh --stage -1 --stop_stage -1
bash local/data.sh --stage 2 --stop_stage 2
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/wav2vec2ASR.yaml conf/tuning/decode.yaml exp/wav2vec2ASR/checkpoints/avg_1
```
The performance of the released models is shown [here](./RESULTS.md).
## Stage 4: Single Audio File Inference
In some situations, you may want to use the trained model to run inference on a single audio file. You can use stage 4 for this. The code is shown below:
```bash
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# test a single .wav file
CUDA_VISIBLE_DEVICES=0 ./local/test_wav.sh ${conf_path} ${decode_conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} ${audio_file} || exit -1
fi
```
You can train the model yourself using `bash run.sh --stage 0 --stop_stage 3`, or you can download the pretrained model with the script below:
```bash
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr3/wav2vec2ASR-large-aishell1_ckpt_1.3.0.model.tar.gz
tar xzvf wav2vec2ASR-large-aishell1_ckpt_1.3.0.model.tar.gz
```
You can download the audio demo:
```bash
wget -nc https://paddlespeech.bj.bcebos.com/datasets/single_wav/en/demo_002_en.wav -P data/
```
You need to prepare an audio file or use the audio demo above. Please confirm that the sample rate of the audio is 16 kHz. You can get the result for the audio demo by running the script below.
```bash
CUDA_VISIBLE_DEVICES= ./local/test_wav.sh conf/wav2vec2ASR.yaml conf/tuning/decode.yaml exp/wav2vec2ASR/checkpoints/avg_1 data/demo_002_en.wav
```
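One way to confirm the sample rate before running inference is Python's standard `wave` module, which reads the header of uncompressed PCM WAV files. The sketch below builds a tiny silent WAV in memory so that it runs without any external file; for a real check you would pass the path of your own audio file instead. The helper name `wav_sample_rate` is hypothetical.

```python
import io
import wave


def wav_sample_rate(source):
    """Return the sample rate in Hz; `source` is a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getframerate()


# Build a one-second silent 16 kHz mono WAV in memory for the demonstration.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 16-bit samples
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)
buffer.seek(0)

rate = wav_sample_rate(buffer)
print("sample rate:", rate)
if rate != 16000:
    print("Resample the audio to 16 kHz before inference.")
```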
# ====== About run.pl, queue.pl, slurm.pl, and ssh.pl ======
# Usage: <cmd>.pl [options] JOB=1:<nj> <log> <command...>
# e.g.
# run.pl --mem 4G JOB=1:10 echo.JOB.log echo JOB
#
# Options:
# --time <time>: Limit the maximum time to execute.
# --mem <mem>: Limit the maximum memory usage.
# --max-jobs-run <njob>: Limit the number of parallel jobs. This is ignored for non-array jobs.
# --num-threads <ngpu>: Specify the number of CPU core.
# --gpu <ngpu>: Specify the number of GPU devices.
# --config: Change the configuration file from default.
#
# "JOB=1:10" is used for "array jobs" and it can control the number of parallel jobs.
# The left string of "=", i.e. "JOB", is replaced by <N>(Nth job) in the command and the log file name,
# e.g. "echo JOB" is changed to "echo 3" for the 3rd job and "echo 8" for 8th job respectively.
# Note that the number must start with a positive number, so you can't use "JOB=0:10" for example.
#
# run.pl, queue.pl, slurm.pl, and ssh.pl have unified interface, not depending on its backend.
# These options are mapping to specific options for each backend and
# it is configured by "conf/queue.conf" and "conf/slurm.conf" by default.
# If jobs failed, your configuration might be wrong for your environment.
#
#
# The official documentation for run.pl, queue.pl, slurm.pl, and ssh.pl:
# "Parallelization in Kaldi": http://kaldi-asr.org/doc/queue.html
# =========================================================~
# Select the backend used by run.sh from "local", "sge", "slurm", or "ssh"
cmd_backend='local'
# Local machine, without any Job scheduling system
if [ "${cmd_backend}" = local ]; then
# The other usage
export train_cmd="run.pl"
# Used for "*_train.py": "--gpu" is appended optionally by run.sh
export cuda_cmd="run.pl"
# Used for "*_recog.py"
export decode_cmd="run.pl"
# "qsub" (SGE, Torque, PBS, etc.)
elif [ "${cmd_backend}" = sge ]; then
# The default setting is written in conf/queue.conf.
# You must change "-q g.q" for the "queue" for your environment.
# To know the "queue" names, type "qhost -q"
# Note that to use "--gpu *", you have to setup "complex_value" for the system scheduler.
export train_cmd="queue.pl"
export cuda_cmd="queue.pl"
export decode_cmd="queue.pl"
# "sbatch" (Slurm)
elif [ "${cmd_backend}" = slurm ]; then
# The default setting is written in conf/slurm.conf.
# You must change "-p cpu" and "-p gpu" for the "partition" for your environment.
# To know the "partition" names, type "sinfo".
# You can use "--gpu * " by default for slurm and it is interpreted as "--gres gpu:*"
# The devices are allocated exclusively using "${CUDA_VISIBLE_DEVICES}".
export train_cmd="slurm.pl"
export cuda_cmd="slurm.pl"
export decode_cmd="slurm.pl"
elif [ "${cmd_backend}" = ssh ]; then
# You have to create ".queue/machines" to specify the host to execute jobs.
# e.g. .queue/machines
# host1
# host2
# host3
# This assumes you can log in to them without a password, i.e. you have to set up SSH keys.
export train_cmd="ssh.pl"
export cuda_cmd="ssh.pl"
export decode_cmd="ssh.pl"
# This is an example of specifying several unique options in the JHU CLSP cluster setup.
# Users can modify/add their own command options according to their cluster environments.
elif [ "${cmd_backend}" = jhu ]; then
export train_cmd="queue.pl --mem 2G"
export cuda_cmd="queue-freegpu.pl --mem 2G --gpu 1 --config conf/gpu.conf"
export decode_cmd="queue.pl --mem 4G"
else
echo "$0: Error: Unknown cmd_backend=${cmd_backend}" 1>&2
return 1
fi
process:
  # use raw audio
  - type: wav_process
# Copyright (c) 2023 speechbrain Authors. All Rights Reserved.
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Modified from speechbrain 2023 (https://github.com/speechbrain/speechbrain/blob/develop/recipes/AISHELL-1/ASR/CTC/hparams/train_with_wav2vec.yaml)
# ############################################################################
# Model: CTC-wav2vec2
# Encoder: wav2vec2
# Decoder: -
# Tokens: Char
# losses: CTC
# Training: AISHELL-1
# Authors: Yingzhi WANG 2022
# ############################################################################
output_folder: !ref data
cer_file: !ref <output_folder>/cer.txt
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt
# Data files
data_folder: data/aishell # e.g., /path/to/aishell
skip_prep: False
ckpt_interval_minutes: 15 # save checkpoint every N min
train_data: !ref <output_folder>/train.csv
valid_data: !ref <output_folder>/dev.csv
test_data: !ref <output_folder>/test.csv
wav2vec2_hub: TencentGameMate/chinese-wav2vec2-large
# Training parameters
number_of_epochs: 80
lr: 1.0
lr_wav2vec: 0.0001
sorting: ascending
auto_mix_prec: False
sample_rate: 16000
# With data_parallel batch_size is split into N jobs
# With DDP batch_size is multiplied by N jobs
# Must be 8 per GPU to fit 32GB of VRAM
batch_size: 5
test_batch_size: 1 # need set to 1 when decoding
dynamic_batching: False
dynamic_batch_sampler:
    feats_hop_size: 0.01
    max_batch_len: 15 # in terms of "duration" in annotations by default, second here
    left_bucket_len: 200 # old implementation attributes
    multiplier: 1.1 # old implementation attributes
    shuffle_ex: False # if true re-creates batches at each epoch shuffling examples.
    num_buckets: 10 # floor(log(max_batch_len/left_bucket_len, multiplier)) + 1
    batch_ordering: ascending
num_workers: 6
# Dataloader options
train_dataloader_opts:
    batch_size: !ref <batch_size>
    num_workers: !ref <num_workers>
valid_dataloader_opts:
    batch_size: !ref <test_batch_size>
    num_workers: !ref <num_workers>
test_dataloader_opts:
    batch_size: !ref <test_batch_size>
    num_workers: !ref <num_workers>
wav2vec_output_dim: 1024
dnn_neurons: 1024
freeze_wav2vec: False
dropout: 0.15
tokenizer: !apply:transformers.BertTokenizer.from_pretrained
    pretrained_model_name_or_path: bert-base-chinese
# bert-base-chinese tokens length
output_neurons: 21128
# Decoding parameters
# Be sure that the bos and eos index match with the BPEs ones
blank_index: 0
# AISHELL-1 has spaces between words in the transcripts,
# which Chinese writing normally does not do.
# If remove_spaces, spaces are removed
# from the transcript before computing CER.
# (e.g., 祝 可爱 的 你 —> 祝可爱的你)
remove_spaces: True
split_tokens: !apply:operator.not_ [!ref <remove_spaces>]
decode_batch_size: 1
error_rate_type: cer
decoding_method: ctc_greedy_search # 'ctc_greedy_search', 'ctc_prefix_beam_search'
beam_size: 10
############################################
# Network Architecture #
############################################
freeze_wav2vec2: False
normalize_wav: True
output_norm: True
init_type: 'kaiming_uniform' # !Warning: need to convergence
enc:
    input_shape: 1024
    dnn_blocks: 3
    dnn_neurons: 1024
    activation: True
    normalization: True
    dropout_rate: [0.15, 0.15, 0.0]
ctc:
    enc_n_units: 1024
    blank_id: 0
    dropout_rate: 0.0
audio_augment:
    speeds: [90, 100, 110]
spec_augment:
    time_warp: True
    time_warp_window: 5
    time_warp_mode: bicubic
    freq_mask: True
    n_freq_mask: 2
    time_mask: True
    n_time_mask: 2
    replace_with_zero: False
    freq_mask_width: 30
    time_mask_width: 40
wav2vec2_params_path: exp/wav2vec2/chinese-wav2vec2-large.pdparams
############################################
# Wav2Vec2.0 #
############################################
# vocab_size: 1000000
hidden_size: 1024
num_hidden_layers: 24
num_attention_heads: 16
intermediate_size: 4096
hidden_act: gelu
hidden_dropout: 0.1
activation_dropout: 0.0
attention_dropout: 0.1
feat_proj_dropout: 0.1
feat_quantizer_dropout: 0.0
final_dropout: 0.0
layerdrop: 0.1
initializer_range: 0.02
layer_norm_eps: 1e-5
feat_extract_norm: layer
feat_extract_activation: gelu
conv_dim: [512, 512, 512, 512, 512, 512, 512]
conv_stride: [5, 2, 2, 2, 2, 2, 2]
conv_kernel: [10, 3, 3, 3, 3, 2, 2]
conv_bias: True
num_conv_pos_embeddings: 128
num_conv_pos_embedding_groups: 16
do_stable_layer_norm: True
apply_spec_augment: False
mask_channel_length: 10
mask_channel_min_space: 1
mask_channel_other: 0.0
mask_channel_prob: 0.0
mask_channel_selection: static
mask_feature_length: 10
mask_feature_min_masks: 0
mask_feature_prob: 0.0
mask_time_length: 10
mask_time_min_masks: 2
mask_time_min_space: 1
mask_time_other: 0.0
mask_time_prob: 0.075
mask_time_selection: static
num_codevectors_per_group: 320
num_codevector_groups: 2
contrastive_logits_temperature: 0.1
num_negatives: 100
codevector_dim: 256
proj_codevector_dim: 256
diversity_loss_weight: 0.1
use_weighted_layer_sum: False
# pad_token_id: 0
# bos_token_id: 1
# eos_token_id: 2
add_adapter: False
adapter_kernel_size: 3
adapter_stride: 2
num_adapter_layers: 3
output_hidden_size: None
###########################################
# Data #
###########################################
train_manifest: data/manifest.train
dev_manifest: data/manifest.dev
test_manifest: data/manifest.test
vocab_filepath: data/lang_char/vocab.txt
###########################################
# Dataloader #
###########################################
unit_type: 'char'
mean_std_filepath:
preprocess_config: conf/preprocess.yaml
sortagrad: -1 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
batch_size: 5 # Different batch_size may cause large differences in results
maxlen_in: 51200000000 # if input length > maxlen-in batchsize is automatically reduced
maxlen_out: 1500000 # if output length > maxlen-out batchsize is automatically reduced
minibatches: 0 # for debug
batch_count: auto
batch_bins: 0
batch_frames_in: 0
batch_frames_out: 0
batch_frames_inout: 0
num_workers: 6
subsampling_factor: 1
num_encs: 1
dist_sampler: True
shortest_first: True
return_lens_rate: True
###########################################
# use speechbrain dataloader #
###########################################
use_sb_pipeline: True # whether use speechbrain pipeline. Default is True.
sb_pipeline_conf: conf/train_with_wav2vec.yaml
###########################################
# Training #
###########################################
n_epoch: 80
accum_grad: 1
global_grad_clip: 5.0
model_optim: adadelta
model_optim_conf:
    lr: 1.0
    weight_decay: 0.0
    rho: 0.95
    epsilon: 1.0e-8
wav2vec2_optim: adam
wav2vec2_optim_conf:
    lr: 0.0001
    weight_decay: 0.0
model_scheduler: newbobscheduler
model_scheduler_conf:
    improvement_threshold: 0.0025
    annealing_factor: 0.8
    patient: 0
wav2vec2_scheduler: newbobscheduler
wav2vec2_scheduler_conf:
    improvement_threshold: 0.0025
    annealing_factor: 0.9
    patient: 0
log_interval: 1
checkpoint:
    kbest_n: 50
    latest_n: 5
# Copyright (c) 2023 speechbrain Authors. All Rights Reserved.
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Modified from speechbrain 2023
# (https://github.com/speechbrain/speechbrain/blob/develop/recipes/AISHELL-1/aishell_prepare.py)
import argparse
import csv
import glob
import logging
import os
from paddlespeech.s2t.models.wav2vec2.io.dataio import read_audio
logger = logging.getLogger(__name__)
DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech')
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--data_folder",
default=DATA_HOME + "/Aishell",
type=str,
help="Directory to save the dataset. (default: %(default)s)")
parser.add_argument(
"--save_folder",
default="data/",
type=str,
help="Filepath prefix for output manifests. (default: %(default)s)")
parser.add_argument(
"--skip_prep",
default=False,
type=bool,
help="If True, skip data preparation. (default: %(default)s)")
args = parser.parse_args()
def prepare_aishell(data_folder, save_folder, skip_prep=False):
    """
    This function prepares the AISHELL-1 dataset.
    If the folder does not exist, the zip file will be extracted. If the zip file does not exist, it will be downloaded.
    data_folder : path to AISHELL-1 dataset.
    save_folder: path where to store the manifest csv files.
    skip_prep: If True, skip data preparation.
    """
    if skip_prep:
        return

    # Create filename-to-transcript dictionary
    filename2transcript = {}
    with open(
            os.path.join(data_folder,
                         "data_aishell/transcript/aishell_transcript_v0.8.txt"),
            "r", ) as f:
        lines = f.readlines()
        for line in lines:
            key = line.split()[0]
            value = " ".join(line.split()[1:])
            filename2transcript[key] = value

    splits = [
        "train",
        "dev",
        "test",
    ]
    ID_start = 0  # needed to have a unique ID for each audio
    for split in splits:
        new_filename = os.path.join(save_folder, split) + ".csv"
        if os.path.exists(new_filename):
            continue
        logger.info("Preparing %s..." % new_filename)

        csv_output = [["ID", "duration", "wav", "transcript"]]
        entry = []

        all_wavs = glob.glob(
            os.path.join(data_folder, "data_aishell/wav") + "/" + split +
            "/*/*.wav")
        for i in range(len(all_wavs)):
            filename = all_wavs[i].split("/")[-1].split(".wav")[0]
            if filename not in filename2transcript:
                continue
            signal = read_audio(all_wavs[i])
            duration = signal.shape[0] / 16000
            transcript_ = filename2transcript[filename]
            csv_line = [
                ID_start + i,
                str(duration),
                all_wavs[i],
                transcript_,
            ]
            entry.append(csv_line)

        csv_output = csv_output + entry

        with open(new_filename, mode="w") as csv_f:
            csv_writer = csv.writer(
                csv_f, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
            for line in csv_output:
                csv_writer.writerow(line)

        msg = "\t%s successfully created!" % (new_filename)
        logger.info(msg)

        ID_start += len(all_wavs)


def main():
    if args.data_folder.startswith('~'):
        args.data_folder = os.path.expanduser(args.data_folder)
    prepare_aishell(args.data_folder, args.save_folder, skip_prep=False)
    print("Data csv prepare done!")


if __name__ == '__main__':
    main()
#!/bin/bash
stage=-1
stop_stage=-1
dict_dir=data/lang_char
. ${MAIN_ROOT}/utils/parse_options.sh || exit -1;
mkdir -p data
mkdir -p ${dict_dir}
TARGET_DIR=${MAIN_ROOT}/dataset
mkdir -p ${TARGET_DIR}
if [ ${stage} -le -1 ] && [ ${stop_stage} -ge -1 ]; then
# download data, generate manifests
python3 ${TARGET_DIR}/aishell/aishell.py \
--manifest_prefix="data/manifest" \
--target_dir="${TARGET_DIR}/aishell"
#generate csv file for speechbrain dataloader
python3 local/aishell_prepare.py \
--data_folder="${TARGET_DIR}/aishell" \
--save_folder="data/"
if [ $? -ne 0 ]; then
echo "Prepare Aishell failed. Terminated."
exit 1
fi
for dataset in train dev test; do
mv data/manifest.${dataset} data/manifest.${dataset}.raw
done
fi
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# compute mean and stddev for normalizer
num_workers=$(nproc)
python3 ${MAIN_ROOT}/utils/compute_mean_std.py \
--manifest_path="data/manifest.train.raw" \
--spectrum_type="fbank" \
--feat_dim=80 \
--delta_delta=false \
--stride_ms=10 \
--window_ms=25 \
--sample_rate=16000 \
--use_dB_normalization=False \
--num_samples=-1 \
--num_workers=${num_workers} \
--output_path="data/mean_std.json"
if [ $? -ne 0 ]; then
echo "Compute mean and stddev failed. Terminated."
exit 1
fi
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# download data, generate manifests
# build vocabulary
python3 ${MAIN_ROOT}/utils/build_vocab.py \
--unit_type="char" \
--count_threshold=0 \
--vocab_path="${dict_dir}/vocab.txt" \
--manifest_paths "data/manifest.train.raw"
if [ $? -ne 0 ]; then
echo "Build vocabulary failed. Terminated."
exit 1
fi
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# format manifest with tokenids, vocab size
for dataset in train dev test; do
{
python3 ${MAIN_ROOT}/utils/format_data.py \
--cmvn_path "data/mean_std.json" \
--unit_type "char" \
--vocab_path="${dict_dir}/vocab.txt" \
--manifest_path="data/manifest.${dataset}.raw" \
--output_path="data/manifest.${dataset}"
if [ $? -ne 0 ]; then
echo "Format manifest failed. Terminated."
exit 1
fi
} &
done
wait
fi
echo "Aishell data preparation done."
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
mkdir -p exp/wav2vec2
echo "Pretrained wav2vec2 model download"
wget -P exp/wav2vec2 https://paddlespeech.bj.bcebos.com/wav2vec/chinese-wav2vec2-large.pdparams
fi
exit 0
#!/bin/bash
set -e
ngpu=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
echo "using $ngpu gpus..."
expdir=exp
datadir=data
train_set=train_960
recog_set="test-clean test-other dev-clean dev-other"
recog_set="test-clean"
config_path=$1
decode_config_path=$2
ckpt_prefix=$3
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1;
# download language model
#bash local/download_lm_en.sh
#if [ $? -ne 0 ]; then
# exit 1
#fi
python3 utils/format_rsl.py \
--origin_ref data/manifest.test.raw \
--trans_ref data/manifest.test.text
for type in ctc_greedy_search; do
echo "decoding ${type}"
batch_size=1
python3 -u ${BIN_DIR}/test.py \
--ngpu ${ngpu} \
--config ${config_path} \
--decode_cfg ${decode_config_path} \
--result_file ${ckpt_prefix}.${type}.rsl \
--checkpoint_path ${ckpt_prefix} \
--opts decode.decoding_method ${type} \
--opts decode.decode_batch_size ${batch_size}
if [ $? -ne 0 ]; then
echo "Failed in evaluation!"
exit 1
fi
python3 utils/format_rsl.py \
--origin_hyp ${ckpt_prefix}.${type}.rsl \
--trans_hyp ${ckpt_prefix}.${type}.rsl.text
python3 utils/compute-wer.py --char=1 --v=1 \
data/manifest.test.text ${ckpt_prefix}.${type}.rsl.text > ${ckpt_prefix}.${type}.error
echo "decoding ${type} done."
done
for type in ctc_prefix_beam_search; do
echo "decoding ${type}"
batch_size=1
python3 -u ${BIN_DIR}/test.py \
--ngpu ${ngpu} \
--config ${config_path} \
--decode_cfg ${decode_config_path} \
--result_file ${ckpt_prefix}.${type}.rsl \
--checkpoint_path ${ckpt_prefix} \
--opts decode.decoding_method ${type} \
--opts decode.decode_batch_size ${batch_size}
if [ $? -ne 0 ]; then
echo "Failed in evaluation!"
exit 1
fi
python3 utils/format_rsl.py \
--origin_hyp ${ckpt_prefix}.${type}.rsl \
--trans_hyp ${ckpt_prefix}.${type}.rsl.text
python3 utils/compute-wer.py --char=1 --v=1 \
data/manifest.test.text ${ckpt_prefix}.${type}.rsl.text > ${ckpt_prefix}.${type}.error
echo "decoding ${type} done."
done
echo "Finished"
exit 0
#!/bin/bash
if [ $# != 4 ];then
echo "usage: ${0} config_path decode_config_path ckpt_path_prefix audio_file"
exit -1
fi
ngpu=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
echo "using $ngpu gpus..."
config_path=$1
decode_config_path=$2
ckpt_prefix=$3
audio_file=$4
mkdir -p data
wget -nc https://paddlespeech.bj.bcebos.com/datasets/single_wav/en/demo_002_en.wav -P data/
if [ $? -ne 0 ]; then
exit 1
fi
if [ ! -f ${audio_file} ]; then
echo "Please input the right audio_file path"
exit 1
fi
chunk_mode=false
if [[ ${config_path} =~ ^.*chunk_.*yaml$ ]];then
chunk_mode=true
fi
# download language model
#bash local/download_lm_ch.sh
#if [ $? -ne 0 ]; then
# exit 1
#fi
for type in ctc_greedy_search; do
echo "decoding ${type}"
batch_size=1
output_dir=${ckpt_prefix}
mkdir -p ${output_dir}
python3 -u ${BIN_DIR}/test_wav.py \
--ngpu ${ngpu} \
--config ${config_path} \
--decode_cfg ${decode_config_path} \
--result_file ${output_dir}/${type}.rsl \
--checkpoint_path ${ckpt_prefix} \
--opts decode.decoding_method ${type} \
--opts decode.decode_batch_size ${batch_size} \
--audio_file ${audio_file}
if [ $? -ne 0 ]; then
echo "Failed in evaluation!"
exit 1
fi
done
exit 0
#!/bin/bash
if [ $# -lt 2 ] || [ $# -gt 4 ];then
echo "usage: CUDA_VISIBLE_DEVICES=0 ${0} config_path ckpt_name resume(optional) ips(optional)"
exit -1
fi
ngpu=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
echo "using $ngpu gpus..."
config_path=$1
ckpt_name=$2
resume=$3
ips=$4
if [ ! $ips ];then
ips_config=
else
ips_config="--ips="${ips}
fi
mkdir -p exp
# seed may break model convergence
seed=2
if [ ${seed} != 0 ]; then
export FLAGS_cudnn_deterministic=True
fi
# export FLAGS_cudnn_exhaustive_search=true
# export FLAGS_conv_workspace_size_limit=4000
# export FLAGS_allocator_strategy=naive_best_fit
if [ ${ngpu} == 0 ]; then
python3 -u ${BIN_DIR}/train.py \
--ngpu ${ngpu} \
--config ${config_path} \
--output exp/${ckpt_name} \
--seed ${seed} \
--resume ${resume}
else
python3 -m paddle.distributed.launch --log_dir=${ckpt_name} --gpus=${CUDA_VISIBLE_DEVICES} ${ips_config} ${BIN_DIR}/train.py \
--ngpu ${ngpu} \
--config ${config_path} \
--output exp/${ckpt_name} \
--seed ${seed} \
--resume ${resume}
fi
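# Capture the training exit status before any other command overwrites $?
train_status=$?
if [ ${seed} != 0 ]; then
unset FLAGS_cudnn_deterministic
fi
if [ ${train_status} -ne 0 ]; then
echo "Failed in training!"
exit 1
fi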
exit 0
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/tools/sctk/bin:${PWD}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/lib/
MODEL=wav2vec2
export BIN_DIR=${MAIN_ROOT}/paddlespeech/s2t/exps/${MODEL}/bin
#!/bin/bash
set -e
. ./path.sh || exit 1;
. ./cmd.sh || exit 1;
gpus=0,1,2,3
stage=0
stop_stage=4
conf_path=conf/wav2vec2ASR.yaml
ips= #xx.xx.xx.xx,xx.xx.xx.xx
decode_conf_path=conf/tuning/decode.yaml
avg_num=1
resume= # xx e.g. 30
export FLAGS_cudnn_deterministic=1
. ${MAIN_ROOT}/utils/parse_options.sh || exit 1;
audio_file=data/demo_002_en.wav
avg_ckpt=avg_${avg_num}
ckpt=$(basename ${conf_path} | awk -F'.' '{print $1}')
echo "checkpoint name ${ckpt}"
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
bash ./local/data.sh || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `exp` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${ckpt} ${resume} ${ips}
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# avg n best model
avg.sh last exp/${ckpt}/checkpoints ${avg_num}
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# greedy search decoder
CUDA_VISIBLE_DEVICES=0 ./local/test.sh ${conf_path} ${decode_conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} || exit -1
fi
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# test a single .wav file
CUDA_VISIBLE_DEVICES=0 ./local/test_wav.sh ${conf_path} ${decode_conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} ${audio_file} || exit -1
fi
../../../utils
\ No newline at end of file
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. # Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
# #
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License. # you may not use this file except in compliance with the License.
...@@ -17,17 +17,22 @@ import math ...@@ -17,17 +17,22 @@ import math
import os import os
import re import re
import time import time
from collections import defaultdict
from collections import OrderedDict from collections import OrderedDict
from contextlib import nullcontext from contextlib import nullcontext
import jsonlines import jsonlines
import numpy as np import numpy as np
import paddle import paddle
import transformers
from hyperpyyaml import load_hyperpyyaml
from paddle import distributed as dist from paddle import distributed as dist
from paddlespeech.s2t.frontend.featurizer import TextFeaturizer from paddlespeech.s2t.frontend.featurizer import TextFeaturizer
from paddlespeech.s2t.io.dataloader import DataLoaderFactory from paddlespeech.s2t.io.dataloader import DataLoaderFactory
from paddlespeech.s2t.io.speechbrain import data_pipeline
from paddlespeech.s2t.io.speechbrain import dataio
from paddlespeech.s2t.io.speechbrain import dataset
from paddlespeech.s2t.io.speechbrain.dataloader import make_dataloader
from paddlespeech.s2t.models.wav2vec2.processing.speech_augmentation import TimeDomainSpecAugment from paddlespeech.s2t.models.wav2vec2.processing.speech_augmentation import TimeDomainSpecAugment
from paddlespeech.s2t.models.wav2vec2.wav2vec2_ASR import Wav2vec2ASR from paddlespeech.s2t.models.wav2vec2.wav2vec2_ASR import Wav2vec2ASR
from paddlespeech.s2t.training.optimizer import OptimizerFactory from paddlespeech.s2t.training.optimizer import OptimizerFactory
...@@ -45,10 +50,96 @@ from paddlespeech.s2t.utils.utility import UpdateConfig ...@@ -45,10 +50,96 @@ from paddlespeech.s2t.utils.utility import UpdateConfig
logger = Log(__name__).getlog() logger = Log(__name__).getlog()
def clip_grad_norm_(
parameters,
max_norm,
norm_type=2.0,
error_if_nonfinite=False, ):
r"""Clips gradient norm of the iterable parameters.
Norms are calculated together on all gradients, just as they are
connected into one vector. The gradient will be modified in place.
This API can only run in dynamic graph mode, not static graph mode.
Args:
parameters (Iterable[paddle.Tensor] or paddle.Tensor): Tensors or a single Tensor
whose gradients will be clipped
max_norm (float or int): max norm of the gradients
norm_type (float or int): type of the used p-norm. Can be `inf` for
infinity norm.
error_if_nonfinite (bool): if True, throw an error if the total
norm of the gradients from :attr:`parameters` is `nan`,
`inf`, or `-inf`.
Returns:
Total norm of the parameter gradients (treated as a single vector).
Example:
.. code-block:: python
import paddle
x = paddle.uniform([10, 10], min=-1.0, max=1.0, dtype='float32')
max_norm = float(5.0)
linear = paddle.nn.Linear(in_features=10, out_features=10)
out = linear(x)
loss = paddle.mean(out)
loss.backward()
paddle.nn.utils.clip_grad_norm_(linear.parameters(), max_norm)
sdg = paddle.optimizer.SGD(learning_rate=0.1, parameters=linear.parameters())
sdg.step()
"""
if not paddle.in_dynamic_mode():
raise RuntimeError('this API can only run in dynamic mode.')
if isinstance(parameters, paddle.Tensor):
parameters = [parameters]
support_norm_type = [float("inf"), 0, 1, 2]
if norm_type not in support_norm_type:
raise ValueError(f'norm_type only support {support_norm_type}')
grads = [p.grad for p in parameters if p.grad is not None]
max_norm = float(max_norm)
norm_type = float(norm_type)
if len(grads) == 0:
return paddle.to_tensor(0.0)
if norm_type == float("inf"):
norms = [g.detach().abs().max() for g in grads]
total_norm = (norms[0]
if len(norms) == 1 else paddle.max(paddle.stack(norms)))
else:
total_norm = paddle.linalg.norm(
paddle.stack(
[paddle.linalg.norm(g.detach(), norm_type) for g in grads]),
norm_type, )
if error_if_nonfinite and paddle.logical_or(total_norm.isnan(),
total_norm.isinf()):
raise RuntimeError(
f'The total norm of {norm_type} order of the gradients from '
'`parameters` is non-finite, so it cannot be clipped. In any case, '
'disable this error and scale the gradient by non-finite norm, '
'set `error_if_nonfinite=False`')
clip_coef = max_norm / (total_norm + 1e-6)
# Note: when the coef is clamped to 1, it is redundant to multiply the clamped coef, but this
# avoids the `if clip_coef < 1:` condition.
clip_coef_clamped = paddle.clip(clip_coef, max=1.0)
with paddle.no_grad():
for _, p in enumerate(parameters):
g = p.grad
if g is not None:
p.grad = paddle.multiply(x=g, y=clip_coef_clamped)
return total_norm
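The clipping rule used by `clip_grad_norm_` above can be sketched without Paddle. The following is a minimal pure-Python illustration for the L2 case only, assuming gradients are plain lists of floats; `clip_by_global_norm` is a hypothetical helper name, not part of the commit or of Paddle.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale gradient vectors so that their joint L2 norm is at most max_norm.

    `grads` is a list of lists of floats, one inner list per parameter.
    Returns (clipped_grads, total_norm). Hypothetical helper for illustration.
    """
    # Joint L2 norm, computed as if all gradients were one concatenated vector.
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Same coefficient as in the diff: max_norm / (total_norm + eps), capped at 1.
    clip_coef = min(max_norm / (total_norm + eps), 1.0)
    clipped = [[g * clip_coef for g in grad] for grad in grads]
    return clipped, total_norm


grads = [[3.0, 0.0], [0.0, 4.0]]  # joint L2 norm is 5
clipped, total_norm = clip_by_global_norm(grads, max_norm=1.0)
clipped_norm = math.sqrt(sum(g * g for grad in clipped for g in grad))
```

Capping the coefficient at 1 means gradients whose joint norm is already below `max_norm` pass through unchanged, which is why the function in the diff needs no explicit `if clip_coef < 1` branch.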
class Wav2Vec2ASRTrainer(Trainer): class Wav2Vec2ASRTrainer(Trainer):
def __init__(self, config, args): def __init__(self, config, args):
super().__init__(config, args) super().__init__(config, args)
self.avg_train_loss = 0.0 self.avg_train_loss = 0.0
self.loss_isfinite = True # set to False once a NaN or inf loss is seen; such a loss cannot be averaged
self.use_sb = True # whether to use the SpeechBrain dataloader
def update_average(self, batch_index, loss): def update_average(self, batch_index, loss):
"""Update running average of the loss. """Update running average of the loss.
...@@ -62,6 +153,9 @@ class Wav2Vec2ASRTrainer(Trainer): ...@@ -62,6 +153,9 @@ class Wav2Vec2ASRTrainer(Trainer):
if math.isfinite(loss): if math.isfinite(loss):
self.avg_train_loss -= self.avg_train_loss / (batch_index + 1) self.avg_train_loss -= self.avg_train_loss / (batch_index + 1)
self.avg_train_loss += loss / (batch_index + 1) self.avg_train_loss += loss / (batch_index + 1)
else:
self.loss_isfinite = False
logger.info('loss:{} in Nan or inf, error'.format(loss))
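The two-line update in `update_average` is the usual incremental mean. A small standalone sketch, assuming plain Python floats and a free function in place of the trainer method (the function below is illustrative, not the trainer's code):

```python
import math


def update_average(avg, batch_index, loss):
    """Return the running mean after seeing `loss` at 0-based `batch_index`."""
    if not math.isfinite(loss):
        return avg  # NaN/inf losses are skipped, as in the trainer
    avg -= avg / (batch_index + 1)
    avg += loss / (batch_index + 1)
    return avg


losses = [4.0, 2.0, 6.0]
avg = 0.0
for i, loss in enumerate(losses):
    avg = update_average(avg, i, loss)
```

With these three finite losses the result equals their arithmetic mean. If a batch is skipped because its loss is not finite, `batch_index` still advances, so the value is then only an approximation of the mean of the finite losses.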
def before_train(self): def before_train(self):
from_scratch = self.resume_or_scratch() from_scratch = self.resume_or_scratch()
...@@ -81,14 +175,22 @@ class Wav2Vec2ASRTrainer(Trainer): ...@@ -81,14 +175,22 @@ class Wav2Vec2ASRTrainer(Trainer):
start = time.time() start = time.time()
# forward # forward
utt, wav, wavs_lens, target, target_lens = batch ## sb data pipeline
wavs_lens_rate = wavs_lens / wav.shape[1] if self.use_sb:
wav, wavs_lens_rate = batch['sig']
target, target_lens_rate = batch['tokens']
target_lens = (target_lens_rate *
target.shape[1]).round().astype(paddle.int64)
else:
utt, wav, wavs_lens, target, target_lens = batch
wavs_lens_rate = wavs_lens / wav.shape[1]
wav = wav[:, :, 0]
wav = wav[:, :, 0]
if hasattr(train_conf, 'audio_augment'): if hasattr(train_conf, 'audio_augment'):
wav = self.speech_augmentation(wav, wavs_lens_rate) wav = self.speech_augmentation(wav, wavs_lens_rate)
loss = self.model(wav, wavs_lens_rate, target, target_lens) loss = self.model(wav, wavs_lens_rate, target, target_lens)
# loss div by `batch_size * accum_grad` # loss div by `batch_size * accum_grad`
loss /= train_conf.accum_grad loss /= train_conf.accum_grad
# update self.avg_train_loss # update self.avg_train_loss
...@@ -108,10 +210,15 @@ class Wav2Vec2ASRTrainer(Trainer): ...@@ -108,10 +210,15 @@ class Wav2Vec2ASRTrainer(Trainer):
context = nullcontext context = nullcontext
with context(): with context():
loss.backward() loss.backward()
layer_tools.print_grads(self.model, print_func=None) layer_tools.print_grads(self.model, print_func=None)
# optimizer step old # optimizer step old
if (batch_index + 1) % train_conf.accum_grad == 0: if (batch_index + 1) % train_conf.accum_grad == 0:
#do global grad clip
if train_conf.global_grad_clip != 0:
clip_grad_norm_(self.model.parameters(),
train_conf.global_grad_clip)
self.model_optimizer.step() self.model_optimizer.step()
self.model_optimizer.clear_grad() self.model_optimizer.clear_grad()
if not train_conf.freeze_wav2vec2: if not train_conf.freeze_wav2vec2:
...@@ -123,10 +230,12 @@ class Wav2Vec2ASRTrainer(Trainer): ...@@ -123,10 +230,12 @@ class Wav2Vec2ASRTrainer(Trainer):
if not train_conf.freeze_wav2vec2: if not train_conf.freeze_wav2vec2:
self.wav2vec2_lr_scheduler.step() self.wav2vec2_lr_scheduler.step()
self.iteration += 1 self.iteration += 1
losses_np = {'loss': self.avg_train_loss * train_conf.accum_grad} losses_np = {'loss': self.avg_train_loss * train_conf.accum_grad}
iteration_time = time.time() - start iteration_time = time.time() - start
for k, v in losses_np.items(): for k, v in losses_np.items():
report(k, v) report(k, v)
report("loss_whitoutavg", float(loss))
report("batch_size", self.config.batch_size) report("batch_size", self.config.batch_size)
report("accum", train_conf.accum_grad) report("accum", train_conf.accum_grad)
report("step_cost", iteration_time) report("step_cost", iteration_time)
...@@ -148,24 +257,34 @@ class Wav2Vec2ASRTrainer(Trainer): ...@@ -148,24 +257,34 @@ class Wav2Vec2ASRTrainer(Trainer):
if not self.use_streamdata: if not self.use_streamdata:
logger.info( logger.info(
f"Valid Total Examples: {len(self.valid_loader.dataset)}") f"Valid Total Examples: {len(self.valid_loader.dataset)}")
valid_losses = defaultdict(list) valid_losses = {}
num_seen_utts = 1 step = 0
total_loss = 0.0 total_loss = 0.0
num_seen_utts = 1 # use update_average and no need for num_seen_utts here
for i, batch in enumerate(self.valid_loader): for i, batch in enumerate(self.valid_loader):
utt, wav, wavs_lens, target, target_lens = batch if self.use_sb:
wavs_lens_rate = wavs_lens / wav.shape[1] wav, wavs_lens_rate = batch['sig']
wav = wav[:, :, 0] target, target_lens_rate = batch['tokens']
target_lens = (target_lens_rate *
target.shape[1]).round().astype(paddle.int64)
else:
utt, wav, wavs_lens, target, target_lens = batch
wavs_lens_rate = wavs_lens / wav.shape[1]
wav = wav[:, :, 0]
loss = self.model(wav, wavs_lens_rate, target, target_lens) loss = self.model(wav, wavs_lens_rate, target, target_lens)
# use update_average
total_loss -= total_loss / (step + 1)
total_loss += loss / (step + 1)
if math.isfinite(float(loss)): if math.isfinite(float(loss)):
num_utts = batch[1].shape[0] step += 1
num_seen_utts += num_utts valid_losses['val_loss'] = float(loss)
total_loss += float(loss) * num_utts else:
valid_losses['val_loss'].append(float(loss)) logger.info('loss:{} in Nan or inf, error'.format(float(loss)))
if (i + 1) % self.config.log_interval == 0: if (i + 1) % self.config.log_interval == 0:
valid_dump = {k: np.mean(v) for k, v in valid_losses.items()} valid_losses['val_history_loss'] = float(total_loss)
valid_dump['val_history_loss'] = total_loss / num_seen_utts
# logging # logging
msg = f"Valid: Rank: {dist.get_rank()}, " msg = f"Valid: Rank: {dist.get_rank()}, "
...@@ -175,11 +294,11 @@ class Wav2Vec2ASRTrainer(Trainer): ...@@ -175,11 +294,11 @@ class Wav2Vec2ASRTrainer(Trainer):
msg += "batch: {}/{}, ".format(i + 1, msg += "batch: {}/{}, ".format(i + 1,
len(self.valid_loader)) len(self.valid_loader))
msg += ', '.join('{}: {:>.6f}'.format(k, v) msg += ', '.join('{}: {:>.6f}'.format(k, v)
for k, v in valid_dump.items()) for k, v in valid_losses.items())
logger.info(msg) logger.info(msg)
logger.info('Rank {} Val info val_loss {}'.format( logger.info(
dist.get_rank(), total_loss / num_seen_utts)) 'Rank {} Val info val_loss {}'.format(dist.get_rank(), total_loss))
return total_loss, num_seen_utts return total_loss, num_seen_utts
@mp_tools.rank_zero_only @mp_tools.rank_zero_only
...@@ -228,7 +347,7 @@ class Wav2Vec2ASRTrainer(Trainer): ...@@ -228,7 +347,7 @@ class Wav2Vec2ASRTrainer(Trainer):
logger.info("Saved scheduler state to {}".format(scheduler_path)) logger.info("Saved scheduler state to {}".format(scheduler_path))
info_path = re.sub('.pdparams$', '.json', params_path) info_path = re.sub('.pdparams$', '.json', params_path)
infos = {} if infos is None else infos infos = {} if infos is None else infos
with open(info_path, 'w') as fout: with open(info_path, 'w', encoding='utf8') as fout:
data = json.dumps(infos) data = json.dumps(infos)
fout.write(data) fout.write(data)
...@@ -245,7 +364,7 @@ class Wav2Vec2ASRTrainer(Trainer): ...@@ -245,7 +364,7 @@ class Wav2Vec2ASRTrainer(Trainer):
# lr will be restored from the optimizer ckpt # lr will be restored from the optimizer ckpt
resume_json_path = os.path.join(self.checkpoint_dir, resume_json_path = os.path.join(self.checkpoint_dir,
self.args.resume + '.json') self.args.resume + '.json')
with open(resume_json_path, 'r') as f: with open(resume_json_path, 'r', encoding='utf8') as f:
resume_json = json.load(f) resume_json = json.load(f)
self.iteration = 0 self.iteration = 0
self.epoch = resume_json["epoch"] self.epoch = resume_json["epoch"]
...@@ -340,14 +459,13 @@ class Wav2Vec2ASRTrainer(Trainer): ...@@ -340,14 +459,13 @@ class Wav2Vec2ASRTrainer(Trainer):
total_loss, num_seen_utts = self.valid() total_loss, num_seen_utts = self.valid()
if dist.get_world_size() > 1: if dist.get_world_size() > 1:
num_seen_utts = paddle.to_tensor(num_seen_utts) num_seen_utts = paddle.to_tensor(num_seen_utts)
# the default operator in all_reduce function is sum.
dist.all_reduce(num_seen_utts) dist.all_reduce(num_seen_utts)
total_loss = paddle.to_tensor(total_loss) total_loss = paddle.to_tensor(total_loss)
dist.all_reduce(total_loss) dist.all_reduce(total_loss)
cv_loss = total_loss / num_seen_utts cv_loss = total_loss / num_seen_utts
cv_loss = float(cv_loss) cv_loss = float(cv_loss)
else: else:
cv_loss = total_loss / num_seen_utts cv_loss = float(total_loss)
logger.info( logger.info(
'Epoch {} Val info val_loss {}'.format(self.epoch, cv_loss)) 'Epoch {} Val info val_loss {}'.format(self.epoch, cv_loss))
if self.visualizer: if self.visualizer:
...@@ -368,45 +486,183 @@ class Wav2Vec2ASRTrainer(Trainer): ...@@ -368,45 +486,183 @@ class Wav2Vec2ASRTrainer(Trainer):
if not self.config.freeze_wav2vec2: if not self.config.freeze_wav2vec2:
self.wav2vec2_lr_scheduler.step(cv_loss) self.wav2vec2_lr_scheduler.step(cv_loss)
self.save(tag=self.epoch, infos={'val_loss': cv_loss}) self.save(tag=self.epoch, infos={'val_loss': cv_loss})
self.avg_train_loss = 0.0
self.new_epoch() self.new_epoch()
def dataio_prepare(self, hparams):
"""This function prepares the datasets to be used in the brain class.
It also defines the data processing pipeline through user-defined functions."""
data_folder = hparams["data_folder"]
train_data = dataset.DynamicItemDataset.from_csv(
csv_path=hparams["train_data"],
replacements={"data_root": data_folder}, )
if hparams["sorting"] == "ascending":
# we sort training data to speed up training and get better results.
train_data = train_data.filtered_sorted(sort_key="duration")
# when sorting, do not shuffle in the dataloader; otherwise sorting is pointless
hparams["train_dataloader_opts"]["shuffle"] = False
elif hparams["sorting"] == "descending":
train_data = train_data.filtered_sorted(
sort_key="duration", reverse=True)
# when sorting, do not shuffle in the dataloader; otherwise sorting is pointless
hparams["train_dataloader_opts"]["shuffle"] = False
elif hparams["sorting"] == "random":
pass
else:
raise NotImplementedError(
"sorting must be random, ascending or descending")
valid_data = dataset.DynamicItemDataset.from_csv(
csv_path=hparams["valid_data"],
replacements={"data_root": data_folder}, )
valid_data = valid_data.filtered_sorted(sort_key="duration")
test_data = dataset.DynamicItemDataset.from_csv(
csv_path=hparams["test_data"],
replacements={"data_root": data_folder}, )
test_data = test_data.filtered_sorted(sort_key="duration")
datasets = [train_data, valid_data, test_data]
# Defining tokenizer and loading it
tokenizer = transformers.BertTokenizer.from_pretrained(
'bert-base-chinese')
self.tokenizer = tokenizer
# 2. Define audio pipeline:
@data_pipeline.takes("wav")
@data_pipeline.provides("sig")
def audio_pipeline(wav):
sig = dataio.read_audio(wav)
return sig
dataset.add_dynamic_item(datasets, audio_pipeline)
# 3. Define text pipeline:
@data_pipeline.takes("transcript")
@data_pipeline.provides("wrd", "tokens_list", "tokens")
def text_pipeline(wrd):
wrd = "".join(wrd.split(" "))
yield wrd
tokens_list = tokenizer(wrd)["input_ids"]
yield tokens_list
tokens = np.array(tokens_list, dtype="int64")
# tokens = paddle.to_tensor(tokens_list, dtype="int64")
yield tokens
dataset.add_dynamic_item(datasets, text_pipeline)
# 4. Set output:
dataset.set_output_keys(
datasets,
["id", "sig", "wrd", "tokens"], )
# 5. If Dynamic Batching is used, we instantiate the needed samplers.
train_batch_sampler = None
valid_batch_sampler = None
if hparams["dynamic_batching"]:
from sampler import DynamicBatchSampler # noqa
dynamic_hparams = hparams["dynamic_batch_sampler"]
num_buckets = dynamic_hparams["num_buckets"]
train_batch_sampler = DynamicBatchSampler(
train_data,
dynamic_hparams["max_batch_len"],
num_buckets=num_buckets,
length_func=lambda x: x["duration"],
shuffle=dynamic_hparams["shuffle_ex"],
batch_ordering=dynamic_hparams["batch_ordering"], )
valid_batch_sampler = DynamicBatchSampler(
valid_data,
dynamic_hparams["max_batch_len"],
num_buckets=num_buckets,
length_func=lambda x: x["duration"],
shuffle=dynamic_hparams["shuffle_ex"],
batch_ordering=dynamic_hparams["batch_ordering"], )
return (train_data, valid_data, test_data, tokenizer,
train_batch_sampler, valid_batch_sampler, )
def setup_dataloader(self): def setup_dataloader(self):
config = self.config.clone() config = self.config.clone()
self.use_streamdata = config.get("use_stream_data", False) self.use_streamdata = config.get("use_stream_data", False)
if self.train: self.use_sb = config.use_sb_pipeline
self.train_loader = DataLoaderFactory.get_dataloader( if self.use_sb:
'train', config, self.args) hparams_file = config.sb_pipeline_conf
self.valid_loader = DataLoaderFactory.get_dataloader( with open(hparams_file, 'r', encoding='utf8') as fin:
'valid', config, self.args) hparams = load_hyperpyyaml(fin, None)
logger.info("Setup train/valid Dataloader!")
(train_data, valid_data, test_data, tokenizer, train_bsampler,
valid_bsampler, ) = self.dataio_prepare(hparams)
train_dataloader_opts = hparams["train_dataloader_opts"]
valid_dataloader_opts = hparams["valid_dataloader_opts"]
if train_bsampler is not None:
train_dataloader_opts = {
"batch_sampler": train_bsampler,
"num_workers": hparams["num_workers"],
}
if valid_bsampler is not None:
valid_dataloader_opts = {"batch_sampler": valid_bsampler}
if self.train:
self.train_loader = make_dataloader(
train_data, stage='train', **train_dataloader_opts)
self.valid_loader = make_dataloader(
valid_data,
stage='val',
**valid_dataloader_opts, )
logger.info("Setup train/valid Dataloader!")
else:
self.test_loader = make_dataloader(
test_data, stage='test', **hparams["test_dataloader_opts"])
else: else:
decode_batch_size = config.get('decode', dict()).get( if self.train:
'decode_batch_size', 1) self.train_loader = DataLoaderFactory.get_dataloader(
self.test_loader = DataLoaderFactory.get_dataloader('test', config, 'train', config, self.args)
self.args) self.valid_loader = DataLoaderFactory.get_dataloader(
self.align_loader = DataLoaderFactory.get_dataloader( 'valid', config, self.args)
'align', config, self.args) logger.info("Setup train/valid Dataloader!")
logger.info("Setup test/align Dataloader!") else:
decode_batch_size = config.get('decode', dict()).get(
'decode_batch_size', 1)
self.test_loader = DataLoaderFactory.get_dataloader(
'test', config, self.args)
self.align_loader = DataLoaderFactory.get_dataloader(
'align', config, self.args)
logger.info("Setup test/align Dataloader!")
def setup_model(self): def setup_model(self):
config = self.config config = self.config
model_conf = config model_conf = config
with UpdateConfig(model_conf): with UpdateConfig(model_conf):
if self.train: if self.use_sb:
model_conf.input_dim = self.train_loader.feat_dim model_conf.output_dim = self.tokenizer.vocab_size
model_conf.output_dim = self.train_loader.vocab_size
else: else:
model_conf.input_dim = self.test_loader.feat_dim if self.train:
model_conf.output_dim = self.test_loader.vocab_size model_conf.input_dim = self.train_loader.feat_dim
model_conf.output_dim = self.train_loader.vocab_size
else:
model_conf.input_dim = self.test_loader.feat_dim
model_conf.output_dim = self.test_loader.vocab_size
model = Wav2vec2ASR.from_config(model_conf) model = Wav2vec2ASR.from_config(model_conf)
model_dict = paddle.load(config.wav2vec2_params_path) model_dict = paddle.load(config.wav2vec2_params_path)
model.wav2vec2.set_state_dict(model_dict) model.wav2vec2.set_state_dict(model_dict)
if self.parallel: if self.parallel:
model = paddle.DataParallel(model, find_unused_parameters=True) model = paddle.DataParallel(model, find_unused_parameters=True)
logger.info(f"{model}")
layer_tools.print_params(model, logger.info) layer_tools.print_params(model, logger.info)
self.model = model self.model = model
logger.info("Setup model!") logger.info("Setup model!")
...@@ -422,8 +678,11 @@ class Wav2Vec2ASRTrainer(Trainer): ...@@ -422,8 +678,11 @@ class Wav2Vec2ASRTrainer(Trainer):
train_config = config train_config = config
model_optim_type = train_config.model_optim model_optim_type = train_config.model_optim
model_optim_conf = train_config.model_optim_conf model_optim_conf = train_config.model_optim_conf
wav2vec2_optim_type = train_config.model_optim logger.info("optim_model:{},{}", model_optim_type, model_optim_conf)
wav2vec2_optim_type = train_config.wav2vec2_optim
wav2vec2_optim_conf = train_config.wav2vec2_optim_conf wav2vec2_optim_conf = train_config.wav2vec2_optim_conf
logger.info("optim_model:{},{}", wav2vec2_optim_type,
wav2vec2_optim_conf)
model_scheduler_type = train_config.model_scheduler model_scheduler_type = train_config.model_scheduler
model_scheduler_conf = train_config.model_scheduler_conf model_scheduler_conf = train_config.model_scheduler_conf
...@@ -449,11 +708,8 @@ class Wav2Vec2ASRTrainer(Trainer): ...@@ -449,11 +708,8 @@ class Wav2Vec2ASRTrainer(Trainer):
optim_conf, optim_conf,
parameters, parameters,
lr_scheduler=None, ): lr_scheduler=None, ):
train_config = config
optim_arg = dict(optim_conf) optim_arg = dict(optim_conf)
optim_arg.update({ optim_arg.update({
"grad_clip":
train_config.global_grad_clip,
"learning_rate": "learning_rate":
lr_scheduler if lr_scheduler else optim_conf.lr, lr_scheduler if lr_scheduler else optim_conf.lr,
"parameters": "parameters":
...@@ -475,10 +731,12 @@ class Wav2Vec2ASRTrainer(Trainer): ...@@ -475,10 +731,12 @@ class Wav2Vec2ASRTrainer(Trainer):
'params': 'params':
model.ctc.parameters() model.ctc.parameters()
}], model_lr_scheduler) }], model_lr_scheduler)
wav2vec2_optimizer_args = optimizer_args( wav2vec2_optimizer_args = optimizer_args(
config, wav2vec2_optim_type, wav2vec2_optim_conf, config, wav2vec2_optim_type, wav2vec2_optim_conf,
model._layers.wav2vec2.parameters() if self.parallel else model._layers.wav2vec2.parameters() if self.parallel else
model.wav2vec2.parameters(), wav2vec2_lr_scheduler) model.wav2vec2.parameters(), wav2vec2_lr_scheduler)
model_optimizer = OptimizerFactory.from_args(model_optim_type, model_optimizer = OptimizerFactory.from_args(model_optim_type,
model_optimizer_args) model_optimizer_args)
wav2vec2_optimizer = OptimizerFactory.from_args(wav2vec2_optim_type, wav2vec2_optimizer = OptimizerFactory.from_args(wav2vec2_optim_type,
...@@ -507,12 +765,7 @@ class Wav2Vec2ASRTester(Wav2Vec2ASRTrainer): ...@@ -507,12 +765,7 @@ class Wav2Vec2ASRTester(Wav2Vec2ASRTrainer):
trans.append(self.text_featurizer.defeaturize(ids.numpy().tolist())) trans.append(self.text_featurizer.defeaturize(ids.numpy().tolist()))
return trans return trans
def compute_metrics(self, def compute_metrics(self, id, audio, audio_len, texts, texts_len,
utts,
audio,
audio_len,
texts,
texts_len,
fout=None): fout=None):
decode_cfg = self.config.decode decode_cfg = self.config.decode
errors_sum, len_refs, num_ins = 0.0, 0, 0 errors_sum, len_refs, num_ins = 0.0, 0, 0
...@@ -529,7 +782,7 @@ class Wav2Vec2ASRTester(Wav2Vec2ASRTrainer): ...@@ -529,7 +782,7 @@ class Wav2Vec2ASRTester(Wav2Vec2ASRTrainer):
decode_time = time.time() - start_time decode_time = time.time() - start_time
for utt, target, result, rec_tids in zip( for utt, target, result, rec_tids in zip(
utts, target_transcripts, result_transcripts, result_tokenids): id, target_transcripts, result_transcripts, result_tokenids):
errors, len_ref = errors_func(target, result) errors, len_ref = errors_func(target, result)
errors_sum += errors errors_sum += errors
len_refs += len_ref len_refs += len_ref
...@@ -556,6 +809,49 @@ class Wav2Vec2ASRTester(Wav2Vec2ASRTrainer): ...@@ -556,6 +809,49 @@ class Wav2Vec2ASRTester(Wav2Vec2ASRTrainer):
num_frames=audio_len.sum().numpy().item(), num_frames=audio_len.sum().numpy().item(),
decode_time=decode_time) decode_time=decode_time)
def sb_compute_metrics(self, id, sig, wrd, tokens, fout=None):
decode_cfg = self.config.decode
errors_sum, len_refs, num_ins = 0.0, 0, 0
errors_func = error_rate.char_errors if decode_cfg.error_rate_type == 'cer' else error_rate.word_errors
error_rate_func = error_rate.cer if decode_cfg.error_rate_type == 'cer' else error_rate.wer
start_time = time.time()
target_transcripts = wrd
result_transcripts, result_tokenids = self.model.decode(
sig[0],
text_feature=self.tokenizer,
decoding_method=decode_cfg.decoding_method,
beam_size=decode_cfg.beam_size,
sb_pipeline=True)
decode_time = time.time() - start_time
for utt, target, result, rec_tids in zip(
id, target_transcripts, result_transcripts, result_tokenids):
errors, len_ref = errors_func(target, result)
errors_sum += errors
len_refs += len_ref
num_ins += 1
if fout:
fout.write({
"utt": utt,
"refs": [target],
"hyps": [result],
"hyps_tokenid": [rec_tids],
})
logger.info(f"Utt: {utt}")
logger.info(f"Ref: {target}")
logger.info(f"Hyp: {result}")
logger.info("One example error rate [%s] = %f" % (
decode_cfg.error_rate_type, error_rate_func(target, result)))
return dict(
errors_sum=errors_sum,
len_refs=len_refs,
num_ins=num_ins, # num examples
error_rate=errors_sum / len_refs,
error_rate_type=decode_cfg.error_rate_type,
num_frames=sig[1].sum().numpy().item(),
decode_time=decode_time)
@mp_tools.rank_zero_only @mp_tools.rank_zero_only
@paddle.no_grad() @paddle.no_grad()
def test(self): def test(self):
...@@ -571,9 +867,13 @@ class Wav2Vec2ASRTester(Wav2Vec2ASRTrainer): ...@@ -571,9 +867,13 @@ class Wav2Vec2ASRTester(Wav2Vec2ASRTrainer):
vocab_list = self.vocab_list vocab_list = self.vocab_list
decode_batch_size = decode_cfg.decode_batch_size decode_batch_size = decode_cfg.decode_batch_size
with jsonlines.open(self.args.result_file, 'w') as fout: with jsonlines.open(
self.args.result_file, 'w', encoding='utf8') as fout:
for i, batch in enumerate(self.test_loader): for i, batch in enumerate(self.test_loader):
metrics = self.compute_metrics(*batch, fout=fout) if self.use_sb:
metrics = self.sb_compute_metrics(**batch, fout=fout)
else:
metrics = self.compute_metrics(*batch, fout=fout)
num_frames += metrics['num_frames'] num_frames += metrics['num_frames']
num_time += metrics["decode_time"] num_time += metrics["decode_time"]
errors_sum += metrics['errors_sum'] errors_sum += metrics['errors_sum']
...@@ -595,7 +895,7 @@ class Wav2Vec2ASRTester(Wav2Vec2ASRTrainer): ...@@ -595,7 +895,7 @@ class Wav2Vec2ASRTester(Wav2Vec2ASRTrainer):
err_meta_path = os.path.splitext(self.args.result_file)[0] + '.err' err_meta_path = os.path.splitext(self.args.result_file)[0] + '.err'
err_type_str = "{}".format(error_rate_type) err_type_str = "{}".format(error_rate_type)
with open(err_meta_path, 'w') as f: with open(err_meta_path, 'w', encoding='utf8') as f:
data = json.dumps({ data = json.dumps({
"epoch": "epoch":
self.epoch, self.epoch,
......
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Copyright (c) 2023 speechbrain Authors. All Rights Reserved.
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Modified from speechbrain 2023 (https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/dataio/batch.py)
"""Batch collation
Authors
* Aku Rouhe 2020
"""
import collections
import paddle
from paddlespeech.s2t.io.speechbrain.data_utils import batch_pad_right
from paddlespeech.s2t.io.speechbrain.data_utils import mod_default_collate
PaddedData = collections.namedtuple("PaddedData", ["data", "lengths"])
class PaddedBatch:
"""Collate_fn when examples are dicts and have variable-length sequences.
Different elements in the examples get matched by key.
All numpy tensors get converted to paddle.Tensor
Then, by default, all paddle.Tensor valued elements get padded and support
collective pin_memory() and to() calls.
Regular Python data types are just collected in a list.
Arguments
---------
examples : list
List of example dicts, as produced by Dataloader.
padded_keys : list, None
(Optional) List of keys to pad on. If None, pad all paddle.Tensors
device_prep_keys : list, None
(Optional) Only these keys participate in collective memory pinning and moving with
to().
If None, defaults to all items with paddle.Tensor values.
padding_func : callable, optional
Called with a list of tensors to be padded together. Needs to return
two tensors: the padded data, and another tensor for the data lengths.
padding_kwargs : dict
(Optional) Extra kwargs to pass to padding_func. E.G. mode, value
nonpadded_stack : bool
Whether to apply Tensor stacking on values that didn't get padded.
This stacks if it can, but doesn't error out if it cannot.
Default:True, usually does the right thing.
"""
def __init__(
self,
examples,
padded_keys=None,
device_prep_keys=None,
padding_func=batch_pad_right,
padding_kwargs={},
nonpadded_stack=True, ):
self.__length = len(examples)
self.__keys = list(examples[0].keys())
self.__padded_keys = []
self.__device_prep_keys = []
for key in self.__keys:
values = [example[key] for example in examples]
# Default convert usually does the right thing (numpy2tensor etc.)
values = paddle.to_tensor(values)
if (padded_keys is not None and key in padded_keys) or (
padded_keys is None and
isinstance(values[0], paddle.Tensor)):
# Padding and PaddedData
self.__padded_keys.append(key)
padded = PaddedData(*padding_func(values, **padding_kwargs))
setattr(self, key, padded)
else:
if nonpadded_stack:
values = mod_default_collate(values)
setattr(self, key, values)
if (device_prep_keys is not None and key in device_prep_keys) or (
device_prep_keys is None and
isinstance(values[0], paddle.Tensor)):
self.__device_prep_keys.append(key)
def __len__(self):
return self.__length
def __getitem__(self, key):
if key in self.__keys:
return getattr(self, key)
else:
raise KeyError(f"Batch doesn't have key: {key}")
def __iter__(self):
"""Iterates over the different elements of the batch.
"""
return iter((getattr(self, key) for key in self.__keys))
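To make the `PaddedData(data, lengths)` pair concrete, here is a minimal pure-Python sketch of right-padding. It assumes the padding function reports each length as a fraction of the padded width, which is what the `wavs_lens_rate` and `target_lens_rate` names in the trainer suggest; `pad_right` is a hypothetical stand-in and not the actual `batch_pad_right`.

```python
def pad_right(sequences, pad_value=0):
    """Pad lists to the longest length; return (padded, relative_lengths)."""
    max_len = max(len(s) for s in sequences)
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in sequences]
    # Each length is expressed as a fraction of the padded width.
    rel_lens = [len(s) / max_len for s in sequences]
    return padded, rel_lens


tokens = [[7, 8, 9, 10], [5, 6], [1, 2, 3]]
padded, rel_lens = pad_right(tokens)
# Recover absolute lengths the way the trainer computes `target_lens`.
abs_lens = [int(round(r * len(padded[0]))) for r in rel_lens]
```

Storing relative lengths keeps them valid if the padded tensor is later resampled along the time axis; the trainer multiplies by `target.shape[1]` and rounds when it needs integer token counts.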
# Copyright (c) 2023 speechbrain Authors. All Rights Reserved.
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Modified from speechbrain 2023 (https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/utils/data_pipeline.py)
"""A pipeline for data transformations.
Author:
* Aku Rouhe
"""
import inspect
from dataclasses import dataclass
from paddlespeech.s2t.io.speechbrain.depgraph import DependencyGraph
@dataclass
class StaticItem:
"""Data class that represents a static item.
Static items are in-memory items so they don't need to be computed
dynamically.
"""
key: str
class DynamicItem:
"""Essentially represents a data transformation function.
A DynamicItem takes some arguments and computes its value dynamically when
called. A straight-forward use-case is to load something from disk
dynamically; take the path and provide the loaded data.
Instances of this class are often created implicitly via the
@takes and @provides decorators or otherwise from specifying the taken and
provided arguments and the function.
A counterpart is the GeneratorDynamicItem, which should be used for
generator functions.
Arguments
---------
takes : list
The keys of the items that this needs to compute its output.
func : callable
The function that is used to compute the output.
provides : list
The keys that this provides.
"""
def __init__(self, takes=[], func=None, provides=[]):
self.takes = takes
self.func = func
self.provides = provides
def __call__(self, *args):
return self.func(*args)
# The next methods are more about supporting GeneratorDynamicItems
def next_takes(self):
"""The next argkeys to provide to this, when called."""
# Regular function DynamicItems always just need the same set of args
return self.takes
def next_provides(self):
"""The next keys that this provides, when called."""
# Regular function DynamicItems always just provide the same set of keys
return self.provides
def provided_in_order(self):
"""Assuming that this may need to be called multiple times; which keys
does it provide at that call. Returns a list, with len equal to the
number of times that this may be called."""
# Regular function DynamicItems are only called once:
return [self.provides]
def reset(self):
"""Signals that this will not be called any more times on this pipeline
call."""
# Regular function DynamicItems don't need special resets.
pass
class GeneratorDynamicItem(DynamicItem):
"""Essentially represents a multi-step data transformation.
This is the generator function counterpart for DynamicItem (which should be
used for regular functions).
A GeneratorDynamicItem first takes some arguments and then uses those in
multiple steps to incrementally compute some values when called.
A typical use-case is a pipeline of transformations on data: e.g. taking in
text as a string, providing a tokenized version on the first call, and then on the second
call providing an integer-encoded version. This can be used even though the
integer-encoder needs to be trained on the first outputs.
The main benefit is to be able to define the pipeline in a clear function,
even if parts of the pipeline depend on others for their initialization.
"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
# Doesn't generate electricity, only stores the currently active
# generator:
self.current_generator = None
self.num_provided_items = 0
def __call__(self, *args):
if self.num_provided_items == len(self.provides):
raise RuntimeError("DynamicItemPipeline called too many times!")
if not self.current_generator:
self.current_generator = self.func(*args)
# NOTE: Not supporting sending new values to the pipeline.
out = next(self.current_generator)
self.num_provided_items += 1
return out
def next_takes(self):
"""The next argkeys to provide to this, when called."""
if not self.current_generator:
return self.takes
else:
return []
def next_provides(self):
"""The next keys that this provides, when called."""
keys = self.provides[self.num_provided_items]
# Support multiple yielded values like:
# @provides("wav_read", ["left_ch", "right_ch"])
if isinstance(keys, str):
return [keys]
else:
return keys
def provided_in_order(self):
"""Assuming that this may need to be called multiple times; which keys
does it provide at that call. Returns a list, with len equal to the
number of times that this may be called."""
in_order = []
for keys in self.provides:
# Support multiple yielded values like:
# @provides("wav_read", ["left_ch", "right_ch"])
if isinstance(keys, str):
in_order.append([keys])
else:
in_order.append(keys)
return in_order
def reset(self):
"""Signals that this will not be called any more times on this pipeline
call."""
if self.current_generator is not None:
self.current_generator.close()
self.current_generator = None
self.num_provided_items = 0
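# A minimal, self-contained sketch of the multi-step idea behind
# GeneratorDynamicItem: a generator function is stepped once per provided key.
# `text_pipeline` and its toy vocabulary are hypothetical and only illustrate
# the mechanism; they are not part of this module.

```python
import inspect


def text_pipeline(text):
    """Hypothetical two-step pipeline: tokens first, then integer ids."""
    tokens = text.strip().lower().split()
    yield tokens  # first call provides the tokenized version
    # The "encoder" is built from the first output, as described above.
    vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
    yield [vocab[tok] for tok in tokens]  # second call provides the ids


# This is the same check the decorators use to pick the item class.
is_generator = inspect.isgeneratorfunction(text_pipeline)

gen = text_pipeline("  The cat saw the dog ")
tokens = next(gen)  # step 1
ids = next(gen)     # step 2
gen.close()         # what reset() does with the active generator
```

# Stepping with next() keeps the whole transformation in one readable
# function even though the second output depends on the first.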
def takes(*argkeys):
"""Decorator which makes a DynamicItem and specifies its argkeys.
If the wrapped object is a generator function (has a yield statement),
creates a GeneratorDynamicItem. If the object is already a DynamicItem,
just specifies the argkeys for that. Otherwise creates a new regular
DynamicItem, with argkeys specified.
The args are always passed to the function at the start. Generators could
support sending new arguments, but for such use cases, simply create a new
dynamic item. The GeneratorDynamicItem class is meant for pipelines which
take in an input and transform it in multiple ways, where the intermediate
representations may be needed for e.g. fitting a BPE segmenter.
Example
-------
>>> @takes("text")
... def tokenize(text):
... return text.strip().lower().split()
>>> tokenize.provides = ["tokenized"]
>>> tokenize('\tThis Example gets tokenized')
['this', 'example', 'gets', 'tokenized']
"""
def decorator(obj):
"""Decorator definition."""
if isinstance(obj, DynamicItem):
if obj.takes:
raise ValueError("Can't overwrite DynamicItem.takes")
obj.takes = argkeys
return obj
elif inspect.isgeneratorfunction(obj):
return GeneratorDynamicItem(takes=argkeys, func=obj)
else:
return DynamicItem(takes=argkeys, func=obj)
return decorator
takes_decorator = takes # Just for DataPipeline.add_dynamic_item
def provides(*output_keys):
"""Decorator which makes a DynamicItem and specifies what keys it provides.
If the wrapped object is a generator function (has a yield statement),
creates a GeneratorDynamicItem. If the object is already a DynamicItem,
just specifies the provided keys for that. Otherwise creates a new regular
DynamicItem, with provided keys specified.
NOTE
----
The behavior is slightly different for generators and regular functions, if
many output keys are specified, e.g. @provides("signal", "mfcc"). Regular
functions should return a tuple with len equal to len(output_keys), while
generators should yield the items one by one.
>>> @provides("signal", "feat")
... def read_feat():
... wav = [.1,.2,-.1]
... feat = [s**2 for s in wav]
... return wav, feat
>>> @provides("signal", "feat")
... def read_feat():
... wav = [.1,.2,-.1]
... yield wav
... feat = [s**2 for s in wav]
... yield feat
If multiple keys are yielded at once, write e.g.,
>>> @provides("wav_read", ["left_channel", "right_channel"])
... def read_multi_channel():
... wav = [[.1,.2,-.1],[.2,.1,-.1]]
... yield wav
... yield wav[0], wav[1]
"""
def decorator(obj):
"""Decorator definition."""
if isinstance(obj, DynamicItem):
if obj.provides:
raise ValueError("Can't overwrite DynamicItem provides-list.")
obj.provides = output_keys
return obj
elif inspect.isgeneratorfunction(obj):
return GeneratorDynamicItem(func=obj, provides=output_keys)
else:
return DynamicItem(func=obj, provides=output_keys)
return decorator
provides_decorator = provides # Just for DataPipeline.add_dynamic_item
class DataPipeline:
"""Organises data transformations into a pipeline.
Example
-------
>>> pipeline = DataPipeline(
... static_data_keys=["text"],
... dynamic_items=[
... {"func": lambda x: x.lower(), "takes": "text", "provides": "foo"},
... {"func": lambda x: x[::-1], "takes": "foo", "provides": "bar"},
... ],
... output_keys=["bar"],
... )
>>> pipeline({"text": "Test"})
{'bar': 'tset'}
"""
def __init__(self, static_data_keys, dynamic_items=[], output_keys=[]):
self.dg = DependencyGraph()
self._exec_order = None
self.key_to_node = {}
self.unaccounted_keys = {}
self.dynamic_items = []
self.output_mapping = {}
self.add_static_keys(static_data_keys)
self.add_dynamic_items(dynamic_items)
self.set_output_keys(output_keys)
def add_static_keys(self, static_keys):
"""Informs the pipeline about static items.
Static items are the ones provided to __call__ as data.
"""
for key in static_keys:
node_id = self.dg.add_node(data=StaticItem(key=key))
self.key_to_node[key] = node_id
def add_dynamic_items(self, dynamic_items):
"""Add multiple dynamic items at once."""
for item in dynamic_items:
try:
self.add_dynamic_item(**item)
except TypeError:
self.add_dynamic_item(item)
def add_dynamic_item(self, func, takes=None, provides=None):
"""Adds a dynamic item to the Pipeline.
Two calling conventions. For DynamicItem objects, just use:
add_dynamic_item(dynamic_item)
But otherwise, should use:
add_dynamic_item(func, takes, provides)
Arguments
---------
func : callable, DynamicItem
If a DynamicItem is given, adds that directly. Otherwise a
DynamicItem is created, and this specifies the callable to use. If
a generator function is given, then create a GeneratorDynamicItem.
Otherwise creates a normal DynamicItem.
takes : list, str
List of keys. When func is called, each key is resolved to
either an entry in the data or the output of another dynamic_item.
The func is then called with these as positional arguments,
in the same order as specified here.
A single key can be given as a bare string.
provides : str, list
For regular functions, the key or list of keys that it provides.
If you give a generator function, key or list of keys that it
yields, in order. Also see the provides decorator.
A single key can be given as a bare string.
"""
if isinstance(func, DynamicItem):
if takes is not None or provides is not None:
raise ValueError("If providing a DynamicItem directly, don't "
"specify takes or provides")
else:
self._add_dynamic_item_object(func)
return
if isinstance(takes, str):
takes = [takes]
if isinstance(provides, str):
provides = [provides]
di = takes_decorator(*takes)(provides_decorator(*provides)(func))
self._add_dynamic_item_object(di)
def _add_dynamic_item_object(self, obj):
"""Internally adds the object.
There is a node in the dependency graph for each call of the
DynamicItem. Each call may return multiple keys and depend on multiple
keys. An internal dict maps key to the id of the node that produces it.
"""
if not obj.provides:
raise ValueError("Won't add redundant dynamic item which doesn't "
"provide anything.")
depended = []
for key in obj.takes:
# Might not be accounted for, yet:
if key not in self.key_to_node:
dependee_keys = self.unaccounted_keys.setdefault(key, [])
dependee_keys.extend(obj.next_provides())
else:
depended.append(self.key_to_node[key])
for provided in obj.provided_in_order():
node_id = self.dg.add_node(data=obj)
for key in provided:
self.key_to_node[key] = node_id
# This key may also be unaccounted for, so account for it now:
if key in self.unaccounted_keys:
for dependee_key in self.unaccounted_keys[key]:
dependee_node = self.key_to_node[dependee_key]
self.dg.add_edge(dependee_node, node_id)
del self.unaccounted_keys[key] # Now accounted for!
for dep_id in depended:
self.dg.add_edge(node_id, dep_id)
# Next call will depend on this call:
depended = [node_id]
# Keep a reference to the item in this object, as well:
self.dynamic_items.append(obj)
def set_output_keys(self, keys):
"""Use this to change the output keys.
Also re-evaluates execution order.
So if you request different outputs, some parts of the
data pipeline may be skipped.
Arguments
---------
keys : dict, list, None
List of keys (str) to produce in output.
If a dict is given, it maps output keys to internal keys: in each
key:value pair, the key is the name that appears in the output and the
value is the internal key.
"""
self.output_mapping = self._output_keys_to_mapping(keys)
self._exec_order = None
@staticmethod
def _output_keys_to_mapping(keys):
# Ensure a mapping (accept a list for convenience, too)
if keys is None:
output_mapping = {}
elif isinstance(keys, dict):
output_mapping = keys
else:
output_mapping = {key: key for key in keys}
return output_mapping
def compute_outputs(self, data):
"""
Arguments
---------
data : dict
Dictionary with data entries by key.
Returns
-------
dict
With the keys that were set.
"""
if self._exec_order is None:
self._prepare_run(data)
return self._compute(data, self._exec_order, self.output_mapping)
def compute_specific(self, keys, data):
"""Compute output of specific item, without changing output_keys."""
output_mapping = self._output_keys_to_mapping(keys)
order = self.dg.get_evaluation_order(
selected_keys=self.get_selected_node_ids(keys))
return self._compute(data, order, output_mapping)
def _compute(self, data, order, output_mapping):
if self.unaccounted_keys:
MSG = "These keys are still unaccounted for in the data pipeline: "
MSG += ", ".join(self.unaccounted_keys)
raise RuntimeError(MSG)
intermediate = {}
for node_id, edges, item in order:
if isinstance(item, StaticItem):
# Static item in data.
# Just check that key is found.
try:
data[item.key]
continue
except KeyError:
raise KeyError(f"Expected key {item.key} in data!")
# A dynamic item, which we should compute:
args = [
data[argkey] if argkey in data else intermediate[argkey]
for argkey in item.next_takes()
]
# This needs to be called BEFORE the dynamic item is called.
provided_keys = item.next_provides()
values = item(*args) # Call the DynamicItem to produce output
# If there is just one output value, wrap in a list so that
# it can be zipped as well:
if len(provided_keys) == 1:
values = [values]
intermediate.update(zip(provided_keys, values))
for dynamic_item in self.dynamic_items:
dynamic_item.reset()
return {
outkey: data[inkey] if inkey in data else intermediate[inkey]
for outkey, inkey in output_mapping.items()
}
def get_selected_node_ids(self, selected_keys):
"""Translates selected keys to dependency graph keys."""
return [self.key_to_node[key] for key in selected_keys]
def __call__(self, data):
return self.compute_outputs(data)
def _prepare_run(self, data):
self._exec_order = list(
self.dg.get_evaluation_order(
self.get_selected_node_ids(self.output_mapping.values())))
# Copyright (c) 2023 speechbrain Authors. All Rights Reserved.
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Modified from speechbrain 2023 (https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/utils/data_utils.py)
import collections.abc
import csv
import os
import pathlib
import re
import shutil
import urllib.request
import numpy as np
import paddle
import tqdm
def batch_pad_right(array: list, mode="constant", value=0):
"""Given a list of numpy arrays, batches them together by padding to the right
on each dimension so that all of them have the same shape.
Parameters
----------
array : list
List of arrays we wish to pad together.
mode : str
Padding mode see numpy.pad documentation.
value : float
Padding value see numpy.pad documentation.
Returns
-------
batched : numpy array
Padded numpy array.
valid_vals : numpy array
Proportion of original, non-padded values along the first dimension of each array.
"""
if not len(array):
raise IndexError("Tensors list must not be empty")
if len(array) == 1:
# if there is only one tensor in the batch we simply unsqueeze it.
return np.expand_dims(array[0], 0), np.array([1.0], dtype="float32")
if not all(
[array[i].ndim == array[0].ndim for i in range(1, len(array))]):
raise IndexError("All arrays must have the same number of dimensions")
# FIXME we limit the support here: we allow padding of only the first dimension
# need to remove this when feat extraction is updated to handle multichannel.
max_shape = []
for dim in range(array[0].ndim):
if dim != 0:
if not all(
[x.shape[dim] == array[0].shape[dim] for x in array[1:]]):
raise EnvironmentError(
"Tensors should have same dimensions except for the first one"
)
max_shape.append(max([x.shape[dim] for x in array]))
batched = []
valid = []
for t in array:
# for each tensor we apply pad_right_to
padded, valid_percent = pad_right_to(
t, max_shape, mode=mode, value=value)
batched.append(padded)
valid.append(valid_percent[0])
batched = np.stack(batched)
return batched, np.array(valid, dtype="float32")
np_str_obj_array_pattern = re.compile(r"[SaUO]")
def pad_right_to(
array: np.ndarray,
target_shape: (list, tuple),
mode="constant",
value=0, ):
"""
This function takes a numpy array of arbitrary shape and pads it to the
target shape by appending values on the right.
Parameters
----------
array : numpy array
Input array whose dimensions we need to pad.
target_shape : (list, tuple)
Target shape we want for the padded array. Its length must be equal to array.ndim.
mode : str
Pad mode, please refer to numpy.pad documentation.
value : float
Pad value, please refer to numpy.pad documentation.
Returns
-------
array : numpy array
Padded numpy array.
valid_vals : list
List containing proportion for each dimension of original, non-padded values.
"""
assert len(target_shape) == array.ndim
pads = []  # (before, after) padding widths for each axis, in axis order.
valid_vals = []  # this contains the relative lengths for each dimension.
for dim in range(len(target_shape)):
assert (target_shape[dim] >= array.shape[dim]
), "Target shape must be >= original shape for every dim"
# np.pad expects one (before, after) pair per axis; pad only on the right.
pads.append((0, target_shape[dim] - array.shape[dim]))
valid_vals.append(array.shape[dim] / target_shape[dim])
array = np.pad(array, pads, mode, constant_values=(value, value))
return array, valid_vals
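# A minimal sketch of the right-padding idea used by batch_pad_right and
# pad_right_to, restricted to 1-D Python lists so that it runs without numpy.
# `pad_right` is a hypothetical helper for illustration, not part of this module.

```python
def pad_right(seqs, value=0):
    """Pad every sequence on the right up to the longest one.

    Returns the padded batch and, for each sequence, the proportion of
    original (non-padded) values.
    """
    if not seqs:
        raise IndexError("Sequence list must not be empty")
    max_len = max(len(s) for s in seqs)
    padded = [list(s) + [value] * (max_len - len(s)) for s in seqs]
    rel_lens = [len(s) / max_len for s in seqs]
    return padded, rel_lens


padded, rel_lens = pad_right([[1, 2, 3, 4], [5, 6]])
```

# The relative lengths let later stages recover the true length of each
# sequence from the padded batch shape alone.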
def mod_default_collate(batch):
"""Makes a tensor from list of batch values.
Note that this doesn't need to zip(*) values together
as PaddedBatch connects them already (by key).
Here the idea is not to error out.
"""
elem = batch[0]
elem_type = type(elem)
if isinstance(elem, paddle.Tensor):
out = None
try:
if paddle.io.get_worker_info() is not None:
# If we're in a background process, concatenate directly into a
# shared memory tensor to avoid an extra copy
numel = sum([x.numel() for x in batch])
storage = elem.storage()._new_shared(numel)
out = elem.new(storage)
return paddle.stack(batch, 0, name=out)
except RuntimeError: # Unequal size:
return batch
elif (elem_type.__module__ == "numpy" and elem_type.__name__ != "str_" and
elem_type.__name__ != "string_"):
try:
if (elem_type.__name__ == "ndarray" or
elem_type.__name__ == "memmap"):
# array of string classes and object
if np_str_obj_array_pattern.search(elem.dtype.str) is not None:
return batch
return mod_default_collate(
[paddle.to_tensor(b, dtype=b.dtype) for b in batch])
elif elem.shape == (): # scalars
return paddle.to_tensor(batch, dtype=elem.dtype)
except RuntimeError: # Unequal size
return batch
elif isinstance(elem, float):
return paddle.to_tensor(batch, dtype=paddle.float64)
elif isinstance(elem, int):
return paddle.to_tensor(batch, dtype=paddle.int64)
else:
return batch
# Copyright (c) 2023 speechbrain Authors. All Rights Reserved.
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Modified from speechbrain 2023 (https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/dataio/dataio.py)
"""
Data reading and writing.
Authors
* Mirco Ravanelli 2020
* Aku Rouhe 2020
* Ju-Chieh Chou 2020
* Samuele Cornell 2020
* Abdel HEBA 2020
"""
import csv
import hashlib
import json
import logging
import os
import pickle
import re
import time
import numpy as np
import soundfile
logger = logging.getLogger(__name__)
import paddle
def load_data_json(json_path, replacements={}):
"""Loads JSON and recursively formats string values.
Arguments
----------
json_path : str
Path to JSON file.
replacements : dict
(Optional dict), e.g., {"data_folder": "/home/PaddleSpeech/data"}.
This is used to recursively format all string values in the data.
Returns
-------
dict
JSON data with replacements applied.
"""
with open(json_path, "r") as f:
out_json = json.load(f)
_recursive_format(out_json, replacements)
return out_json
def _recursive_format(data, replacements):
# Data: dict or list, replacements : dict
# Replaces string keys in replacements by their values
# at all levels of data (in str values)
# Works in-place.
if isinstance(data, dict):
for key, item in data.items():
if isinstance(item, dict) or isinstance(item, list):
_recursive_format(item, replacements)
elif isinstance(item, str):
data[key] = item.format_map(replacements)
# If not dict, list or str, do nothing
if isinstance(data, list):
for i, item in enumerate(data):
if isinstance(item, dict) or isinstance(item, list):
_recursive_format(item, replacements)
elif isinstance(item, str):
data[i] = item.format_map(replacements)
# If not dict, list or str, do nothing
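# A minimal sketch of the placeholder replacement performed on loaded JSON.
# Unlike _recursive_format, the hypothetical `fill` helper below returns a new
# structure instead of working in-place; the manifest content is made up.

```python
import json

raw = ('{"utt1": {"wav": "{data_folder}/utt1.wav", '
       '"length": 2.5, "tags": ["{lang}"]}}')
replacements = {"data_folder": "/home/PaddleSpeech/data", "lang": "zh"}


def fill(node):
    """Apply str.format_map to every string value, at any nesting level."""
    if isinstance(node, dict):
        return {key: fill(value) for key, value in node.items()}
    if isinstance(node, list):
        return [fill(value) for value in node]
    if isinstance(node, str):
        return node.format_map(replacements)
    return node  # numbers, booleans and None are left untouched


filled = fill(json.loads(raw))
```

# Non-string values such as the float length pass through unchanged, so only
# path-like entries are affected by the replacements.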
def load_data_csv(csv_path, replacements={}):
"""Loads CSV and formats string values.
Uses the legacy CSV data format, where the CSV must have an
'ID' field.
If there is a field called duration, it is interpreted as a float.
The rest of the fields are left as they are (legacy _format and _opts fields
are not used to load the data in any special way).
Bash-like string replacements with $to_replace are supported.
Arguments
----------
csv_path : str
Path to CSV file.
replacements : dict
(Optional dict), e.g., {"data_folder": "/home/PaddleSpeech/data"}
This is used to recursively format all string values in the data.
Returns
-------
dict
CSV data with replacements applied.
"""
with open(csv_path, newline="") as csvfile:
result = {}
reader = csv.DictReader(csvfile, skipinitialspace=True)
variable_finder = re.compile(r"\$([\w.]+)")
for row in reader:
# ID:
try:
data_id = row["ID"]
del row["ID"] # This is used as a key in result, instead.
except KeyError:
raise KeyError("CSV has to have an 'ID' field, with unique ids"
" for all data points")
if data_id in result:
raise ValueError(f"Duplicate id: {data_id}")
# Replacements:
for key, value in row.items():
try:
row[key] = variable_finder.sub(
lambda match: str(replacements[match[1]]), value)
except KeyError:
raise KeyError(f"The item {value} requires replacements "
"which were not supplied.")
# Duration:
if "duration" in row:
row["duration"] = float(row["duration"])
result[data_id] = row
return result
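# A minimal sketch of the CSV loading steps above, run on an in-memory CSV so
# it needs no files. The CSV content and the data folder are made-up examples.

```python
import csv
import io
import re

csv_text = ("ID, duration, wav\n"
            "utt1, 1.5, $data_folder/utt1.wav\n"
            "utt2, 2.0, $data_folder/utt2.wav\n")
replacements = {"data_folder": "/home/PaddleSpeech/data"}
variable_finder = re.compile(r"\$([\w.]+)")

result = {}
reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
for row in reader:
    data_id = row.pop("ID")  # the ID becomes the key of the result dict
    for key, value in row.items():
        # Replace every $name with the matching entry in replacements.
        row[key] = variable_finder.sub(
            lambda match: str(replacements[match[1]]), value)
    row["duration"] = float(row["duration"])
    result[data_id] = row
```

# A $name that is missing from replacements raises KeyError inside the
# substitution, which is why the loader wraps it in a try/except.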
def read_audio(waveforms_obj):
"""General audio loading, based on a custom notation.
Expected use case is in conjunction with Datasets
specified by JSON.
The custom notation:
The annotation can be just a path to a file:
"/path/to/wav1.wav"
Or can specify more options in a dict:
{"file": "/path/to/wav2.wav",
"start": 8000,
"stop": 16000
}
Arguments
----------
waveforms_obj : str, dict
Audio reading annotation, see above for format.
Returns
-------
numpy.ndarray
Audio array with shape: (samples, ).
"""
if isinstance(waveforms_obj, str):
audio, _ = soundfile.read(waveforms_obj, dtype="float32")
return audio
path = waveforms_obj["file"]
start = waveforms_obj.get("start", 0)
# Default stop to None -> if not specified, read until the end of the file
stop = waveforms_obj.get("stop", None)
audio, fs = soundfile.read(path, start=start, stop=stop, dtype="float32")
return audio
def read_audio_multichannel(waveforms_obj):
"""General audio loading, based on a custom notation.
Expected use case is in conjunction with Datasets
specified by JSON.
The custom notation:
The annotation can be just a path to a file:
"/path/to/wav1.wav"
Multiple (possibly multi-channel) files can be specified, as long as they
have the same length:
{"files": [
"/path/to/wav1.wav",
"/path/to/wav2.wav"
]
}
Or you can specify a single file more succinctly:
{"files": "/path/to/wav2.wav"}
The start and stop sample offsets can also be specified to read
only a segment within the files:
{"files": [
"/path/to/wav1.wav",
"/path/to/wav2.wav"
],
"start": 8000,
"stop": 16000
}
Arguments
----------
waveforms_obj : str, dict
Audio reading annotation, see above for format.
Returns
-------
paddle.Tensor
Audio tensor with shape: (samples, ).
"""
if isinstance(waveforms_obj, str):
audio, _ = soundfile.read(waveforms_obj, dtype="float32")
audio = paddle.to_tensor(audio)
return audio
files = waveforms_obj["files"]
if not isinstance(files, list):
files = [files]
waveforms = []
start = waveforms_obj.get("start", 0)
# Default stop to None -> if not specified, read until the end of each file
stop = waveforms_obj.get("stop", None)
for f in files:
audio, fs = soundfile.read(f, start=start, stop=stop, dtype="float32")
audio = paddle.to_tensor(audio)
waveforms.append(audio)
out = paddle.concat(waveforms, 0)
return out
def write_audio(filepath, audio, samplerate):
"""Write audio on disk. It is basically a wrapper to support saving
audio signals in format (audio, channels).
Arguments
---------
filepath: path
Path where to save the audio file.
audio : paddle.Tensor
Audio file in the expected format (signal, channels).
samplerate: int
Sample rate (e.g., 16000).
"""
if len(audio.shape) == 2:
audio = audio.transpose([1, 0])
elif len(audio.shape) == 1:
audio = audio.unsqueeze(0)
soundfile.write(filepath, audio, samplerate)
def load_pickle(pickle_path):
"""Utility function for loading .pkl pickle files.
Arguments
---------
pickle_path : str
Path to pickle file.
Returns
-------
out : object
Python object loaded from pickle.
"""
with open(pickle_path, "rb") as f:
out = pickle.load(f)
return out
def to_floatTensor(x: (list, tuple, np.ndarray)):
"""
Arguments
---------
x : (list, tuple, np.ndarray)
Input data to be converted to paddle float.
Returns
-------
tensor : paddle.tensor
Data now in paddle.tensor float datatype.
"""
return paddle.to_tensor(x, dtype='float32')
def to_doubleTensor(x: (list, tuple, np.ndarray)):
"""
Arguments
---------
x : (list, tuple, np.ndarray)
Input data to be converted to paddle double.
Returns
-------
tensor : paddle.tensor
Data now in paddle.tensor double datatype.
"""
return paddle.to_tensor(x, dtype='float64')
def to_longTensor(x: (list, tuple, np.ndarray)):
"""
Arguments
---------
x : (list, tuple, np.ndarray)
Input data to be converted to paddle long.
Returns
-------
tensor : paddle.tensor
Data now in paddle.tensor long datatype.
"""
return paddle.to_tensor(x, dtype='int64')
def convert_index_to_lab(batch, ind2lab):
"""Convert a batch of integer IDs to string labels.
Arguments
---------
batch : list
List of lists, a batch of sequences.
ind2lab : dict
Mapping from integer IDs to labels.
Returns
-------
list
List of lists, same size as batch, with labels from ind2lab.
"""
return [[ind2lab[int(index)] for index in seq] for seq in batch]
def relative_time_to_absolute(batch, relative_lens, rate):
"""Converts relative length to the absolute duration.
Operates on batch level.
Arguments
---------
batch : paddle.tensor
Sequences to determine the duration for.
relative_lens : paddle.tensor
The relative length of each sequence in batch. The longest sequence in
the batch needs to have relative length 1.0.
rate : float
The rate at which sequence elements occur in real-world time. Sample
rate, if batch is raw wavs (recommended) or 1/frame_shift if batch is
features. This has to have 1/s as the unit.
Returns
-------
paddle.tensor
Duration of each sequence in seconds.
"""
max_len = batch.shape[1]
durations = paddle.round(relative_lens * max_len) / rate
return durations
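# A worked example of the conversion above in plain Python, assuming a batch
# of raw waveforms padded to 32000 samples at a 16 kHz sample rate.
# `relative_to_seconds` is a hypothetical stand-in for the tensor version.

```python
def relative_to_seconds(relative_lens, max_len, rate):
    """duration = round(relative_len * max_len) / rate, per sequence."""
    return [round(rel * max_len) / rate for rel in relative_lens]


durations = relative_to_seconds([1.0, 0.5, 0.25], max_len=32000, rate=16000)
```

# Here the longest sequence has relative length 1.0 and lasts 2 seconds; the
# others scale proportionally.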
class IterativeCSVWriter:
"""Write CSV files a line at a time.
Arguments
---------
outstream : file-object
A writeable stream
data_fields : list
List of the optional keys to write. Each key will be expanded,
producing three fields: key, key_format, key_opts.
"""
def __init__(self, outstream, data_fields, defaults={}):
self._outstream = outstream
self.fields = ["ID", "duration"] + self._expand_data_fields(data_fields)
self.defaults = defaults
self._outstream.write(",".join(self.fields))
def set_default(self, field, value):
"""Sets a default value for the given CSV field.
Arguments
---------
field : str
A field in the CSV.
value
The default value.
"""
if field not in self.fields:
raise ValueError(f"{field} is not a field in this CSV!")
self.defaults[field] = value
def write(self, *args, **kwargs):
"""Writes one data line into the CSV.
Arguments
---------
*args
Supply every field with a value in positional form OR.
**kwargs
Supply certain fields by key. The ID field is mandatory for all
lines, but others can be left empty.
"""
if args and kwargs:
raise ValueError(
"Use either positional fields or named fields, but not both.")
if args:
if len(args) != len(self.fields):
raise ValueError("Need consistent fields")
to_write = [str(arg) for arg in args]
if kwargs:
if "ID" not in kwargs:
raise ValueError("I'll need to see some ID")
full_vals = self.defaults.copy()
full_vals.update(kwargs)
to_write = [str(full_vals.get(field, "")) for field in self.fields]
self._outstream.write("\n")
self._outstream.write(",".join(to_write))
def write_batch(self, *args, **kwargs):
"""Writes a batch of lines into the CSV.
Here each argument should be a list with the same length.
Arguments
---------
*args
Supply every field with a value in positional form OR.
**kwargs
Supply certain fields by key. The ID field is mandatory for all
lines, but others can be left empty.
"""
if args and kwargs:
raise ValueError(
"Use either positional fields or named fields, but not both.")
if args:
if len(args) != len(self.fields):
raise ValueError("Need consistent fields")
for arg_row in zip(*args):
self.write(*arg_row)
if kwargs:
if "ID" not in kwargs:
raise ValueError("I'll need to see some ID")
keys = kwargs.keys()
for value_row in zip(*kwargs.values()):
kwarg_row = dict(zip(keys, value_row))
self.write(**kwarg_row)
@staticmethod
def _expand_data_fields(data_fields):
expanded = []
for data_field in data_fields:
expanded.append(data_field)
expanded.append(data_field + "_format")
expanded.append(data_field + "_opts")
return expanded
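# A minimal sketch of how the CSV header is derived from data_fields: each
# field expands into three columns after the fixed ID and duration columns.
# The field names "wav" and "spk_id" are examples only.

```python
def expand_data_fields(data_fields):
    """Expand each field into field, field_format and field_opts."""
    expanded = []
    for data_field in data_fields:
        expanded.extend(
            [data_field, data_field + "_format", data_field + "_opts"])
    return expanded


fields = ["ID", "duration"] + expand_data_fields(["wav", "spk_id"])
header = ",".join(fields)
```

# A positional write() call therefore needs one value per expanded column,
# eight in this example.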
def write_txt_file(data, filename, sampling_rate=None):
"""Write data in text format.
Arguments
---------
data : str, list, paddle.tensor, numpy.ndarray
The data to write in the text file.
filename : str
Path to file where to write the data.
sampling_rate : None
Not used, just here for interface compatibility.
Returns
-------
None
"""
del sampling_rate # Not used.
# Check if the path of filename exists
os.makedirs(os.path.dirname(filename), exist_ok=True)
with open(filename, "w") as fout:
if isinstance(data, paddle.Tensor):
data = data.tolist()
if isinstance(data, np.ndarray):
data = data.tolist()
if isinstance(data, list):
for line in data:
print(line, file=fout)
if isinstance(data, str):
print(data, file=fout)
def write_stdout(data, filename=None, sampling_rate=None):
"""Write data to standard output.
Arguments
---------
data : str, list, paddle.Tensor, numpy.ndarray
The data to write in the text file.
filename : None
Not used, just here for compatibility.
sampling_rate : None
Not used, just here for compatibility.
Returns
-------
None
"""
# Managing paddle.Tensor
if isinstance(data, paddle.Tensor):
data = data.tolist()
# Managing np.ndarray
if isinstance(data, np.ndarray):
data = data.tolist()
if isinstance(data, list):
for line in data:
print(line)
if isinstance(data, str):
print(data)
def length_to_mask(length, max_len=None, dtype=None, device=None):
"""Creates a binary mask for each sequence.
Arguments
---------
length : LongTensor
Containing the length of each sequence in the batch. Must be 1D.
max_len : int
Max length for the mask, also the size of the second dimension.
dtype : dtype, default: None
The dtype of the generated mask.
device: device, default: None
The device to put the mask variable.
Returns
-------
mask : tensor
The binary mask.
"""
assert len(length.shape) == 1
if max_len is None:
max_len = int(length.max().item())  # using arange to generate mask
mask = paddle.arange(
max_len, dtype=length.dtype).expand(
[len(length), max_len]) < length.unsqueeze(1)
if dtype is None:
dtype = length.dtype
if device is None:
device = length.device
mask = paddle.to_tensor(mask, dtype=dtype)
return mask
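The masking rule used by `length_to_mask` can be illustrated without Paddle. The following is a minimal plain-Python sketch, not the function above: `length_to_mask_py` is a hypothetical name, and nested lists stand in for tensors.

```python
def length_to_mask_py(lengths, max_len=None):
    """Plain-Python stand-in for the mask built by `length_to_mask`."""
    if max_len is None:
        max_len = max(lengths)
    # Position j of row i is 1 while j < lengths[i], and 0 afterwards.
    return [[1 if j < n else 0 for j in range(max_len)] for n in lengths]


mask = length_to_mask_py([1, 3, 2])
for row in mask:
    print(row)
```

Each row corresponds to one sequence in the batch, and the number of columns equals the longest length unless `max_len` is given.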
def read_kaldi_lab(kaldi_ali, kaldi_lab_opts):
"""Read labels in kaldi format.
Uses kaldi IO.
Arguments
---------
kaldi_ali : str
Path to directory where kaldi alignments are stored.
kaldi_lab_opts : str
A string that contains the options for reading the kaldi alignments.
Returns
-------
lab : dict
A dictionary containing the labels.
Note
----
This depends on kaldi-io-for-python. Install it separately.
See: https://github.com/vesis84/kaldi-io-for-python
"""
# EXTRA TOOLS
try:
import kaldi_io
except ImportError:
raise ImportError("Could not import kaldi_io. Install it to use this.")
# Reading the Kaldi labels
lab = {
k: v
for k, v in kaldi_io.read_vec_int_ark(
"gunzip -c " + kaldi_ali + "/ali*.gz | " + kaldi_lab_opts + " " +
kaldi_ali + "/final.mdl ark:- ark:-|")
}
return lab
def get_md5(file):
"""Get the md5 checksum of an input file.
Arguments
---------
file : str
Path to the file for which to compute the checksum.
Returns
-------
md5
Checksum for the given filepath.
"""
# Read the file in 64 KiB chunks.
BUF_SIZE = 65536
md5 = hashlib.md5()
# Computing md5
with open(file, "rb") as f:
while True:
data = f.read(BUF_SIZE)
if not data:
break
md5.update(data)
return md5.hexdigest()
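As a sanity check on the chunked approach, the sketch below hashes a temporary file in 64 KiB chunks and compares the result with hashing the same bytes in one call. The helper name `md5_in_chunks` and the payload are illustrative assumptions; only the standard library is used.

```python
import hashlib
import os
import tempfile


def md5_in_chunks(path, buf_size=65536):
    """Hash a file without loading it into memory all at once."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: f.read(buf_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


payload = b"aishell" * 50000  # 350,000 bytes, so several chunks are read
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    tmp_path = tmp.name
try:
    chunked = md5_in_chunks(tmp_path)
    one_shot = hashlib.md5(payload).hexdigest()
finally:
    os.remove(tmp_path)

print(chunked == one_shot)
```

Feeding the digest incrementally gives the same checksum as a single call, which is why the chunk size only affects memory use and not the result.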
def save_md5(files, out_file):
"""Saves the md5 of a list of input files as a pickled dict into a file.
Arguments
---------
files : list
List of input files from which we will compute the md5.
out_file : str
The path where to store the output pkl file.
Returns
-------
None
"""
# Initialization of the dictionary
md5_dict = {}
# Computing md5 for all the files in the list
for file in files:
md5_dict[file] = get_md5(file)
# Saving dictionary in pkl format
save_pkl(md5_dict, out_file)
def save_pkl(obj, file):
"""Save an object in pkl format.
Arguments
---------
obj : object
Object to save in pkl format.
file : str
Path to the output file.
"""
with open(file, "wb") as f:
pickle.dump(obj, f)
def load_pkl(file):
"""Loads a pkl file.
For an example, see `save_pkl`.
Arguments
---------
file : str
Path to the input pkl file.
Returns
-------
The loaded object.
"""
# Deals with the situation where two processes are trying
# to access the same label dictionary by creating a lock
count = 100
while count > 0:
if os.path.isfile(file + ".lock"):
time.sleep(1)
count -= 1
else:
break
try:
open(file + ".lock", "w").close()
with open(file, "rb") as f:
return pickle.load(f)
finally:
if os.path.isfile(file + ".lock"):
os.remove(file + ".lock")
def prepend_bos_token(label, bos_index):
"""Create labels with <bos> token at the beginning.
Arguments
---------
label : IntTensor
Containing the original labels. Must be of size: [batch_size, max_length].
bos_index : int
The index for <bos> token.
Returns
-------
new_label : tensor
The new label with <bos> at the beginning.
"""
# paddle.cast returns a new int64 tensor, so the input label is left untouched
new_label = paddle.cast(label, "int64")
batch_size = label.shape[0]
bos = paddle.full([batch_size, 1], bos_index, dtype=new_label.dtype)
new_label = paddle.concat([bos, new_label], axis=1)
return new_label
def append_eos_token(label, length, eos_index):
"""Create labels with <eos> token appended.
Arguments
---------
label : IntTensor
Containing the original labels. Must be of size: [batch_size, max_length]
length : LongTensor
Containing the original length of each label sequences. Must be 1D.
eos_index : int
The index for <eos> token.
Returns
-------
new_label : tensor
The new label with <eos> appended.
"""
new_label = paddle.to_tensor(label, dtype="int32").clone()
batch_size = label.shape[0]
pad = paddle.zeros([batch_size, 1], dtype=new_label.dtype)
new_label = paddle.concat([new_label, pad], axis=1)
new_label[paddle.arange(batch_size), paddle.to_tensor(
length, dtype="int64")] = eos_index
return new_label
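The effect of `prepend_bos_token` and `append_eos_token` on a padded batch can be shown with nested lists. This is a sketch under stated assumptions: the helper names `prepend_bos` and `append_eos` are hypothetical, padding uses index 0, and `lengths` holds absolute token counts, as the integer indexing in `append_eos_token` suggests.

```python
def prepend_bos(labels, bos_index):
    """Put bos_index in front of every row."""
    return [[bos_index] + list(row) for row in labels]


def append_eos(labels, lengths, eos_index, pad_index=0):
    """Add one column, then write eos_index right after the valid tokens."""
    out = []
    for row, n in zip(labels, lengths):
        new_row = list(row) + [pad_index]  # one extra column per row
        new_row[n] = eos_index  # first position after the n valid tokens
        out.append(new_row)
    return out


labels = [[5, 6, 7], [8, 9, 0]]  # the second row ends with one padding slot
lengths = [3, 2]

with_bos = prepend_bos(labels, bos_index=1)
with_eos = append_eos(labels, lengths, eos_index=2)
print(with_bos)
print(with_eos)
```

The extra column is needed because the longest row has no free slot for `<eos>`; shorter rows overwrite their first padding position instead.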
def merge_char(sequences, space="_"):
"""Merge characters sequences into word sequences.
Arguments
---------
sequences : list
Each item contains a list, and this list contains a character sequence.
space : string
The token that represents a space. Default: _
Returns
-------
The list contains word sequences for each sentence.
"""
results = []
for seq in sequences:
words = "".join(seq).split(space)
results.append(words)
return results
def merge_csvs(data_folder, csv_lst, merged_csv):
"""Merging several csv files into one file.
Arguments
---------
data_folder : string
The folder to store csv files to be merged and after merging.
csv_lst : list
Filenames of csv file to be merged.
merged_csv : string
The filename to write the merged csv file.
"""
write_path = os.path.join(data_folder, merged_csv)
if os.path.isfile(write_path):
logger.info("Skipping merging. Completed in previous run.")
return
with open(os.path.join(data_folder, csv_lst[0])) as f:
header = f.readline()
lines = []
for csv_file in csv_lst:
with open(os.path.join(data_folder, csv_file)) as f:
for i, line in enumerate(f):
if i == 0:
# Checking header
if line != header:
raise ValueError("Different header for "
f"{csv_lst[0]} and {csv_file}.")
continue
lines.append(line)
with open(write_path, "w") as f:
f.write(header)
for line in lines:
f.write(line)
logger.info(f"{write_path} is created.")
def split_word(sequences, space="_"):
"""Split word sequences into character sequences.
Arguments
---------
sequences : list
Each item contains a list, and this list contains a word sequence.
space : string
The token that represents a space. Default: _
Returns
-------
The list contains character sequences for each sentence.
"""
results = []
for seq in sequences:
chars = list(space.join(seq))
results.append(chars)
return results
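`merge_char` and `split_word` are inverses of each other as long as the space token does not occur inside a word. The sketch below repeats the two function bodies with their indentation restored and runs a round trip; the sample input is made up for illustration.

```python
def merge_char(sequences, space="_"):
    """Join characters and split on the space token to recover words."""
    results = []
    for seq in sequences:
        words = "".join(seq).split(space)
        results.append(words)
    return results


def split_word(sequences, space="_"):
    """Join words with the space token and split into single characters."""
    results = []
    for seq in sequences:
        chars = list(space.join(seq))
        results.append(chars)
    return results


chars = [["a", "b", "_", "c", "d", "e"]]
words = merge_char(chars)
print(words)
print(split_word(words) == chars)
```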
# Copyright (c) 2023 speechbrain Authors. All Rights Reserved.
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Modified from speechbrain 2023 (https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/dataio/dataloader.py)
"""Paddle compatible DataLoaders
Essentially we extend Paddle DataLoader by adding the ability to save the
data loading state, so that a checkpoint may be saved in the middle of an
epoch.
Authors:
* Aku Rouhe 2020
"""
import collections
import functools
import logging
import warnings
import paddle
from paddle.io import DataLoader
from paddlespeech.s2t.io.speechbrain.data_utils import batch_pad_right
from paddlespeech.s2t.io.speechbrain.data_utils import mod_default_collate
from paddlespeech.s2t.io.speechbrain.dataset import DynamicItemDataset
from paddlespeech.s2t.io.speechbrain.sampler import ReproducibleRandomSampler
import numpy
PaddedData = collections.namedtuple("PaddedData", ["data", "lengths"])
class Wav2vec2DataLoader(DataLoader):
def __init__(self,
dataset,
batch_size=1,
shuffle=False,
sampler=None,
batch_sampler=None,
num_workers=0,
collate_fn=None,
pin_memory=False,
drop_last=False,
timeout=0,
worker_init_fn=None,
multiprocessing_context=None,
generator=None):
if isinstance(dataset[0], (tuple, list)):
return_list = True
else:
return_list = False
super().__init__(
dataset,
feed_list=None,
places=None,
return_list=return_list,
batch_sampler=batch_sampler,
batch_size=batch_size,
shuffle=shuffle,
drop_last=drop_last,
collate_fn=collate_fn,
num_workers=num_workers,
use_buffer_reader=True,
use_shared_memory=False,
timeout=timeout,
worker_init_fn=worker_init_fn)
if sampler is not None:
self.batch_sampler.sampler = sampler
def PaddedBatch(
examples,
padded_keys=None,
device_prep_keys=None,
padding_func=batch_pad_right,
padding_kwargs={},
nonpadded_stack=True, ):
__length = len(examples)
__keys = list(examples[0].keys())
__padded_keys = []
__device_prep_keys = []
res = {}
for key in __keys:
values = [example[key] for example in examples]
# Default convert usually does the right thing (numpy2tensor etc.)
# values = default_convert(values)
if (padded_keys is not None and key in padded_keys) or (
padded_keys is None and isinstance(values[0], numpy.ndarray)):
# Padding and PaddedData
__padded_keys.append(key)
padded = PaddedData(*padding_func(values, **padding_kwargs))
res[key] = padded
else:
# Default collate usually does the right thing
# (convert lists of equal sized tensors to batch tensors, etc.)
if nonpadded_stack:
values = mod_default_collate(values)
res[key] = values
if (device_prep_keys is not None and key in device_prep_keys) or (
device_prep_keys is None and
isinstance(values[0], paddle.Tensor)):
__device_prep_keys.append(key)
return res
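The per-key collation in `PaddedBatch` can be sketched in plain Python. The names `pad_right` and `collate` are hypothetical, lists stand in for arrays, and returning lengths relative to the longest item is an assumption about `batch_pad_right`, which is not shown in this file.

```python
def pad_right(sequences, pad_value=0):
    """Pad every sequence on the right up to the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = [
        list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences
    ]
    # Assumption: lengths are reported relative to the longest sequence.
    rel_lengths = [len(seq) / max_len for seq in sequences]
    return padded, rel_lengths


def collate(examples, padded_keys):
    """Group values by key; pad the keys listed in padded_keys."""
    batch = {}
    for key in examples[0]:
        values = [example[key] for example in examples]
        batch[key] = pad_right(values) if key in padded_keys else values
    return batch


examples = [
    {"id": "utt1", "sig": [0.1, 0.2, 0.3, 0.4]},
    {"id": "utt2", "sig": [0.5, 0.6]},
]
batch = collate(examples, padded_keys=["sig"])
print(batch["id"])
print(batch["sig"])
```

Keys that are not padded are passed through as plain lists here, whereas `PaddedBatch` additionally runs them through `mod_default_collate` when `nonpadded_stack` is true.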
def make_dataloader(dataset, stage, **loader_kwargs):
"""Makes a basic DataLoader.
For DynamicItemDatasets (which return dicts), use
PaddedBatch as the default collate_fn.
Shuffling gets implemented by ReproducibleRandomSampler.
Arguments
---------
dataset : Dataset
The dataset to make a DataLoader for.
stage : Stage
The stage of the experiment. Not used here; kept for interface
compatibility.
**loader_kwargs : dict
Keyword args to DataLoader, see Paddle DataLoader for
options.
Returns
-------
Wav2vec2DataLoader
The DataLoader built from the dataset and loader_kwargs.
"""
# PaddedBatch as default collation for DynamicItemDataset
if "collate_fn" not in loader_kwargs and isinstance(dataset,
DynamicItemDataset):
loader_kwargs["collate_fn"] = PaddedBatch
# Reproducible random sampling
if loader_kwargs.get("shuffle", False):
if loader_kwargs.get("sampler") is not None:
raise ValueError("Cannot specify both shuffle=True and a "
"sampler in loader_kwargs")
sampler = ReproducibleRandomSampler(dataset)
loader_kwargs["sampler"] = sampler
# Should delete shuffle because you can't set both Sampler and
# shuffle
# NOTE: the dict of loader options may get used elsewhere!
# However, this del doesn't touch those because loader_kwargs comes
# from a **kwargs dict.
del loader_kwargs["shuffle"]
# Create the loader
dataloader = Wav2vec2DataLoader(dataset, **loader_kwargs)
return dataloader
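The handling of `shuffle` in `make_dataloader` amounts to a small rewrite of the keyword arguments. The sketch below isolates that step; `resolve_shuffle` and the `make_sampler` callback are hypothetical names, and a string stands in for a `ReproducibleRandomSampler` instance.

```python
def resolve_shuffle(loader_kwargs, make_sampler):
    """Replace shuffle=True with an explicit sampler."""
    kwargs = dict(loader_kwargs)  # work on a copy, leave the caller's dict alone
    if kwargs.get("shuffle", False):
        if kwargs.get("sampler") is not None:
            raise ValueError("Cannot specify both shuffle=True and a "
                             "sampler in loader_kwargs")
        kwargs["sampler"] = make_sampler()
        # A sampler and a shuffle flag cannot be passed together.
        del kwargs["shuffle"]
    return kwargs


resolved = resolve_shuffle(
    {"batch_size": 4, "shuffle": True},
    make_sampler=lambda: "sampler-placeholder", )
print(resolved)
```

Building the sampler explicitly keeps the shuffling order under the control of one object, which is what makes the order reproducible.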
# Copyright (c) 2023 speechbrain Authors. All Rights Reserved.
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Modified from speechbrain 2023 (https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/dataio/dataset.py)
import contextlib
import copy
import logging
from types import MethodType
from paddle.io import Dataset
from paddlespeech.s2t.io.speechbrain.data_pipeline import DataPipeline
from paddlespeech.s2t.io.speechbrain.dataio import load_data_csv
from paddlespeech.s2t.io.speechbrain.dataio import load_data_json
logger = logging.getLogger(__name__)
class DynamicItemDataset(Dataset):
"""Dataset that reads, wrangles, and produces dicts.
Each data point dict provides some items (by key), for example, a path to a
wavefile with the key "wav_file". When a data point is fetched from this
Dataset, more items are produced dynamically, based on pre-existing items
and other dynamic created items. For example, a dynamic item could take the
wavfile path and load the audio from the disk.
The dynamic items can depend on other dynamic items: a suitable evaluation
order is used automatically, as long as there are no circular dependencies.
A specified list of keys is collected in the output dict. These can be items
in the original data or dynamic items. If some dynamic items are not
requested, nor depended on by other requested items, they won't be computed.
So for example if a user simply wants to iterate over the text, the
time-consuming audio loading can be skipped.
About the format:
Takes a dict of dicts as the collection of data points to read/wrangle.
The top level keys are data point IDs.
Each data point (example) dict should have the same keys, corresponding to
different items in that data point.
Altogether the data collection could look like this:
>>> data = {
... "spk1utt1": {
... "wav_file": "/path/to/spk1utt1.wav",
... "text": "hello world",
... "speaker": "spk1",
... },
... "spk1utt2": {
... "wav_file": "/path/to/spk1utt2.wav",
... "text": "how are you world",
... "speaker": "spk1",
... }
... }
NOTE
----
The top-level key, the data point id, is implicitly added as an item
in the data point, with the key "id"
Each dynamic item is configured by three things: a key, a func, and a list
of argkeys. The key should be unique among all the items (dynamic or not) in
each data point. The func is any callable, and it returns the dynamic item's
value. The callable is called with the values of other items as specified
by the argkeys list (as positional args, passed in the order specified by
argkeys).
Arguments
---------
data : dict
Dictionary containing single data points (e.g. utterances).
dynamic_items : list, optional
Configuration for the dynamic items produced when fetching an example.
List of DynamicItems or dicts with the format::
func: <callable> # To be called
takes: <list> # key or list of keys of args this takes
provides: key # key or list of keys that this provides
output_keys : dict, list, optional
List of keys (either directly available in data or dynamic items)
to include in the output dict when data points are fetched.
If a dict is given; it is used to map internal keys to output keys.
From the output_keys dict key:value pairs the key appears outside,
and value is the internal key.
"""
def __init__(
self,
data,
dynamic_items=[],
output_keys=[], ):
self.data = data
self.data_ids = list(self.data.keys())
static_keys = list(self.data[self.data_ids[0]].keys())
if "id" in static_keys:
raise ValueError("The key 'id' is reserved for the data point id.")
else:
static_keys.append("id")
self.pipeline = DataPipeline(static_keys, dynamic_items)
self.set_output_keys(output_keys)
def __len__(self):
return len(self.data_ids)
def __getitem__(self, index):
data_id = self.data_ids[index]
data_point = self.data[data_id]
return self.pipeline.compute_outputs({"id": data_id, **data_point})
def add_dynamic_item(self, func, takes=None, provides=None):
"""Makes a new dynamic item available on the dataset.
Two calling conventions. For DynamicItem objects, just use:
add_dynamic_item(dynamic_item).
But otherwise, should use:
add_dynamic_item(func, takes, provides).
Arguments
---------
func : callable, DynamicItem
If a DynamicItem is given, adds that directly. Otherwise a
DynamicItem is created, and this specifies the callable to use. If
a generator function is given, then create a GeneratorDynamicItem.
Otherwise creates a normal DynamicItem.
takes : list, str
List of keys. When func is called, each key is resolved to
either an entry in the data or the output of another dynamic_item.
The func is then called with these as positional arguments,
in the same order as specified here.
A single arg can be given directly.
provides : str
Unique key or keys that this provides.
"""
self.pipeline.add_dynamic_item(func, takes, provides)
def set_output_keys(self, keys):
"""Use this to change the output keys.
These are the keys that are actually evaluated when a data point
is fetched from the dataset.
Arguments
---------
keys : dict, list
List of keys (str) to produce in output.
If a dict is given; it is used to map internal keys to output keys.
From the output_keys dict key:value pairs the key appears outside,
and value is the internal key.
"""
self.pipeline.set_output_keys(keys)
@contextlib.contextmanager
def output_keys_as(self, keys):
"""Context manager to temporarily set output keys.
NOTE
----
Not thread-safe. While in this context manager, the output keys
are affected for any call.
"""
saved_output = self.pipeline.output_mapping
self.pipeline.set_output_keys(keys)
yield self
self.pipeline.set_output_keys(saved_output)
def filtered_sorted(
self,
key_min_value={},
key_max_value={},
key_test={},
sort_key=None,
reverse=False,
select_n=None, ):
"""Get a filtered and/or sorted version of this, shares static data.
The reason to implement these operations in the same method is that
computing some dynamic items may be expensive, and this way the
filtering and sorting steps don't need to compute the dynamic items
twice.
Arguments
---------
key_min_value : dict
Map from key (in data or in dynamic items) to limit, will only keep
data_point if data_point[key] >= limit
key_max_value : dict
Map from key (in data or in dynamic items) to limit, will only keep
data_point if data_point[key] <= limit
key_test : dict
Map from key (in data or in dynamic items) to func, will only keep
data_point if bool(func(data_point[key])) == True
sort_key : None, str
If not None, sort by data_point[sort_key]. Default is ascending
order.
reverse : bool
If True, sort in descending order.
select_n : None, int
If not None, only keep (at most) the first n filtered data_points.
The possible sorting is applied, but only on the first n data
points found. Meant for debugging.
Returns
-------
FilteredSortedDynamicItemDataset
Shares the static data, but has its own output keys and
dynamic items (initially deep copied from this, so they have the
same dynamic items available)
NOTE
----
Temporarily changes the output keys!
"""
filtered_sorted_ids = self._filtered_sorted_ids(
key_min_value,
key_max_value,
key_test,
sort_key,
reverse,
select_n, )
return FilteredSortedDynamicItemDataset(
self, filtered_sorted_ids) # NOTE: defined below
def _filtered_sorted_ids(
self,
key_min_value={},
key_max_value={},
key_test={},
sort_key=None,
reverse=False,
select_n=None, ):
"""Returns a list of data ids, fulfilling the sorting and filtering."""
def combined_filter(computed):
"""Applies filter."""
for key, limit in key_min_value.items():
# NOTE: docstring promises >= so using that.
# Mathematically could also use < for nicer syntax, but
# maybe with some super special weird edge case some one can
# depend on the >= operator
if computed[key] >= limit:
continue
return False
for key, limit in key_max_value.items():
if computed[key] <= limit:
continue
return False
for key, func in key_test.items():
if bool(func(computed[key])):
continue
return False
return True
temp_keys = (set(key_min_value.keys()) | set(key_max_value.keys()) |
set(key_test.keys()) |
set([] if sort_key is None else [sort_key]))
filtered_ids = []
with self.output_keys_as(temp_keys):
for i, data_id in enumerate(self.data_ids):
if select_n is not None and len(filtered_ids) == select_n:
break
data_point = self.data[data_id]
data_point["id"] = data_id
computed = self.pipeline.compute_outputs(data_point)
if combined_filter(computed):
if sort_key is not None:
# Add (main sorting index, current index, data_id)
# So that we maintain current sorting and don't compare
# data_id values ever.
filtered_ids.append((computed[sort_key], i, data_id))
else:
filtered_ids.append(data_id)
if sort_key is not None:
filtered_sorted_ids = [
tup[2] for tup in sorted(filtered_ids, reverse=reverse)
]
else:
filtered_sorted_ids = filtered_ids
return filtered_sorted_ids
@classmethod
def from_json(cls,
json_path,
replacements={},
dynamic_items=[],
output_keys=[]):
"""Load a data prep JSON file and create a Dataset based on it."""
data = load_data_json(json_path, replacements)
return cls(data, dynamic_items, output_keys)
@classmethod
def from_csv(cls,
csv_path,
replacements={},
dynamic_items=[],
output_keys=[]):
"""Load a data prep CSV file and create a Dataset based on it."""
data = load_data_csv(csv_path, replacements)
return cls(data, dynamic_items, output_keys)
@classmethod
def from_arrow_dataset(cls,
dataset,
replacements={},
dynamic_items=[],
output_keys=[]):
"""Loading a prepared huggingface dataset"""
# define an unbound method to generate pseudo keys
def keys(self):
"Returns the keys."
return [i for i in range(dataset.__len__())]
# bind this method to arrow dataset
dataset.keys = MethodType(keys, dataset)
return cls(dataset, dynamic_items, output_keys)
class FilteredSortedDynamicItemDataset(DynamicItemDataset):
"""Possibly filtered, possibly sorted DynamicItemDataset.
Shares the static data (reference).
Has its own dynamic_items and output_keys (deepcopy).
"""
def __init__(self, from_dataset, data_ids):
self.data = from_dataset.data
self.data_ids = data_ids
self.pipeline = copy.deepcopy(from_dataset.pipeline)
@classmethod
def from_json(cls,
json_path,
replacements={},
dynamic_items=None,
output_keys=None):
raise TypeError(
"Cannot create FilteredSortedDynamicItemDataset directly!")
@classmethod
def from_csv(cls,
csv_path,
replacements={},
dynamic_items=None,
output_keys=None):
raise TypeError(
"Cannot create FilteredSortedDynamicItemDataset directly!")
def add_dynamic_item(datasets, func, takes=None, provides=None):
"""Helper for adding the same item to multiple datasets."""
for dataset in datasets:
dataset.add_dynamic_item(func, takes, provides)
def set_output_keys(datasets, output_keys):
"""Helper for setting the same output keys on multiple datasets."""
for dataset in datasets:
dataset.set_output_keys(output_keys)
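The selection logic behind `filtered_sorted` can be sketched on a plain dict of dicts, without the dynamic-item pipeline. `filtered_sorted_ids` below is a simplified stand-in that works on static values only, and the utterance durations are invented for illustration.

```python
def filtered_sorted_ids(data,
                        key_min_value=None,
                        key_max_value=None,
                        sort_key=None,
                        reverse=False):
    """Return the ids that pass the limits, optionally sorted by sort_key."""
    key_min_value = key_min_value or {}
    key_max_value = key_max_value or {}
    kept = []
    for i, (data_id, point) in enumerate(data.items()):
        if any(point[k] < limit for k, limit in key_min_value.items()):
            continue
        if any(point[k] > limit for k, limit in key_max_value.items()):
            continue
        kept.append((i, data_id, point))
    if sort_key is None:
        return [data_id for _, data_id, _ in kept]
    # Sort on (value, original index, id) so ties keep their original order
    # and the ids themselves decide nothing unless value and index are equal.
    ranked = sorted(
        ((point[sort_key], i, data_id) for i, data_id, point in kept),
        reverse=reverse)
    return [data_id for _, _, data_id in ranked]


data = {
    "utt1": {"duration": 4.2},
    "utt2": {"duration": 1.0},
    "utt3": {"duration": 2.5},
    "utt4": {"duration": 12.0},
}
ids = filtered_sorted_ids(
    data,
    key_min_value={"duration": 2.0},
    key_max_value={"duration": 10.0},
    sort_key="duration")
print(ids)
```

Sorting by duration is a common way to reduce padding within a batch, since neighbouring utterances then have similar lengths.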
# Copyright (c) 2023 speechbrain Authors. All Rights Reserved.
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Modified from speechbrain 2023 (https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/utils/depgraph.py)
"""A dependency graph for finding evaluation order.
Authors:
* Aku Rouhe 2020
"""
import collections
import uuid
class CircularDependencyError(ValueError):
"""
An error caused by running into circular dependencies while searching for
an evaluation order in a DependencyGraph.
"""
pass
DGNode = collections.namedtuple("DGNode", ["key", "edges", "data"])
# A node in DependencyGraph.
class DependencyGraph:
"""General-purpose dependency graph.
Essentially a directed acyclic graph.
Usually used to find an evaluation order for e.g. variable substitution
The relation that an edge between A and B represents is:
"A depends on B, i.e. B should be evaluated before A"
Nodes can be added explicitly or they can be created implicitly
while adding edges.
Nodes have keys, which should be some hashable value that identifies
the elements the graph represents in your use case. E.G. they can just
be the variable name you want to substitute.
However, if needed, more generally you can attach any data to a node
(e.g. a path in your tree), and if so desired, a unique key can be
created for you. You'll only need to know that key while adding edges
to/from it.
Implicit keys and explicit keys can also be mixed.
"""
def __init__(self):
self.digraph = []
self.key2ind = {}
# Guard for manual duplicates (but not implicitly added ones)
self._manually_added_keys = []
@staticmethod
def get_unique_key():
"""Returns a unique hashable identifier."""
return uuid.uuid4()
def add_node(self, key=None, data=None):
"""Adds a node explicitly.
Arguments
---------
key : hashable, optional
If not given, a key is created for you.
data : Any, optional
Any additional data you wish to attach to this node.
Returns
-------
hashable
The key that was used (either yours or generated).
Raises
------
ValueError
If node with the given key has already been added explicitly
(with this method, not "add_edge").
"""
if key is None:
key = self.get_unique_key()
elif key in self._manually_added_keys:
raise ValueError("Adding duplicate node: {key}".format(key=key))
else:
self._manually_added_keys.append(key)
if key in self.key2ind: # Implicitly added already; don't add again.
ind = self.key2ind[key]
node = self.digraph[ind]
# All that this operation can do is add data:
self.digraph[ind] = DGNode(node.key, node.edges, data)
return key
self.key2ind[key] = len(self.digraph)
self.digraph.append(DGNode(key, [], data))
return key
def add_edge(self, from_key, to_key):
"""Adds an edge, and implicitly also creates nodes for keys which have
not been seen before. This will not let you add data to your nodes.
The relation encodes: "from_key depends on to_key"
(to_key must be evaluated before from_key).
Arguments
---------
from_key : hashable
The key that depends on `to_key`.
to_key : hashable
The key which is depended on.
Returns
-------
None
"""
from_ind = self._get_ind_and_add_if_new(from_key)
to_ind = self._get_ind_and_add_if_new(to_key)
edges_list = self.digraph[from_ind].edges
if to_ind not in edges_list:
edges_list.append(to_ind)
def _get_ind_and_add_if_new(self, key):
# Used internally to implicitly add nodes for unseen keys
if key not in self.key2ind:
self.key2ind[key] = len(self.digraph)
self.digraph.append(DGNode(key, [], None))
return self.key2ind[key]
def is_valid(self):
"""Checks if an evaluation order can be found.
A dependency graph is evaluatable if there are no circular
dependencies, i.e., the graph is acyclic.
Returns
-------
bool
Indicating if the graph is evaluatable.
"""
return not self._find_first_cycle()
def get_evaluation_order(self, selected_keys=None):
"""Finds one valid evaluation order.
There can be many different valid
orders.
NOTE: Generates output one DGNode at a time. May generate DGNodes
before it finds a circular dependency. If you really need to know
whether an order can be found, check is_valid() first. However,
the algorithm for finding cycles is essentially the same as the one
used for finding an evaluation order, so on very large graphs this
check roughly doubles the traversal cost.
Arguments
---------
selected_keys : list, None
List of keys. If not None, only the selected keys are guaranteed
in the evaluation order (along with the keys they depend on).
Yields
------
DGNode
The added DGNodes in a valid evaluation order.
See the DGNode namedtuple above.
Raises
------
CircularDependencyError
If a circular dependency is found.
"""
seen_ever = set()
def toposort(root_ind, visited):
"""Implementation of toposort."""
nonlocal seen_ever
here = visited + [root_ind]
if root_ind in visited:
raise CircularDependencyError("{cycle}".format(
cycle=" -> ".join(str(self.digraph[i].key) for i in here)))
if root_ind in seen_ever:
return # Yield nothing
seen_ever = seen_ever.union(set([root_ind]))
for to_ind in self.digraph[root_ind].edges:
for ind in toposort(to_ind, visited=here):
yield ind
yield root_ind
if selected_keys is None:
start_inds = range(len(self.digraph))
else:
start_inds = [self.key2ind[key] for key in selected_keys]
for start_ind in start_inds:
for ind in toposort(start_ind, []):
yield self.digraph[ind]
def _find_first_cycle(self):
"""Depth-first search based algorithm for finding cycles in the graph."""
seen_ever = set()
def cycle_dfs(root_ind, visited):
"""Implementation of cycle_dfs."""
nonlocal seen_ever
here = visited + [root_ind]
if root_ind in visited:
return here
if root_ind in seen_ever:
return []
seen_ever = seen_ever.union(set([root_ind]))
for to_ind in self.digraph[root_ind].edges:
cycle = cycle_dfs(to_ind, here)
if cycle:
return cycle
return []
for ind in range(len(self.digraph)):
if ind not in seen_ever:
cycle = cycle_dfs(ind, [])
if cycle:
return cycle
return []
def __contains__(self, key):
# Allows the syntax:
# 'key' in dependency_graph
return key in self.key2ind
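The depth-first search used by `get_evaluation_order` can be sketched on a plain dict that maps each key to the keys it depends on. `evaluation_order` is a hypothetical helper that returns a list instead of yielding `DGNode` objects, and the key names are invented.

```python
def evaluation_order(depends_on):
    """Return keys so that every key comes after the keys it depends on."""
    order = []
    done = set()

    def visit(key, path):
        # A key that is already on the current path means a cycle.
        if key in path:
            raise ValueError("circular dependency: " +
                             " -> ".join(path + [key]))
        if key in done:
            return
        done.add(key)
        for dep in depends_on.get(key, []):
            visit(dep, path + [key])
        order.append(key)  # emitted only after all of its dependencies

    for key in depends_on:
        visit(key, [])
    return order


deps = {
    "loss": ["logits", "labels"],
    "logits": ["wav"],
    "labels": [],
    "wav": [],
}
order = evaluation_order(deps)
print(order)
```

As in the class above, the cycle check runs before the already-seen check, so a dependency loop is reported with the full path that leads back to the repeated key.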
# Copyright (c) 2023 speechbrain Authors. All Rights Reserved.
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Modified from speechbrain 2023 (https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/core.py)
from paddlespeech.s2t.io.speechbrain import dataloader
def _train_loader_specifics(self, dataset, loader_kwargs):
sampler = loader_kwargs.get("sampler", None)
# Shuffling should really only matter for the train stage. Shuffling
# will also lead to more padding in batches if the order was otherwise
# sorted by length.
shuffle = loader_kwargs.get("shuffle", False)
if shuffle and not self.distributed_launch:
if sampler is not None:
raise ValueError("Cannot specify both shuffle=True "
"and a sampler in loader_kwargs")
sampler = ReproducibleRandomSampler(dataset)
self.train_sampler = sampler
loader_kwargs["sampler"] = self.train_sampler
# Delete the shuffle flag, since you cannot specify both a sampler and
# shuffling:
del loader_kwargs["shuffle"]
# Possibly make a DistributedSampler or a wrapper for some other sampler
if self.distributed_launch and not isinstance(dataset, IterableDataset):
drop_last = loader_kwargs.get("drop_last", False)
# num_replicas arg is equal to world_size
# and retrieved automatically within
# DistributedSampler obj.
if sampler is not None:
self.train_sampler = DistributedSamplerWrapper(
sampler,
rank=self.rank,
drop_last=drop_last,
shuffle=shuffle, )
# with DistributedSamplerWrapper, one must disable shuffling for dataloader
loader_kwargs["shuffle"] = False
loader_kwargs["sampler"] = self.train_sampler
elif loader_kwargs.get("batch_sampler") is None:
# no sampler and batch-sampler
self.train_sampler = DistributedSampler(
dataset, rank=self.rank, shuffle=True, drop_last=drop_last)
# with DistributedSamplerWrapper, one must disable shuffling for dataloader
loader_kwargs["shuffle"] = False
loader_kwargs["sampler"] = self.train_sampler
else: # batch_sampler was specified
self.train_sampler = DistributedSamplerWrapper(
loader_kwargs.get("batch_sampler", None),
rank=self.rank,
shuffle=True, )
loader_kwargs["batch_sampler"] = self.train_sampler
elif self.distributed_launch and isinstance(dataset, IterableDataset):
logger.warning("Cannot automatically solve distributed sampling "
"for IterableDataset.")
return loader_kwargs
def make_dataloader(self, dataset, stage, **loader_kwargs):
"""Creates DataLoaders for Datasets.
This is used by ``fit()`` and ``evaluate()`` if they just receive
Datasets.
Alternatively, this can be called from outside the Brain subclass.
In that case, the DataLoader should be passed to ``fit()`` in place
of the dataset.
The Stage.TRAIN DataLoader is handled specially. It has extra args for
shuffle and drop_last. In DDP a DistributedSampler is created (unless
the dataset is an IterableDataset).
NOTE
----
Some important DataLoader arguments are passed via **loader_kwargs,
e.g., batch_size, num_workers, pin_memory.
NOTE
----
By default, ``evaluate()`` specifies ckpt_prefix=None to stop the test
DataLoader being added to the checkpointer. If you need to add a
recoverable after saving checkpoints (e.g., at test time, after
checkpointing the training), and still be able to recover reasonably,
you should probably specify ``allow_partial_load=True``.
Arguments
---------
dataset : Dataset
A set of data to use to create data loader. If the Dataset is a
DynamicItemDataset, PaddedBatch is used as the default collate_fn,
unless specified in loader_kwargs.
stage : Stage
The stage of the experiment: Stage.TRAIN, Stage.VALID, Stage.TEST
ckpt_prefix : str, None
Prefix to use for SaveableDataLoader Checkpoint name. The Stage
name is added to this to create the full key. Set to None to not
save the DataLoader.
**loader_kwargs : dict
Additional keyword arguments to the DataLoader.
E.g., batch_size, num_workers, pin_memory.
"""
dataloader_ = dataloader.make_dataloader(dataset, **loader_kwargs)
return dataloader_
# Copyright (c) 2023 speechbrain Authors. All Rights Reserved.
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Modified from speechbrain 2023 (https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/dataio/sampler.py)
"""Paddle compatible samplers.
These determine the order of iteration through a dataset.
Authors:
* Aku Rouhe 2020
* Samuele Cornell 2020
* Ralf Leibold 2020
* Artem Ploujnikov 2021
* Andreas Nautsch 2021
"""
import logging
from collections import Counter
from typing import List
import numpy as np
import paddle
from paddle.io import RandomSampler
from paddle.io import Sampler
from paddle.io import WeightedRandomSampler
from scipy.stats import lognorm
from paddlespeech.s2t.io.speechbrain.dataset import DynamicItemDataset
logger = logging.getLogger(__name__)
class ReproducibleRandomSampler(RandomSampler):
"""A modification of RandomSampler which always returns the same values.
Also look at `paddle.io.RandomSampler`. This has mostly
the same behaviour and arguments, except for adding 'seed' and 'epoch' and
not supporting 'generator'.
Note
----
Call `set_epoch` before every epoch. Otherwise, the sampler will produce the
same sequence of indices every epoch.
Arguments
---------
data_source : Dataset
The data source to sample indices for.
seed : int
The base seed to use for the random number generator. It is recommended
to use a value which has a good mix of 0 and 1 bits.
epoch : int
The epoch to start at.
"""
def __init__(self, data_source, seed=563375142, epoch=0, **kwargs):
if "generator" in kwargs:
MSG = ("Cannot give a separate generator when using " +
"ReproducibleRandomSampler")
raise ValueError(MSG)
super().__init__(data_source, **kwargs)
self.seed = int(seed)
self.epoch = epoch
self.gen = paddle.seed(1)
def set_epoch(self, epoch):
"""
You can also just access self.epoch, but we maintain this interface
to mirror paddle.io.DistributedBatchSampler
"""
self.epoch = epoch
def __iter__(self):
self.gen.manual_seed(self.seed + self.epoch)
return super().__iter__()
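The idea behind the `seed + epoch` reseeding can be shown without Paddle. The sketch below is an assumption-laden stand-in: `EpochSeededOrder` is a hypothetical class that uses the standard-library `random.Random` instead of a Paddle generator, and it only demonstrates why `set_epoch` must be called to get a new order.

```python
import random


class EpochSeededOrder:
    """Yield a permutation of indices that depends only on seed and epoch."""

    def __init__(self, num_items, seed=563375142, epoch=0):
        self.num_items = num_items
        self.seed = int(seed)
        self.epoch = epoch

    def set_epoch(self, epoch):
        # Without this call, every pass yields the same permutation.
        self.epoch = epoch

    def __iter__(self):
        # A fresh generator per pass makes the order reproducible.
        rng = random.Random(self.seed + self.epoch)
        order = list(range(self.num_items))
        rng.shuffle(order)
        return iter(order)
```

Iterating twice at the same epoch gives the same order, which is what allows a training run to be resumed at a given epoch with identical data ordering.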
class ReproducibleWeightedRandomSampler(WeightedRandomSampler):
"""A reproducible modification of WeightedRandomSampler.
Also look at `paddle.io.WeightedRandomSampler`. This has
the same behaviour and arguments, except for adding 'seed' and 'epoch' and
not supporting 'generator'.
Note
----
Call `set_epoch` before every epoch. Otherwise, the sampler will produce the
same sequence of indices every epoch.
Arguments
---------
weights : sequence of float
Weights for each index. Doesn't need to sum to one.
num_samples : int
Number of samples to draw
replacement : bool
To draw with replacement or not (within an epoch of num_samples).
seed : int
The base seed to use for the random number generator. It is recommended
to use a value which has a good mix of 0 and 1 bits.
epoch : int
The epoch to start at.
"""
def __init__(
self,
weights,
num_samples,
replacement,
seed=129491412,
epoch=0,
**kwargs, ):
if "generator" in kwargs:
MSG = ("Cannot give a separate generator when using " +
"ReproducibleWeightedRandomSampler")
raise ValueError(MSG)
super().__init__(weights, num_samples, replacement, **kwargs)
self.seed = int(seed)
self.epoch = epoch
self.gen = paddle.seed(1)
def set_epoch(self, epoch):
"""
You can also just access self.epoch, but we maintain this interface
to mirror paddle.io.DistributedBatchSampler
"""
self.epoch = epoch
def __iter__(self):
self.gen.manual_seed(self.seed + self.epoch)
return super().__iter__()
class DynamicBatchSampler(Sampler):
"""This BatchSampler batches examples together by grouping them by their length.
Every example in the batch has approximately the same length and
thus padding is minimized.
This enables faster training on datasets
where the length of examples can vary significantly (e.g., Librispeech).
Inspired by: https://www.tensorflow.org/api_docs/python/tf/data/experimental/bucket_by_sequence_length
Dynamic batching is performed by specifying a max_batch_length which is the
upper limit for the sum of the length of examples in a batch:
e.g., if ex1 has length 4, ex2 length 5 and if max_batch_length is set to 6
ex1 and ex2 will be placed, alone, in two distinct batches.
Length for each example can be obtained in two manners.
If the input dataset is a DynamicItemDataset it can be obtained by specifying a
length_func. Default assumes a "duration" entry is in the annotation.
Length for each example can also be passed to this class upon instantiation
by specifying a list containing the length for each example and passing it to
lengths_list.
Examples are grouped together by defining a set of possible discrete intervals
(buckets). Examples whose length falls into the same interval can be batched together.
The number of buckets can be specified by using the arg num_buckets.
There is usually an optimal range for the value of this argument.
If num_buckets == 1, all examples can be batched together. You have maximum randomization
but your training speed will be slower due to the fact that a large amount of the values will be padding
as long and short examples can be batched together.
As the number of buckets grows, only examples with similar
length can be grouped together.
This trades off speed against randomization.
TLDR: Low number -> better randomization, High number -> faster training.
NOTE: if num_buckets is set too high, training speed will decrease. As num_buckets approaches the number
of examples in the dataset, batches become small, which hurts training speed and possibly performance.
The buckets can also be specified by passing a list to the bucket_boundaries
argument instead of specifying a left_bucket_length and a bucket_length_multiplier.
"""
def __init__(
self,
dataset,
max_batch_length: int,
num_buckets: int=None,
length_func=lambda x: x["duration"],
shuffle: bool=True,
batch_ordering: str="random",
max_batch_ex: int=None,
bucket_boundaries: List[int]=[],
lengths_list: List[int]=None,
seed: int=42,
epoch: int=0,
drop_last: bool=False,
verbose: bool=False, ):
self._dataset = dataset
self._ex_lengths = {}
ex_ids = self._dataset.data_ids
self.verbose = verbose
# We do not put a default on num_buckets to encourage users to play with this parameter
if num_buckets is None and len(bucket_boundaries) == 0:
raise RuntimeError(
"Please specify either num_buckets or bucket boundaries. "
"Check the docs, and/or the tutorial!")
if lengths_list is not None:
# take length of examples from this argument and bypass length_key
for indx in range(len(lengths_list)):
self._ex_lengths[str(indx)] = lengths_list[indx]
else:
# use length func
if not isinstance(dataset, DynamicItemDataset):
raise NotImplementedError(
"Dataset should be a DynamicItemDataset when using length function"
)
for indx in range(len(self._dataset)):
self._ex_lengths[str(indx)] = length_func(
self._dataset.data[ex_ids[indx]])
if len(bucket_boundaries) > 0:
if not all([x >= 0 for x in bucket_boundaries]):
raise ValueError(
"All elements in bucket boundaries should be non-negative (>= 0)."
)
if not len(set(bucket_boundaries)) == len(bucket_boundaries):
raise ValueError(
"Bucket_boundaries should not contain duplicates.")
np.testing.assert_array_equal(
np.array(bucket_boundaries),
np.array(sorted(bucket_boundaries)),
err_msg="The arg bucket_boundaries should be an ascending sorted list of non-negative values!",
)
self._bucket_boundaries = np.array(sorted(bucket_boundaries))
else:
# use num_buckets
self._bucket_boundaries = np.array(
self._get_boundaries_through_warping(
max_batch_length=max_batch_length,
num_quantiles=num_buckets, ))
self._max_batch_length = max_batch_length
self._shuffle_ex = shuffle
self._batch_ordering = batch_ordering
self._seed = seed
self._drop_last = drop_last
if max_batch_ex is None:
max_batch_ex = np.inf
self._max_batch_ex = max_batch_ex
# Calculate bucket lengths - how often does one bucket boundary fit into max_batch_length?
self._bucket_lens = [
max(1, int(max_batch_length / self._bucket_boundaries[i]))
for i in range(len(self._bucket_boundaries))
] + [1]
self._epoch = epoch
self._generate_batches()
def get_durations(self, batch):
"""Gets durations of the elements in the batch."""
return [self._ex_lengths[str(idx)] for idx in batch]
def _get_boundaries_through_warping(
self,
max_batch_length: int,
num_quantiles: int, ) -> List[int]:
# NOTE: the following lines do not cover that there is only one example in the dataset
# warp frames (duration) distribution of train data
logger.info("Batch quantisation in latent space")
# linspace set-up
num_boundaries = num_quantiles + 1
# create latent linearly equal spaced buckets
latent_boundaries = np.linspace(
1 / num_boundaries,
num_quantiles / num_boundaries,
num_quantiles, )
# get quantiles using lognormal distribution
quantiles = lognorm.ppf(latent_boundaries, 1)
# scale up to max_batch_length
bucket_boundaries = quantiles * max_batch_length / quantiles[-1]
# compute resulting bucket length multipliers
length_multipliers = [
bucket_boundaries[x + 1] / bucket_boundaries[x]
for x in range(num_quantiles - 1)
]
# logging
logger.info(
"Latent bucket boundary - buckets: {} - length multipliers: {}".
format(
list(map("{:.2f}".format, bucket_boundaries)),
list(map("{:.2f}".format, length_multipliers)), ))
return list(sorted(bucket_boundaries))
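The warping step can be reproduced with the standard library alone. This sketch assumes that `lognorm.ppf(p, 1)` with default location and scale equals the exponential of the standard normal quantile; `warped_bucket_boundaries` is a hypothetical helper name, not part of the module.

```python
import math
from statistics import NormalDist


def warped_bucket_boundaries(max_batch_length, num_buckets):
    """Place bucket boundaries at lognormal quantiles, scaled to the budget."""
    num_boundaries = num_buckets + 1
    # Equally spaced probabilities strictly inside (0, 1).
    probs = [(i + 1) / num_boundaries for i in range(num_buckets)]
    # Lognormal quantile with shape 1: exp of the standard normal quantile.
    quantiles = [math.exp(NormalDist().inv_cdf(p)) for p in probs]
    # Rescale so that the last boundary equals max_batch_length.
    scale = max_batch_length / quantiles[-1]
    return [q * scale for q in quantiles]
```

Because the lognormal is right-skewed, the resulting boundaries are dense for short utterances and sparse for long ones, which roughly follows how speech durations tend to be distributed.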
def _permute_batches(self):
if self._batch_ordering == "random":
# deterministically shuffle based on epoch and seed
gen = paddle.seed(1)
gen.manual_seed(self._seed + self._epoch)
sampler = paddle.randperm(
len(self._batches)).tolist() # type: ignore
tmp = []
for idx in sampler:
tmp.append(self._batches[idx])
self._batches = tmp
elif self._batch_ordering == "ascending":
self._batches = sorted(
self._batches,
key=lambda x: max([self._ex_lengths[str(idx)] for idx in x]), )
elif self._batch_ordering == "descending":
self._batches = sorted(
self._batches,
key=lambda x: max([self._ex_lengths[str(idx)] for idx in x]),
reverse=True, )
else:
raise NotImplementedError
def _generate_batches(self):
logger.info("DynamicBatchSampler: Generating dynamic batches")
if self._shuffle_ex:
# deterministically shuffle based on epoch and seed
gen = paddle.seed(1)
gen.manual_seed(self._seed + self._epoch)
sampler = paddle.randperm(
len(self._dataset)).tolist() # type: ignore
else:
# take examples as they are: e.g. they have been sorted
sampler = range(len(self._dataset)) # type: ignore
self._batches = []
bucket_batches = [[] for i in self._bucket_lens]
stats_tracker = [{
"min": np.inf,
"max": -np.inf,
"tot": 0,
"n_ex": 0
} for i in self._bucket_lens]
for idx in sampler:
# length of pre-sampled audio
item_len = self._ex_lengths[str(idx)]
# bucket to fill up most padding
bucket_id = np.searchsorted(self._bucket_boundaries, item_len)
# fill audio's duration into that bucket
bucket_batches[bucket_id].append(idx)
stats_tracker[bucket_id]["min"] = min(
stats_tracker[bucket_id]["min"], item_len)
stats_tracker[bucket_id]["max"] = max(
stats_tracker[bucket_id]["max"], item_len)
stats_tracker[bucket_id]["tot"] += item_len
stats_tracker[bucket_id]["n_ex"] += 1
# track #samples - why not duration/#frames; rounded up?
# keep track of durations, if necessary
if (len(bucket_batches[bucket_id]) >= self._bucket_lens[bucket_id]
or len(bucket_batches[bucket_id]) >= self._max_batch_ex):
self._batches.append(bucket_batches[bucket_id])
bucket_batches[bucket_id] = []
# keep track of durations
# Dump remaining batches
if not self._drop_last:
for batch in bucket_batches:
if batch:
self._batches.append(batch)
self._permute_batches() # possibly reorder batches
if self._epoch == 0: # only log at first epoch
# frames per batch & their padding remaining
boundaries = [0] + self._bucket_boundaries.tolist()
for bucket_indx in range(len(self._bucket_boundaries)):
try:
num_batches = stats_tracker[bucket_indx]["tot"] // (
self._max_batch_length)
pad_factor = (stats_tracker[bucket_indx]["max"] -
stats_tracker[bucket_indx]["min"]) / (
stats_tracker[bucket_indx]["tot"] /
stats_tracker[bucket_indx]["n_ex"])
except ZeroDivisionError:
num_batches = 0
pad_factor = 0
logger.info((
"DynamicBatchSampler: Bucket {} with boundary {:.1f}-{:.1f} and "
+
"batch_size {}: Num Examples {:.1f}, Num Full Batches {:.3f}, Pad Factor {:.3f}."
).format(
bucket_indx,
boundaries[bucket_indx],
boundaries[bucket_indx + 1],
self._bucket_lens[bucket_indx],
stats_tracker[bucket_indx]["n_ex"],
num_batches,
pad_factor * 100, ))
if self.verbose:
batch_stats = {
"tot_frames": [],
"tot_pad_frames": [],
"pad_%": [],
}
for batch in self._batches:
tot_frames = sum(
[self._ex_lengths[str(idx)] for idx in batch])
batch_stats["tot_frames"].append(tot_frames)
max_frames = max(
[self._ex_lengths[str(idx)] for idx in batch])
tot_pad = sum([
max_frames - self._ex_lengths[str(idx)] for idx in batch
])
batch_stats["tot_pad_frames"].append(tot_pad)
batch_stats["pad_%"].append(tot_pad / tot_frames * 100)
padding_details = "Batch {} with {:.1f} frames with {} files - {:.1f} padding, {:.2f} (%) of total."
padding_details = "DynamicBatchSampler: " + padding_details
for i in range(len(self._batches)):
logger.info(
padding_details.format(
i,
batch_stats["tot_frames"][i],
len(self._batches[i]),
batch_stats["tot_pad_frames"][i],
batch_stats["pad_%"][i], ))
def __iter__(self):
for batch in self._batches:
yield batch
if self._shuffle_ex: # re-generate examples if ex_ordering == "random"
self._generate_batches()
if self._batch_ordering == "random":
# we randomly permute the batches only --> faster
self._permute_batches()
def set_epoch(self, epoch):
"""
You can also just access self.epoch, but we maintain this interface
to mirror paddle.io.DistributedBatchSampler
"""
self._epoch = epoch
self._generate_batches()
def __len__(self):
return len(self._batches)
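The core of `_generate_batches` is a bucket lookup followed by a per-bucket size check. The sketch below isolates that logic under a few assumptions: `bucket_batches` is a hypothetical function, it takes plain lists instead of a dataset, it uses `bisect_left` in place of `np.searchsorted` (both return the leftmost insertion point), and it omits shuffling, `max_batch_ex`, and logging.

```python
from bisect import bisect_left


def bucket_batches(lengths, boundaries, max_batch_length, drop_last=False):
    """Group example indices into batches of similar length."""
    # Examples per batch in each bucket: how many boundary-length items fit
    # in the budget. The extra trailing bucket holds over-long examples.
    bucket_sizes = [max(1, int(max_batch_length / b)) for b in boundaries] + [1]
    pending = [[] for _ in bucket_sizes]
    batches = []
    for idx, length in enumerate(lengths):
        bucket_id = bisect_left(boundaries, length)
        pending[bucket_id].append(idx)
        # Emit the batch as soon as the bucket is full.
        if len(pending[bucket_id]) >= bucket_sizes[bucket_id]:
            batches.append(pending[bucket_id])
            pending[bucket_id] = []
    if not drop_last:
        # Flush partially filled buckets as smaller batches.
        batches.extend(batch for batch in pending if batch)
    return batches
```

With `boundaries=[2, 4, 8]` and `max_batch_length=8`, the per-bucket batch sizes are 4, 2, 1 and 1, so short examples are packed four at a time while examples longer than 4 are batched alone.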
class BalancingDataSampler(ReproducibleWeightedRandomSampler):
"""A data sampler that takes a single key from the dataset and
ensures an approximately equal distribution by that key
Arguments
---------
dataset: DynamicItemDataset
the dataset from which samples will be drawn
key: str
the key from which samples will be taken
num_samples : int
Number of samples to draw
replacement : bool
To draw with replacement or not (within an epoch of num_samples).
seed : int
The base seed to use for the random number generator. It is recommended
to use a value which has a good mix of 0 and 1 bits.
epoch : int
The epoch to start at.
"""
def __init__(
self,
dataset,
key,
num_samples=None,
replacement=True,
seed=563375142,
epoch=0,
**kwargs, ):
self.dataset = dataset
self.key = key
if not num_samples:
num_samples = len(dataset)
weights = self._compute_weights()
super().__init__(weights, num_samples, replacement, seed, epoch,
**kwargs)
def _compute_weights(self):
with self.dataset.output_keys_as([self.key]):
class_ids = [item[self.key] for item in self.dataset]
class_counter = Counter(class_ids)
weights = 1 / paddle.to_tensor(
[class_counter[class_id] for class_id in class_ids])
return weights
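The weighting used by `BalancingDataSampler` is the inverse of each example's class frequency. A minimal sketch with plain Python lists instead of a Paddle tensor (`balancing_weights` is a hypothetical name):

```python
from collections import Counter


def balancing_weights(class_ids):
    """Weight each example by the inverse of its class frequency."""
    counts = Counter(class_ids)
    return [1.0 / counts[class_id] for class_id in class_ids]
```

The weights of every class sum to the same value, so a weighted sampler draws each class with approximately equal probability regardless of how many examples it has.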
# Copyright (c) 2023 speechbrain Authors. All Rights Reserved.
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Modified from speechbrain 2023 (https://github.com/speechbrain/speechbrain/blob/develop/recipes/AISHELL-1/ASR/CTC/train_with_wav2vec.py)
import data_pipeline
import dataio
import numpy
import paddle
import tqdm
import transformers
from dataloader import make_dataloader
from hyperpyyaml import load_hyperpyyaml
import dataset
def dataio_prepare(hparams):
"""This function prepares the datasets to be used in the brain class.
It also defines the data processing pipeline through user-defined functions."""
data_folder = hparams["data_folder"]
train_data = dataset.DynamicItemDataset.from_csv(
csv_path=hparams["train_data"],
replacements={"data_root": data_folder}, )
if hparams["sorting"] == "ascending":
# we sort training data to speed up training and get better results.
train_data = train_data.filtered_sorted(sort_key="duration")
# when sorting, do not shuffle in the dataloader; otherwise sorting is pointless
hparams["train_dataloader_opts"]["shuffle"] = False
elif hparams["sorting"] == "descending":
train_data = train_data.filtered_sorted(
sort_key="duration", reverse=True)
# when sorting, do not shuffle in the dataloader; otherwise sorting is pointless
hparams["train_dataloader_opts"]["shuffle"] = False
elif hparams["sorting"] == "random":
pass
else:
raise NotImplementedError(
"sorting must be random, ascending or descending")
valid_data = dataset.DynamicItemDataset.from_csv(
csv_path=hparams["valid_data"],
replacements={"data_root": data_folder}, )
valid_data = valid_data.filtered_sorted(sort_key="duration")
test_data = dataset.DynamicItemDataset.from_csv(
csv_path=hparams["test_data"],
replacements={"data_root": data_folder}, )
test_data = test_data.filtered_sorted(sort_key="duration")
datasets = [train_data, valid_data, test_data]
# Defining tokenizer and loading it
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-chinese')
# 2. Define audio pipeline:
@data_pipeline.takes("wav")
@data_pipeline.provides("sig")
def audio_pipeline(wav):
sig = dataio.read_audio(wav)
return sig
dataset.add_dynamic_item(datasets, audio_pipeline)
# 3. Define text pipeline:
@data_pipeline.takes("transcript")
@data_pipeline.provides("wrd", "tokens_list", "tokens")
def text_pipeline(wrd):
wrd = "".join(wrd.split(" "))
yield wrd
tokens_list = tokenizer(wrd)["input_ids"]
yield tokens_list
tokens = numpy.array(tokens_list, dtype="int64")
yield tokens
dataset.add_dynamic_item(datasets, text_pipeline)
# 4. Set output:
dataset.set_output_keys(
datasets,
["id", "sig", "wrd", "tokens"], )
# 5. If Dynamic Batching is used, we instantiate the needed samplers.
train_batch_sampler = None
valid_batch_sampler = None
if hparams["dynamic_batching"]:
from sampler import DynamicBatchSampler # noqa
dynamic_hparams = hparams["dynamic_batch_sampler"]
num_buckets = dynamic_hparams["num_buckets"]
train_batch_sampler = DynamicBatchSampler(
train_data,
dynamic_hparams["max_batch_len"],
num_buckets=num_buckets,
length_func=lambda x: x["duration"],
shuffle=dynamic_hparams["shuffle_ex"],
batch_ordering=dynamic_hparams["batch_ordering"], )
valid_batch_sampler = DynamicBatchSampler(
valid_data,
dynamic_hparams["max_batch_len"],
num_buckets=num_buckets,
length_func=lambda x: x["duration"],
shuffle=dynamic_hparams["shuffle_ex"],
batch_ordering=dynamic_hparams["batch_ordering"], )
return (train_data, valid_data, test_data, tokenizer, train_batch_sampler,
valid_batch_sampler, )
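The `sorting` switch in `dataio_prepare` can be illustrated on plain dictionaries. This is only a sketch: `apply_sorting` is a hypothetical helper, the records stand in for `DynamicItemDataset` entries, and the options dictionary is copied instead of being modified in `hparams`.

```python
def apply_sorting(records, sorting, dataloader_opts):
    """Sort records by duration and disable shuffling when sorting is on."""
    opts = dict(dataloader_opts)
    if sorting == "ascending":
        records = sorted(records, key=lambda r: r["duration"])
        opts["shuffle"] = False  # shuffling would undo the sort
    elif sorting == "descending":
        records = sorted(records, key=lambda r: r["duration"], reverse=True)
        opts["shuffle"] = False
    elif sorting != "random":
        raise NotImplementedError(
            "sorting must be random, ascending or descending")
    return records, opts
```

Sorting by duration puts utterances of similar length in the same batch, which reduces padding at the cost of less random batch composition.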
hparams_file = 'train_with_wav2vec.yaml'
with open(hparams_file) as fin:
hparams = load_hyperpyyaml(fin, None)
(train_data, valid_data, test_data, tokenizer, train_bsampler,
valid_bsampler, ) = dataio_prepare(hparams)
train_dataloader_opts = hparams["train_dataloader_opts"]
valid_dataloader_opts = hparams["valid_dataloader_opts"]
if train_bsampler is not None:
train_dataloader_opts = {
"batch_sampler": train_bsampler,
"num_workers": hparams["num_workers"],
}
if valid_bsampler is not None:
valid_dataloader_opts = {"batch_sampler": valid_bsampler}
train_set = make_dataloader(train_data, stage='train', **train_dataloader_opts)
valid_set = make_dataloader(
valid_data,
stage='train',
**valid_dataloader_opts, )
for batch in valid_set:
print(batch)
print('done') # exit()
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. # Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
# #
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License. # you may not use this file except in compliance with the License.
......
# Authors # Copyright (c) 2023 speechbrain Authors. All Rights Reserved.
# * Peter Plantinga 2020 # Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
# * Francois Grondin 2020
# * William Aris 2020
# * Samuele Cornell 2020
# * Sarthak Yadav 2022
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
# #
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License. # you may not use this file except in compliance with the License.
...@@ -17,7 +12,16 @@ ...@@ -17,7 +12,16 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
# Modified from speechbrain(https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/processing/signal_processing.py) # Modified from speechbrain 2023 (https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/processing/signal_processing.py)
"""
Low level signal processing utilities
Authors
* Peter Plantinga 2020
* Francois Grondin 2020
* William Aris 2020
* Samuele Cornell 2020
* Sarthak Yadav 2022
"""
import numpy as np import numpy as np
import paddle import paddle
......
# Authors # Copyright (c) 2023 speechbrain Authors. All Rights Reserved.
# * Peter Plantinga 2020
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
# #
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
...@@ -14,6 +13,18 @@ ...@@ -14,6 +13,18 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
# Modified from speechbrain(https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/processing/speech_augmentation.py) # Modified from speechbrain(https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/processing/speech_augmentation.py)
"""Classes for mutating speech data for data augmentation.
This module provides classes that produce realistic distortions of speech
data for the purpose of training speech processing models. The list of
distortions includes adding noise, adding reverberation, changing speed,
and more. All the classes are of type `paddle.nn.Layer`. This gives the
possibility to have end-to-end differentiability and
backpropagate the gradient through them. In addition, all operations
are expected to be performed on the GPU (where available) for efficiency.
Authors
* Peter Plantinga 2020
"""
import math import math
import paddle import paddle
...@@ -64,7 +75,6 @@ class SpeedPerturb(nn.Layer): ...@@ -64,7 +75,6 @@ class SpeedPerturb(nn.Layer):
# Initialize index of perturbation # Initialize index of perturbation
self.samp_index = 0 self.samp_index = 0
# Initialize resamplers # Initialize resamplers
self.resamplers = [] self.resamplers = []
for speed in self.speeds: for speed in self.speeds:
...@@ -89,7 +99,6 @@ class SpeedPerturb(nn.Layer): ...@@ -89,7 +99,6 @@ class SpeedPerturb(nn.Layer):
# Don't perturb (return early) 1-`perturb_prob` portion of the batches # Don't perturb (return early) 1-`perturb_prob` portion of the batches
if paddle.rand([1]) > self.perturb_prob: if paddle.rand([1]) > self.perturb_prob:
return waveform.clone() return waveform.clone()
# Perform a random perturbation # Perform a random perturbation
self.samp_index = paddle.randint(len(self.speeds), shape=(1, ))[0] self.samp_index = paddle.randint(len(self.speeds), shape=(1, ))[0]
...@@ -456,10 +465,6 @@ class DropFreq(nn.Layer): ...@@ -456,10 +465,6 @@ class DropFreq(nn.Layer):
high=self.drop_count_high + 1, high=self.drop_count_high + 1,
shape=(1, ), ) shape=(1, ), )
# Pick a frequency to drop
drop_range = self.drop_freq_high - self.drop_freq_low
drop_frequency = (
paddle.rand(drop_count) * drop_range + self.drop_freq_low)
# Filter parameters # Filter parameters
filter_length = 101 filter_length = 101
pad = filter_length // 2 pad = filter_length // 2
...@@ -467,13 +472,19 @@ class DropFreq(nn.Layer): ...@@ -467,13 +472,19 @@ class DropFreq(nn.Layer):
# Start with delta function # Start with delta function
drop_filter = paddle.zeros([1, filter_length, 1]) drop_filter = paddle.zeros([1, filter_length, 1])
drop_filter[0, pad, 0] = 1 drop_filter[0, pad, 0] = 1
# Subtract each frequency
for frequency in drop_frequency: if drop_count.shape == 0:
notch_kernel = notch_filter( # Pick a frequency to drop
frequency, drop_range = self.drop_freq_high - self.drop_freq_low
filter_length, drop_frequency = (
self.drop_width, ) paddle.rand(drop_count) * drop_range + self.drop_freq_low)
drop_filter = convolve1d(drop_filter, notch_kernel, pad) # Subtract each frequency
for frequency in drop_frequency:
notch_kernel = notch_filter(
frequency,
filter_length,
self.drop_width, )
drop_filter = convolve1d(drop_filter, notch_kernel, pad)
# Apply filter # Apply filter
dropped_waveform = convolve1d(dropped_waveform, drop_filter, pad) dropped_waveform = convolve1d(dropped_waveform, drop_filter, pad)
...@@ -736,8 +747,7 @@ class SpecAugment(paddle.nn.Layer): ...@@ -736,8 +747,7 @@ class SpecAugment(paddle.nn.Layer):
# compute center and corresponding window # compute center and corresponding window
c = paddle.randint(window, time - window, (1, ))[0] c = paddle.randint(window, time - window, (1, ))[0]
w = paddle.randint(c - window, c + window, (1, ))[0] + 1 w = paddle.randint(c - window, c + window, (1, ))[0] + 1
# c = 5
# w = 10
left = paddle.nn.functional.interpolate( left = paddle.nn.functional.interpolate(
x[:, :, :c], x[:, :, :c],
(w, x.shape[3]), (w, x.shape[3]),
......
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. # Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
# #
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License. # you may not use this file except in compliance with the License.
...@@ -57,18 +57,24 @@ class Wav2vec2ASR(nn.Layer): ...@@ -57,18 +57,24 @@ class Wav2vec2ASR(nn.Layer):
def forward(self, wav, wavs_lens_rate, target, target_lens): def forward(self, wav, wavs_lens_rate, target, target_lens):
if self.normalize_wav: if self.normalize_wav:
wav = F.layer_norm(wav, wav.shape) wav = F.layer_norm(wav, wav.shape)
# Extract wav2vec output # Extract wav2vec output
out = self.wav2vec2(wav)[0] out = self.wav2vec2(wav)[0]
# We normalize the output if required # We normalize the output if required
if self.output_norm: if self.output_norm:
out = F.layer_norm(out, out.shape) out = F.layer_norm(out, out.shape)
if self.train and hasattr(self.config, 'spec_augment'):
if self.training and hasattr(self.config, 'spec_augment'):
feats = self.spec_augment(out) feats = self.spec_augment(out)
else: else:
feats = out feats = out
x = self.enc(feats) x = self.enc(feats)
x_lens = (wavs_lens_rate * x.shape[1]).round().astype(paddle.int64) x_lens = (wavs_lens_rate * x.shape[1]).round().astype(paddle.int64)
ctc_loss = self.ctc(x, x_lens, target, target_lens) ctc_loss = self.ctc(x, x_lens, target, target_lens)
return ctc_loss return ctc_loss
@paddle.no_grad() @paddle.no_grad()
...@@ -77,50 +83,53 @@ class Wav2vec2ASR(nn.Layer): ...@@ -77,50 +83,53 @@ class Wav2vec2ASR(nn.Layer):
text_feature: Dict[str, int], text_feature: Dict[str, int],
decoding_method: str, decoding_method: str,
beam_size: int, beam_size: int,
tokenizer: str=None): sb_pipeline=False):
batch_size = feats.shape[0] batch_size = feats.shape[0]
if decoding_method == 'ctc_prefix_beam_search' and batch_size > 1: if decoding_method == 'ctc_prefix_beam_search' and batch_size > 1:
raise ValueError( logger.error(
f"decoding mode {decoding_method} must be running with batch_size == 1" f"decoding mode {decoding_method} must be running with batch_size == 1"
) )
logger.error(f"current batch_size is {batch_size}")
if decoding_method == 'ctc_greedy_search': if decoding_method == 'ctc_greedy_search':
if tokenizer is None: if not sb_pipeline:
hyps = self.ctc_greedy_search(feats) hyps = self.ctc_greedy_search(feats)
res = [text_feature.defeaturize(hyp) for hyp in hyps] res = [text_feature.defeaturize(hyp) for hyp in hyps]
res_tokenids = [hyp for hyp in hyps] res_tokenids = [hyp for hyp in hyps]
else: else:
hyps = self.ctc_greedy_search(feats) hyps = self.ctc_greedy_search(feats.unsqueeze(-1))
res = [] res = []
res_tokenids = [] res_tokenids = []
for sequence in hyps: for sequence in hyps:
# Decode token terms to words # Decode token terms to words
predicted_tokens = text_feature.convert_ids_to_tokens( predicted_tokens = text_feature.convert_ids_to_tokens(
sequence) sequence)
tmp_res = [] tmp_res = []
tmp_res_tokenids = [] tmp_res_tokenids = []
for c in predicted_tokens: for c in predicted_tokens:
if c == "[CLS]": if c == "[CLS]":
continue continue
elif c == "[SEP]" or c == "[PAD]": elif c == "[SEP]" or c == "[PAD]":
break break
else: else:
tmp_res.append(c) tmp_res.append(c)
tmp_res_tokenids.append(text_feature.vocab[c]) tmp_res_tokenids.append(text_feature.vocab[c])
res.append(''.join(tmp_res)) res.append(''.join(tmp_res))
res_tokenids.append(tmp_res_tokenids) res_tokenids.append(tmp_res_tokenids)
# ctc_prefix_beam_search and attention_rescoring only return one # ctc_prefix_beam_search and attention_rescoring only return one
# result in List[int], change it to List[List[int]] for compatible # result in List[int], change it to List[List[int]] for compatible
# with other batch decoding mode # with other batch decoding mode
elif decoding_method == 'ctc_prefix_beam_search': elif decoding_method == 'ctc_prefix_beam_search':
assert feats.shape[0] == 1 assert feats.shape[0] == 1
if tokenizer is None: if not sb_pipeline:
hyp = self.ctc_prefix_beam_search(feats, beam_size) hyp = self.ctc_prefix_beam_search(feats, beam_size)
res = [text_feature.defeaturize(hyp)] res = [text_feature.defeaturize(hyp)]
res_tokenids = [hyp] res_tokenids = [hyp]
else: else:
hyp = self.ctc_prefix_beam_search(feats, beam_size) hyp = self.ctc_prefix_beam_search(
feats.unsqueeze(-1), beam_size)
res = [] res = []
res_tokenids = [] res_tokenids = []
predicted_tokens = text_feature.convert_ids_to_tokens(hyp) predicted_tokens = text_feature.convert_ids_to_tokens(hyp)
...@@ -290,13 +299,10 @@ class Wav2vec2Base(nn.Layer): ...@@ -290,13 +299,10 @@ class Wav2vec2Base(nn.Layer):
@classmethod @classmethod
def from_config(cls, configs: dict): def from_config(cls, configs: dict):
"""init model. """init model.
Args: Args:
configs (dict): config dict. configs (dict): config dict.
Raises: Raises:
ValueError: raise when using not support encoder type. ValueError: raise when using not support encoder type.
Returns: Returns:
nn.Layer: Wav2Vec2Base nn.Layer: Wav2Vec2Base
""" """
......
@@ -67,6 +67,8 @@ base = [
"pyyaml",
"paddleslim>=2.3.4",
"paddleaudio>=1.1.0",
"hyperpyyaml",
"transformers",
]
server = ["pattern_singleton", "websockets"]