Commit 0defc658 authored by Hui Zhang

update aishell/librispeech transformer result; wenetspeech pretrain conformer result

Parent a7858551
@@ -19,3 +19,13 @@ Need set `decoding.decoding_chunk_size=16` when decoding.
| conformer | 47.06M | conf/chunk_conformer.yaml | spec_aug + shift | test | ctc_greedy_search | 16, -1 | - | 0.070806 |
| conformer | 47.06M | conf/chunk_conformer.yaml | spec_aug + shift | test | ctc_prefix_beam_search | 16, -1 | - | 0.070739 |
| conformer | 47.06M | conf/chunk_conformer.yaml | spec_aug + shift | test | attention_rescoring | 16, -1 | - | 0.059400 |
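The `16, -1` column above is the `(decoding_chunk_size, num_decoding_left_chunks)` pair used for streaming decoding. For rough orientation, a sketch of the audio span one decoding chunk covers, assuming the usual 4x convolutional subsampling and 10 ms frame shift (both assumptions about the recipe, not stated in this commit):

```python
# Rough sketch: audio covered by one streaming decoding chunk.
# Assumed (not from this commit): 4x subsampling, 10 ms frame shift.
decoding_chunk_size = 16  # encoder frames per chunk, as in the table above
subsampling = 4           # assumed conv subsampling factor
frame_shift_ms = 10       # assumed input frame shift

chunk_ms = decoding_chunk_size * subsampling * frame_shift_ms
print(f"one decoding chunk covers about {chunk_ms} ms of audio")  # 640 ms
```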
## Transformer
| Model | Params | Config | Augmentation | Test set | Decode method | Loss | WER |
| --- | --- | --- | --- | --- | --- | --- | --- |
| transformer | 31.95M | conf/transformer.yaml | spec_aug | test | attention | 3.858648955821991 | 0.057293 |
| transformer | 31.95M | conf/transformer.yaml | spec_aug | test | ctc_greedy_search | 3.858648955821991 | 0.061837 |
| transformer | 31.95M | conf/transformer.yaml | spec_aug | test | ctc_prefix_beam_search | 3.858648955821991 | 0.061685 |
| transformer | 31.95M | conf/transformer.yaml | spec_aug | test | attention_rescoring | 3.858648955821991 | 0.053844 |
\ No newline at end of file
@@ -73,7 +73,7 @@ model:
training:
-  n_epoch: 240
+  n_epoch: 120
  accum_grad: 2
  global_grad_clip: 5.0
  optim: adam
......
@@ -23,8 +23,6 @@ fi
# exit 1
#fi
for type in attention_rescoring; do
    echo "decoding ${type}"
    batch_size=1
......
@@ -21,7 +21,7 @@
## Transformer
| Model | Params | Config | Augmentation | Test set | Decode method | Loss | WER |
| --- | --- | --- | --- | --- | --- | --- | --- |
-| transformer | 32.52 M | conf/transformer.yaml | spec_aug + shift | test-clean | attention | 7.404532432556152 | 0.056204 |
-| transformer | 32.52 M | conf/transformer.yaml | spec_aug + shift | test-clean | ctc_greedy_search | 7.404532432556152 | 0.058658 |
-| transformer | 32.52 M | conf/transformer.yaml | spec_aug + shift | test-clean | ctc_prefix_beam_search | 7.404532432556152 | 0.058278 |
-| transformer | 32.52 M | conf/transformer.yaml | spec_aug + shift | test-clean | attention_rescoring | 7.404532432556152 | 0.045591 |
+| transformer | 32.52 M | conf/transformer.yaml | spec_aug | test-clean | attention | 6.805267604192098 | 0.049795 |
+| transformer | 32.52 M | conf/transformer.yaml | spec_aug | test-clean | ctc_greedy_search | 6.805267604192098 | 0.054892 |
+| transformer | 32.52 M | conf/transformer.yaml | spec_aug | test-clean | ctc_prefix_beam_search | 6.805267604192098 | 0.054531 |
+| transformer | 32.52 M | conf/transformer.yaml | spec_aug | test-clean | attention_rescoring | 6.805267604192098 | 0.042244 |
\ No newline at end of file
# [WenetSpeech](https://github.com/wenet-e2e/WenetSpeech)
A 10000+ Hours Multi-domain Chinese Corpus for Speech Recognition
## Description
### Creation
All of the data is collected from YouTube and Podcast. Optical character recognition (OCR) and automatic speech recognition (ASR) techniques are used to label the YouTube and Podcast recordings, respectively. To improve the quality of the corpus, we use a novel end-to-end label error detection method to further validate and filter the data.
### Categories
In summary, WenetSpeech groups all data into 3 categories, as the following table shows:
| Set | Hours | Confidence | Usage |
|------------|-------|-------------|---------------------------------------|
| High Label | 10005 | >=0.95 | Supervised Training |
| Weak Label | 2478  | [0.6, 0.95] | Semi-supervised or noisy training      |
| Unlabel | 9952 | / | Unsupervised training or Pre-training |
| In Total | 22435 | / | All above |
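To make the confidence bands concrete, here is a minimal sketch of how an utterance list could be bucketed into these categories (the `utts` records and field names are made up for illustration; only the thresholds come from the table above):

```python
# Minimal sketch: bucket utterances by label confidence using the
# thresholds from the table above. Records and field names are made up.
utts = [
    {"id": "utt1", "confidence": 0.99},
    {"id": "utt2", "confidence": 0.80},
    {"id": "utt3", "confidence": None},  # no label available
]

def category(utt):
    c = utt["confidence"]
    if c is None:
        return "Unlabel"      # unsupervised training or pre-training
    if c >= 0.95:
        return "High Label"   # supervised training
    if c >= 0.6:
        return "Weak Label"   # semi-supervised or noisy training
    return "excluded"         # below every band in the table

for utt in utts:
    print(utt["id"], category(utt))
```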
### High Label Data
We classify the high-label data into 10 groups according to domain, speaking style, and scenario.
| Domain | YouTube | Podcast | Total |
|-------------|---------|---------|--------|
| audiobook | 0 | 250.9 | 250.9 |
| commentary | 112.6 | 135.7 | 248.3 |
| documentary | 386.7 | 90.5 | 477.2 |
| drama | 4338.2 | 0 | 4338.2 |
| interview | 324.2 | 614 | 938.2 |
| news | 0 | 868 | 868 |
| reading | 0 | 1110.2 | 1110.2 |
| talk | 204 | 90.7 | 294.7 |
| variety | 603.3 | 224.5 | 827.8 |
| others | 144 | 507.5 | 651.5 |
| Total | 6113 | 3892 | 10005 |
As shown in the following table, we provide 3 training subsets, namely `S`, `M` and `L` for building ASR systems on different data scales.
| Training Subsets | Confidence | Hours |
|------------------|-------------|-------|
| L | [0.95, 1.0] | 10005 |
| M | 1.0 | 1000 |
| S | 1.0 | 100 |
### Evaluation Sets
| Evaluation Sets | Hours | Source | Description |
|-----------------|-------|--------------|-----------------------------------------------------------------------------------------|
| DEV             | 20    | Internet     | Specially designed for speech tools which require a cross-validation set in training     |
| TEST\_NET       | 23    | Internet     | Matched test                                                                              |
| TEST\_MEETING   | 15    | Real meeting | Mismatched test: a far-field, conversational, spontaneous meeting dataset                 |
\ No newline at end of file
# WenetSpeech
## Conformer
| Model | Params | Config | Augmentation | Test set | Decode method | Loss | WER |
| --- | --- | --- | --- | --- | --- | --- | --- |
| conformer | 32.52 M | conf/conformer.yaml | spec_aug | dev | attention | | |
| conformer | 32.52 M | conf/conformer.yaml | spec_aug | test net | ctc_greedy_search | | |
| conformer | 32.52 M | conf/conformer.yaml | spec_aug | test meeting | ctc_prefix_beam_search | | |
| conformer | 32.52 M | conf/conformer.yaml | spec_aug | test net | attention_rescoring | | |
## Conformer Pretrained Model
Pretrained model from http://mobvoi-speech-public.ufile.ucloud.cn/public/wenet/wenetspeech/20211025_conformer_exp.tar.gz
| Model | Params | Config | Augmentation | Test set | Decode method | Loss | WER |
| --- | --- | --- | --- | --- | --- | --- | --- |
| conformer | 32.52 M | conf/conformer.yaml | spec_aug | aishell1 | attention | - | 0.048456 |
| conformer | 32.52 M | conf/conformer.yaml | spec_aug | aishell1 | ctc_greedy_search | - | 0.052534 |
| conformer | 32.52 M | conf/conformer.yaml | spec_aug | aishell1 | ctc_prefix_beam_search | - | 0.052915 |
| conformer | 32.52 M | conf/conformer.yaml | spec_aug | aishell1 | attention_rescoring | - | 0.047904 |
\ No newline at end of file
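A minimal sketch for fetching and unpacking the pretrained checkpoint linked above (the URL is from this README; the archive and output directory names are otherwise arbitrary):

```python
# Minimal sketch: download and unpack the pretrained conformer checkpoint
# linked above. Output directory name is arbitrary.
import tarfile
import urllib.request

URL = ("http://mobvoi-speech-public.ufile.ucloud.cn/public/wenet/"
       "wenetspeech/20211025_conformer_exp.tar.gz")
archive = "20211025_conformer_exp.tar.gz"

urllib.request.urlretrieve(URL, archive)
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(path="exp")  # checkpoint files land under ./exp
```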
File mode changed from 100644 to 100755
decode_modes="attention_rescoring ctc_greedy_search"
\ No newline at end of file
#!/bin/bash
if [ $# != 2 ];then
    echo "usage: ${0} config_path ckpt_path_prefix"
    exit -1
fi

ngpu=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
echo "using $ngpu gpus..."

config_path=$1
ckpt_prefix=$2

# chunk-based (streaming) configs are detected from the config file name
chunk_mode=false
if [[ ${config_path} =~ ^.*chunk_.*yaml$ ]];then
    chunk_mode=true
fi

# download language model
#bash local/download_lm_ch.sh
#if [ $? -ne 0 ]; then
#    exit 1
#fi

for type in attention ctc_greedy_search; do
    echo "decoding ${type}"
    if [ ${chunk_mode} == true ];then
        # streaming decoding only supports batch_size=1
        batch_size=1
    else
        batch_size=64
    fi
    output_dir=${ckpt_prefix}
    mkdir -p ${output_dir}
    python3 -u ${BIN_DIR}/test.py \
        --nproc ${ngpu} \
        --config ${config_path} \
        --result_file ${output_dir}/${type}.rsl \
        --checkpoint_path ${ckpt_prefix} \
        --opts decoding.decoding_method ${type} \
        --opts decoding.batch_size ${batch_size}
    if [ $? -ne 0 ]; then
        echo "Failed in evaluation!"
        exit 1
    fi
done

# these decode methods always run with batch_size=1
for type in ctc_prefix_beam_search attention_rescoring; do
    echo "decoding ${type}"
    batch_size=1
    output_dir=${ckpt_prefix}
    mkdir -p ${output_dir}
    python3 -u ${BIN_DIR}/test.py \
        --nproc ${ngpu} \
        --config ${config_path} \
        --result_file ${output_dir}/${type}.rsl \
        --checkpoint_path ${ckpt_prefix} \
        --opts decoding.decoding_method ${type} \
        --opts decoding.batch_size ${batch_size}
    if [ $? -ne 0 ]; then
        echo "Failed in evaluation!"
        exit 1
    fi
done

exit 0
File mode changed from 100644 to 100755
@@ -92,7 +92,9 @@ class TextFeaturizer():
        tokens = self.tokenize(text)
        ids = []
        for token in tokens:
-            token = token if token in self.vocab_dict else self.unk
+            if token not in self.vocab_dict:
+                logger.debug(f"Text Token: {token} -> {self.unk}")
+                token = self.unk
            ids.append(self.vocab_dict[token])
        return ids
......
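The change above only adds a debug log when an out-of-vocabulary token falls back to the unknown token; a standalone sketch of the same featurize step (the vocabulary and input tokens are made up for illustration):

```python
# Standalone sketch of the featurize step above: unknown tokens fall back
# to the <unk> id, now with a debug log. Vocab and input are made up.
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("text_featurizer")

vocab_dict = {"<unk>": 0, "你": 1, "好": 2}
unk = "<unk>"

def featurize(tokens):
    ids = []
    for token in tokens:
        if token not in vocab_dict:
            logger.debug(f"Text Token: {token} -> {unk}")
            token = unk
        ids.append(vocab_dict[token])
    return ids

print(featurize(["你", "好", "吗"]))  # "吗" is OOV -> [1, 2, 0]
```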
@@ -860,7 +860,7 @@ class U2Model(U2DecodeModel):
            int, nn.Layer, nn.Layer, nn.Layer: vocab size, encoder, decoder, ctc
        """
        # cmvn
-        if configs['cmvn_file'] is not None:
+        if 'cmvn_file' in configs and configs['cmvn_file']:
            mean, istd = load_cmvn(configs['cmvn_file'],
                                   configs['cmvn_file_type'])
            global_cmvn = GlobalCMVN(
......
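The new guard tolerates configs where `cmvn_file` is absent entirely, not just present but set to `None`; a minimal sketch of the difference (the config dicts are made up):

```python
# Minimal sketch: why `'cmvn_file' in configs and configs['cmvn_file']`
# is safer than `configs['cmvn_file'] is not None`. Configs are made up.
configs_without_key = {"cmvn_file_type": "json"}  # key missing entirely
configs_with_none = {"cmvn_file": None}           # key present, empty

for configs in (configs_without_key, configs_with_none):
    # The old check would raise KeyError on the first dict:
    # if configs['cmvn_file'] is not None: ...
    if 'cmvn_file' in configs and configs['cmvn_file']:
        print("would load CMVN stats from", configs['cmvn_file'])
    else:
        print("skip global CMVN")
```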