...
 
Commits (25)
- [3b6651ba](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/3b6651ba7cfc7dbfb4c98b8dd33d4cb41cd88ad0) Adding WavLM implementation (jiamingkong <kinetical@live.com>, 2023-05-15T11:36:30+08:00)
- [60bd7f20](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/60bd7f202e3415a938162a0256eea34a73e966cf) Code clean up according to comments in https://github.com/PaddlePaddle/Paddle... (jiamingkong <kinetical@live.com>, 2023-05-22T17:24:05+08:00)
- [9ee1205d](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/9ee1205d2515c6820cf84fd205f61278463209c8) Changed the path for the uploaded weight (jiamingkong <kinetical@live.com>, 2023-05-23T01:48:52+08:00)
- [232dcf86](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/232dcf8660a4ea1fdc7e4aa906d77f187e755f49) Adapted wavlmASR model to pretrained weights and CLI (jiamingkong <kinetical@live.com>, 2023-05-24T21:49:12+08:00)
- [2ea00755](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/2ea00755f783b84b8072cb85d16bba38eeaf116f) Changed the MD5 of the pretrained tar file due to bug fixes (jiamingkong <kinetical@live.com>, 2023-05-25T12:23:21+08:00)
- [927c60a5](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/927c60a5c1a1e1bd241966fd2dd8c490775ab26b) Deleted examples/librispeech/asr5/format_rsl.py (jiamingkong <kinetical@live.com>, 2023-05-25T15:01:46+08:00)
- [3ef28dee](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/3ef28dee4592bd414a5076e262bac6e96ab15348) Merge branch 'PaddlePaddle:develop' into develop (jiamingkong <jiamingkong@users.noreply.github.com>, 2023-05-30T11:23:31+08:00)
- [0e2068e2](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/0e2068e2cfcad5153b4495366ff43ad0f22a9f95) Code clean up for CIs (jiamingkong <kinetical@live.com>, 2023-05-30T11:53:35+08:00)
- [ba874db5](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/ba874db5dc94b92be2ca63dc6163e5e813a518a4) Fixed the transpose usages ignored before (jiamingkong <kinetical@live.com>, 2023-05-30T17:52:02+08:00)
- [0d0535d8](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/0d0535d882894139c5a2534c1bdfdb79418aa7fe) Update setup.py (Hui Zhang <zhtclz@foxmail.com>, 2023-05-31T11:12:55+08:00)
- [c9ddc4f8](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/c9ddc4f832d45917336388cdc3009cfdf5786cfe) refactor mfa scripts (Hui Zhang <zhtclz@foxmail.com>, 2023-05-31T07:07:34+00:00)
- [8432e862](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/8432e8626fed3d07b2e52b9ca22c77577665d040) Final cleaning; Modified SSL/infer.py and README for wavlm inclusion in model... (jiamingkong <kinetical@live.com>, 2023-05-31T15:07:50+08:00)
- [52c7c1ef](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/52c7c1ef6a7373c332a52e006a80e59e630225cc) Merge pull request #3290 from PaddlePaddle/zh794390558-patch-2 (Update setup.py) (Hui Zhang <zhtclz@foxmail.com>, 2023-05-31T15:44:37+08:00)
- [f8b7d767](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/f8b7d76758c1ec8da24dc883b86c8d73f70f9b9d) updating readme and readme_cn (jiamingkong <kinetical@live.com>, 2023-05-31T20:31:00+08:00)
- [2214c0d2](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/2214c0d27c27d2db8b7e99a6a9c948b3f8f5c45f) Merge pull request #3242 from jiamingkong/develop (Adding WavLM implementation) (Hui Zhang <zhtclz@foxmail.com>, 2023-06-01T10:50:21+08:00)
- [2fe97f2e](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/2fe97f2e3a4a54597089d9e7b6291946cdd4f8c1) Merge pull request #3292 from zh794390558/mfa (refactor mfa scripts) (Hui Zhang <zhtclz@foxmail.com>, 2023-06-01T11:40:00+08:00)
- [2a97a4ec](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/2a97a4ecf04f9f80055ddacfe0592604b9c6c75d) remove tsinghua pypi (LixinGuo <18510030324@126.com>, 2023-06-01T14:45:14+08:00)
- [4d0fcc47](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/4d0fcc47e6326e442e82369b4442f0be85836a77) Update setup.py (#3294) (Hui Zhang <zhtclz@foxmail.com>, 2023-06-01T18:34:07+08:00)
- [24aac399](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/24aac399abd77eb113e414e2e42c966fbbfda171) Update setup.py (Hui Zhang <zhtclz@foxmail.com>, 2023-06-01T18:50:30+08:00)
- [a6531cf4](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/a6531cf483e79d5e4cfd5c22a17c985288a22737) Merge pull request #3300 from PaddlePaddle/zh794390558-patch-1 (Update paddleaudio setup.py version) (Hui Zhang <zhtclz@foxmail.com>, 2023-06-01T18:50:56+08:00)
- [6e7c71b2](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/6e7c71b26c2f8579ebb15570f1bc86ac6b0c7fa5) refactor rhy (Hui Zhang <zhtclz@foxmail.com>, 2023-06-01T10:54:50+00:00)
- [9f42a316](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/9f42a316545d56b3066e79e3e1895076d071591a) Merge pull request #3298 from fightfat/develop (remove tsinghua pypi) (Hui Zhang <zhtclz@foxmail.com>, 2023-06-01T18:56:49+08:00)
- [d9ee4549](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/d9ee4549d600309b7d34d9ca4b0e29367b8ebaea) Merge pull request #3301 from zh794390558/rhy ([t2s] refactor Prosody prediction) (Hui Zhang <zhtclz@foxmail.com>, 2023-06-01T19:33:04+08:00)
- [2376c14d](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/2376c14d7bd5a06d19e859778da5b34fd69cb030) fix ckpt (Hui Zhang <zhtclz@foxmail.com>, 2023-06-01T11:35:33+00:00)
- [bae905c9](https://gitcode.net/paddlepaddle/DeepSpeech/-/commit/bae905c9bdb5c942e6c67e26f8acf8b10be90a87) Merge pull request #3303 from zh794390558/rhy ([t2s] fix rhy test with specific ckpt) (Hui Zhang <zhtclz@foxmail.com>, 2023-06-01T19:36:33+08:00)
......@@ -178,6 +178,7 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
- 🧩 *Cascaded models application*: as an extension of the typical traditional audio tasks, we combine the workflows of the aforementioned tasks with other fields like Natural language processing (NLP) and Computer Vision (CV).
### Recent Update
- 👑 2023.05.31: Add [WavLM ASR-en](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/librispeech/asr5), WavLM fine-tuning for ASR on LibriSpeech.
- 👑 2023.05.04: Add [HuBERT ASR-en](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/librispeech/asr4), HuBERT fine-tuning for ASR on LibriSpeech.
- ⚡ 2023.04.28: Fix [0-d tensor](https://github.com/PaddlePaddle/PaddleSpeech/pull/3214); with the upgrade of paddlepaddle==2.5, the issues with modifying 0-d tensors have been resolved.
- 👑 2023.04.25: Add [AMP for U2 conformer](https://github.com/PaddlePaddle/PaddleSpeech/pull/3167).
......
......@@ -183,6 +183,8 @@
- 🧩 Cascaded models application: as an extension of traditional speech tasks, we combine natural language processing, computer vision, and other tasks to build industrial-grade applications closer to real-world needs.
### Recent Update
- 👑 2023.05.31: Add [WavLM ASR-en](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/librispeech/asr5), WavLM fine-tuning for English ASR on the LibriSpeech dataset.
- 👑 2023.05.04: Add [HuBERT ASR-en](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/librispeech/asr4), HuBERT fine-tuning for English ASR on the LibriSpeech dataset.
- ⚡ 2023.04.28: Fix [0-d tensor](https://github.com/PaddlePaddle/PaddleSpeech/pull/3214); with the upgrade of paddlepaddle==2.5, the issues with modifying 0-d tensors have been resolved.
- 👑 2023.04.25: Add [AMP training for U2 conformer](https://github.com/PaddlePaddle/PaddleSpeech/pull/3167).
- 👑 2023.04.06: Add [SRT subtitle generation](./demos/streaming_asr_server).
......
......@@ -34,12 +34,12 @@ from tools import setup_helpers
ROOT_DIR = Path(__file__).parent.resolve()
VERSION = '1.1.0'
VERSION = '1.2.0'
COMMITID = 'none'
base = [
"kaldiio",
"librosa==0.8.1",
"librosa>=0.10.0",
"pathos",
"pybind11",
"parameterized",
......
......@@ -36,7 +36,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
```
Arguments:
- `input`(required): Audio file to recognize.
- `model`: Model type of the ASR task. Default: `wav2vec2`, choices: [wav2vec2, hubert].
- `model`: Model type of the ASR task. Default: `wav2vec2`, choices: [wav2vec2, hubert, wavlm].
- `task`: Output type. Default: `asr`.
- `lang`: Model language. Default: `en`.
- `sample_rate`: Sample rate of the model. Default: `16000`.
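A typical invocation with the new `wavlm` option might look like the following (a sketch; it assumes the `paddlespeech ssl` CLI entry point and the `en.wav` file fetched above):
```bash
paddlespeech ssl --task asr --model wavlm --lang en --sample_rate 16000 --input ./en.wav
```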
......
......@@ -36,7 +36,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
```
Arguments:
- `input` (required): Audio file to recognize.
- `model`: Model type of the ASR task. Default: `wav2vec2`, choices: [wav2vec2, hubert].
- `model`: Model type of the ASR task. Default: `wav2vec2`, choices: [wav2vec2, hubert, wavlm].
- `task`: Output type. Default: `asr`.
- `lang`: Model language. Default: `en`.
- `sample_rate`: Sample rate of the model. Default: `16000`.
......
# WavLMASR with LibriSpeech
This example contains code used to fine-tune the [WavLM](https://arxiv.org/abs/2110.13900) model on the [LibriSpeech dataset](http://www.openslr.org/resources/12).
## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function |
|:---- |:----------------------------------------------------------- |
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Calculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development and test dataset<br> (5) Download the pretrained wavlm model |
| 1 | Train the model |
| 2 | Get the final model by averaging the top-k models; setting k = 1 selects the single best model |
| 3 | Test the final model performance |
| 4 | Infer the single audio file |
You can choose to run a range of stages by setting `stage` and `stop_stage`.
For example, if you want to execute the code in stage 2 and stage 3, you can run this script:
```bash
bash run.sh --stage 2 --stop_stage 3
```
Or you can set `stage` equal to `stop_stage` to run only one stage.
For example, if you only want to run `stage 0`, you can use the script below:
```bash
bash run.sh --stage 0 --stop_stage 0
```
The document below will describe the scripts in `run.sh` in detail.
## The Environment Variables
`path.sh` contains the environment variables.
```bash
. ./path.sh
. ./cmd.sh
```
This script needs to be run first. Another script is also needed:
```bash
source ${MAIN_ROOT}/utils/parse_options.sh
```
It enables the `--variable value` style of passing options to the shell scripts.
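Concretely, any variable assigned before the `source` line can then be overridden from the command line; a minimal sketch of the behavior (the script name and variable are illustrative):
```bash
#!/bin/bash
stage=0                                     # default value
source ${MAIN_ROOT}/utils/parse_options.sh  # rewrites stage if --stage N was passed
echo "stage=${stage}"                       # `bash demo.sh --stage 2` prints stage=2
```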
## The Local Variables
Some local variables are set in `run.sh`.
`gpus` denotes the GPU IDs you want to use. If you set `gpus=` (empty), only the CPU is used.
`stage` denotes the stage you want to start from in the experiments.
`stop_stage` denotes the stage you want to end at in the experiments.
`conf_path` denotes the config path of the model.
`avg_num` denotes the number K of top-K models you want to average to get the final model.
`audio_file` denotes the path of the single audio file you want to infer in stage 4.
`ckpt` denotes the checkpoint prefix of the model, e.g. "WavLMASR".
You can set the local variables (except `ckpt`) when you use `run.sh`.
For example, you can set the `gpus` and `avg_num` when you use the command line:
```bash
bash run.sh --gpus 0,1 --avg_num 20
```
## Stage 0: Data Processing
To use this example, you need to process the data first, and you can use stage 0 in `run.sh` to do this. The code is shown below:
```bash
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
bash ./local/data.sh || exit -1
fi
```
Stage 0 is for processing the data. If you only want to process the data, you can run:
```bash
bash run.sh --stage 0 --stop_stage 0
```
You can also just run these scripts in your command line.
```bash
. ./path.sh
. ./cmd.sh
bash ./local/data.sh
```
After processing the data, the `data` directory will look like this:
```bash
data/
|-- dev.meta
|-- lang_char
|   |-- bpe_unigram_5000.model
|   |-- bpe_unigram_5000.vocab
|   `-- vocab.txt
|-- manifest.dev
|-- manifest.dev.raw
|-- manifest.test
|-- manifest.test.raw
|-- manifest.train
|-- manifest.train.raw
|-- mean_std.json
|-- test.meta
`-- train.meta
```
Stage 0 also downloads the pre-trained [wavlm](https://paddlespeech.bj.bcebos.com/wavlm/wavlm-base-plus.pdparams) model.
```bash
mkdir -p exp/wavlm
wget -P exp/wavlm https://paddlespeech.bj.bcebos.com/wavlm/wavlm-base-plus.pdparams
```
## Stage 1: Model Training
If you want to train the model, you can use stage 1 in `run.sh`. The code is shown below:
```bash
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `exp` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${ckpt}
fi
```
If you want to train the model, you can use the script below to execute stage 0 and stage 1:
```bash
bash run.sh --stage 0 --stop_stage 1
```
or you can run these scripts in the command line (using the CPU only):
```bash
. ./path.sh
. ./cmd.sh
bash ./local/data.sh
CUDA_VISIBLE_DEVICES= ./local/train.sh conf/wavlmASR.yaml wavlmASR
```
## Stage 2: Top-k Models Averaging
After training the model, we need to get the final model for testing and inference. A model checkpoint is saved every epoch, so we can either choose the best checkpoint based on the validation loss, or sort the checkpoints and average the parameters of the top-k ones to get the final model. We can use stage 2 to do this, and the code is shown below. Note: we only train one epoch for WavLMASR, thus `avg_num` is set to 1.
```bash
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# avg n best model
avg.sh best exp/${ckpt}/checkpoints ${avg_num}
fi
```
The `avg.sh` script is in `../../../utils/`, which is defined in `path.sh`.
If you want to get the final model, you can use the script below to execute stage 0, stage 1, and stage 2:
```bash
bash run.sh --stage 0 --stop_stage 2
```
or you can run these scripts in the command line (using the CPU only):
```bash
. ./path.sh
. ./cmd.sh
bash ./local/data.sh
CUDA_VISIBLE_DEVICES= ./local/train.sh conf/wavlmASR.yaml wavlmASR
avg.sh best exp/wavlmASR/checkpoints 1
```
## Stage 3: Model Testing
The test stage evaluates the model performance. The code of the test stage is shown below:
```bash
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# test ckpt avg_n
CUDA_VISIBLE_DEVICES=0 ./local/test.sh ${conf_path} ${decode_conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} || exit -1
fi
```
If you want to train a model and test it, you can use the script below to execute stage 0, stage 1, stage 2, and stage 3:
```bash
bash run.sh --stage 0 --stop_stage 3
```
or you can run these scripts in the command line (using the CPU only):
```bash
. ./path.sh
. ./cmd.sh
bash ./local/data.sh
CUDA_VISIBLE_DEVICES= ./local/train.sh conf/wavlmASR.yaml wavlmASR
avg.sh best exp/wavlmASR/checkpoints 1
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/wavlmASR.yaml conf/tuning/decode.yaml exp/wavlmASR/checkpoints/avg_1
```
## Pretrained Model
You can get the pretrained WavLMASR model from [this page](../../../docs/source/released_model.md).
Use `tar` to unpack the model, and then you can use the scripts to test it.
For example:
```bash
wget https://paddlespeech.bj.bcebos.com/wavlm/wavlmASR-base-100h-librispeech_ckpt_1.4.0.model.tar.gz
tar xzvf wavlmASR-base-100h-librispeech_ckpt_1.4.0.model.tar.gz
source path.sh
# If you have processed the data and generated the manifest files, you can skip the following 2 steps
bash local/data.sh --stage -1 --stop_stage -1
bash local/data.sh --stage 2 --stop_stage 2
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/wavlmASR.yaml conf/tuning/decode.yaml exp/wavlmASR/checkpoints/avg_1
```
The performance of the released models is shown [here](./RESULTS.md).
## Stage 4: Single Audio File Inference
In some situations, you may want to use the trained model to run inference on a single audio file. You can use stage 4. The code is shown below:
```bash
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# test a single .wav file
CUDA_VISIBLE_DEVICES=0 ./local/test_wav.sh ${conf_path} ${decode_conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} ${audio_file} || exit -1
fi
```
You can train the model yourself using `bash run.sh --stage 0 --stop_stage 3`, or you can download the pretrained model through the script below:
```bash
wget https://paddlespeech.bj.bcebos.com/wavlm/wavlm_baseplus_libriclean_100h.tar.gz
tar xzvf wavlm_baseplus_libriclean_100h.tar.gz
```
You can download the audio demo:
```bash
wget -nc https://paddlespeech.bj.bcebos.com/datasets/single_wav/en/demo_002_en.wav -P data/
```
You need to prepare an audio file or use the audio demo above; please confirm that the sample rate of the audio is 16 kHz. You can get the result for the audio demo by running the script below:
```bash
CUDA_VISIBLE_DEVICES= ./local/test_wav.sh conf/wavlmASR.yaml conf/tuning/decode.yaml exp/wavlmASR/checkpoints/avg_1 data/demo_002_en.wav
```
# LibriSpeech
## WavLMASR
Fine-tuning on train-clean-100.
Train: epoch 16, 4 × A800-80G, batch size: 16, accum_grad: 8.
| Model | Params | Config | Augmentation| Test set | Decode method | WER |
| --- | --- | --- | --- | --- | --- | --- |
| WavLMASR | 326.16M | conf/wavlmasr.yaml | spec_aug | test-clean | greedy search | 0.0561 |
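The WER above is produced by the greedy-search decode in `local/test.sh` (listed later on this page), which scores the hypotheses against the reference transcripts with `compute_wer.py`:
```bash
python3 compute_wer.py --char=1 --v=1 \
    data/manifest.test-clean.text exp/wavlmASR/checkpoints/avg_1.ctc_greedy_search.rsl.text \
    > exp/wavlmASR/checkpoints/avg_1.ctc_greedy_search.error
```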
#! /usr/bin/env bash
if [ $# != 3 ]; then
echo "usage: ${0} [best|latest] ckpt_dir avg_num"
exit -1
fi
avg_mode=${1} # best,latest
ckpt_dir=${2}
average_num=${3}
decode_checkpoint=${ckpt_dir}/avg_${average_num}.pdparams
if [ $avg_mode == best ];then
# best
python avg_model.py \
--dst_model ${decode_checkpoint} \
--ckpt_dir ${ckpt_dir} \
--num ${average_num} \
--val_best
else
# latest
python avg_model.py \
--dst_model ${decode_checkpoint} \
--ckpt_dir ${ckpt_dir} \
--num ${average_num}
fi
if [ $? -ne 0 ]; then
echo "Failed in avg ckpt!"
exit 1
fi
exit 0
# ====== About run.pl, queue.pl, slurm.pl, and ssh.pl ======
# Usage: <cmd>.pl [options] JOB=1:<nj> <log> <command...>
# e.g.
# run.pl --mem 4G JOB=1:10 echo.JOB.log echo JOB
#
# Options:
# --time <time>: Limit the maximum time to execute.
# --mem <mem>: Limit the maximum memory usage.
# --max-jobs-run <njob>: Limit the number of parallel jobs. This is ignored for non-array jobs.
# --num-threads <nthreads>: Specify the number of CPU cores.
# --gpu <ngpu>: Specify the number of GPU devices.
# --config: Change the configuration file from default.
#
# "JOB=1:10" is used for "array jobs" and it can control the number of parallel jobs.
# The left string of "=", i.e. "JOB", is replaced by <N> (the N-th job) in the command and the log file name,
# e.g. "echo JOB" is changed to "echo 3" for the 3rd job and "echo 8" for the 8th job respectively.
# Note that the range must start from a positive number, so you can't use "JOB=0:10", for example.
#
# run.pl, queue.pl, slurm.pl, and ssh.pl have a unified interface that does not depend on the backend.
# These options are mapped to backend-specific options, which
# are configured by "conf/queue.conf" and "conf/slurm.conf" by default.
# If jobs fail, your configuration might be wrong for your environment.
#
#
# The official documentation for run.pl, queue.pl, slurm.pl, and ssh.pl:
# "Parallelization in Kaldi": http://kaldi-asr.org/doc/queue.html
# =========================================================
# Select the backend used by run.sh from "local", "sge", "slurm", or "ssh"
cmd_backend='local'
# Local machine, without any Job scheduling system
if [ "${cmd_backend}" = local ]; then
# The other usage
export train_cmd="run.pl"
# Used for "*_train.py": "--gpu" is appended optionally by run.sh
export cuda_cmd="run.pl"
# Used for "*_recog.py"
export decode_cmd="run.pl"
# "qsub" (SGE, Torque, PBS, etc.)
elif [ "${cmd_backend}" = sge ]; then
# The default setting is written in conf/queue.conf.
# You must change "-q g.q" for the "queue" for your environment.
# To know the "queue" names, type "qhost -q"
# Note that to use "--gpu *", you have to setup "complex_value" for the system scheduler.
export train_cmd="queue.pl"
export cuda_cmd="queue.pl"
export decode_cmd="queue.pl"
# "sbatch" (Slurm)
elif [ "${cmd_backend}" = slurm ]; then
# The default setting is written in conf/slurm.conf.
# You must change "-p cpu" and "-p gpu" for the "partion" for your environment.
# To know the "partion" names, type "sinfo".
# You can use "--gpu * " by default for slurm and it is interpreted as "--gres gpu:*"
# The devices are allocated exclusively using "${CUDA_VISIBLE_DEVICES}".
export train_cmd="slurm.pl"
export cuda_cmd="slurm.pl"
export decode_cmd="slurm.pl"
elif [ "${cmd_backend}" = ssh ]; then
# You have to create ".queue/machines" to specify the host to execute jobs.
# e.g. .queue/machines
# host1
# host2
# host3
# It assumes you can log in to them without a password, i.e. you have to set up SSH keys.
export train_cmd="ssh.pl"
export cuda_cmd="ssh.pl"
export decode_cmd="ssh.pl"
# This is an example of specifying several unique options in the JHU CLSP cluster setup.
# Users can modify/add their own command options according to their cluster environments.
elif [ "${cmd_backend}" = jhu ]; then
export train_cmd="queue.pl --mem 2G"
export cuda_cmd="queue-freegpu.pl --mem 2G --gpu 1 --config conf/gpu.conf"
export decode_cmd="queue.pl --mem 4G"
else
echo "$0: Error: Unknown cmd_backend=${cmd_backend}" 1>&2
return 1
fi
(This diff is collapsed.)
process:
# use raw audio
- type: wav_process
{
"do_normalize": true,
"feature_extractor_type": "Wav2Vec2FeatureExtractor",
"feature_size": 1,
"padding_side": "right",
"padding_value": 0,
"return_attention_mask": true,
"sampling_rate": 16000
}
decode_batch_size: 1
error_rate_type: wer
decoding_method: "ctc_greedy_search" # 'ctc_greedy_search', 'ctc_prefix_beam_search'
beam_size: 10
############################################
# Network Architecture #
############################################
freeze_wavlm: False
normalize_wav: True
output_norm: True
init_type: kaiming_uniform # !Warning: needed for convergence
enc:
input_shape: 768
dnn_blocks: 2
dnn_neurons: 768
activation: True
normalization: True
dropout_rate: [0.15, 0]
ctc:
enc_n_units: 768
blank_id: 0
dropout_rate: 0.0
wavlm_params_path: exp/wavlm/wavlm-base-plus.pdparams
task_cfg:
label_rate: 50.0
sample_rate: 16000
normalize: True
enable_padding: False
max_keep_size: None
max_sample_size: 250000
min_sample_size: 32000
dropout_input: 0.1
final_dropout: 0.0
dropout: 0.1
attention_dropout: 0.0
activation_dropout: 0.1
apply_mask: True
mask_length: 10
mask_prob: 0.5
mask_selection: static
mask_other: 0.0
no_mask_overlap: False
mask_channel_length: 10
mask_channel_prob: 0.0
mask_channel_selection: static
mask_channel_other: 0.0
no_mask_channel_overlap: False
feature_grad_mult: 0.0
layerdrop: 0.1
fp16: True
extractor_mode: layer_norm
encoder_layers: 12
encoder_embed_dim: 768
encoder_ffn_embed_dim: 3072
encoder_attention_heads: 12
activation_fn: gelu
encoder_layerdrop: 0.0
dropout_features: 0.0
final_dim: 768
untie_final_proj: True
layer_norm_first: True
conv_feature_layers: "[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2"
conv_bias: False
logit_temp: 0.1
target_glu: False
mask_min_space: 1
mask_channel_min_space: 1
conv_pos: 128
conv_pos_groups: 16
latent_temp: [2.0, 0.5, 0.999995]
skip_masked: False
skip_nomask: True
###########################################
# Data #
###########################################
train_manifest: data/manifest.train
dev_manifest: data/manifest.dev
test_manifest: data/manifest.test-clean
###########################################
# Dataloader #
###########################################
vocab_filepath: data/lang_char/vocab.txt
unit_type: char
mean_std_filepath: ""
preprocess_config: conf/preprocess.yaml
sortagrad: 0 # Feed samples from shortest to longest; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
batch_size: 8 # Different batch_size may cause large differences in results
maxlen_in: 51200000000 # if input length > maxlen_in, the batch size is automatically reduced
maxlen_out: 160000
minibatches: 0 # for debug
batch_count: auto
batch_bins: 0
batch_frames_in: 0
batch_frames_out: 0
batch_frames_inout: 0
num_workers: 0
subsampling_factor: 1
num_encs: 1
dist_sampler: True
shortest_first: False
return_lens_rate: True
############################################
# Data Augmentation #
############################################
audio_augment: # for raw audio
sample_rate: 16000
speeds: [90, 100, 110]
###########################################
# Training #
###########################################
n_epoch: 10
accum_grad: 8
global_grad_clip: 5.0
model_scheduler: newbobscheduler
model_scheduler_conf:
improvement_threshold: 0.0025
annealing_factor: 0.8
patient: 0
model_optim: adam
model_optim_conf:
lr: 0.0001
weight_decay: 0.0
wavlm_optim: adam
wavlm_optim_conf:
lr: 0.00005
weight_decay: 0.0
wavlm_scheduler: constantlr
wavlm_scheduler_conf:
warmup_steps: 1000
lr_decay: 1.0
log_interval: 1
checkpoint:
kbest_n: 50
latest_n: 5
#!/bin/bash
stage=-1
stop_stage=100
unit_type=char
dict_dir=data/lang_char
source ${MAIN_ROOT}/utils/parse_options.sh
mkdir -p data
mkdir -p ${dict_dir}
TARGET_DIR=${MAIN_ROOT}/dataset
mkdir -p ${TARGET_DIR}
if [ ${stage} -le -1 ] && [ ${stop_stage} -ge -1 ]; then
# download data, generate manifests
python3 ${TARGET_DIR}/librispeech/librispeech.py \
--manifest_prefix="data/manifest" \
--target_dir="${TARGET_DIR}/librispeech" \
--full_download="False"
if [ $? -ne 0 ]; then
echo "Prepare LibriSpeech failed. Terminated."
exit 1
fi
for set in train-clean-100 dev-clean test-clean; do
mv data/manifest.${set} data/manifest.${set}.raw
done
rm -rf data/manifest.train.raw data/manifest.dev.raw data/manifest.test.raw
for set in train-clean-100; do
cat data/manifest.${set}.raw >> data/manifest.train.raw
done
for set in dev-clean; do
cat data/manifest.${set}.raw >> data/manifest.dev.raw
done
for set in test-clean; do
cat data/manifest.${set}.raw >> data/manifest.test.raw
done
fi
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# compute mean and stddev for normalizer
num_workers=$(nproc)
python ${MAIN_ROOT}/utils/compute_mean_std.py \
--manifest_path="data/manifest.train.raw" \
--num_samples=2000 \
--spectrum_type="fbank" \
--feat_dim=161 \
--delta_delta=false \
--sample_rate=16000 \
--stride_ms=10 \
--window_ms=25 \
--use_dB_normalization=False \
--num_workers=${num_workers} \
--output_path="data/mean_std.json"
if [ $? -ne 0 ]; then
echo "Compute mean and stddev failed. Terminated."
exit 1
fi
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# build vocabulary
python3 ${MAIN_ROOT}/utils/build_vocab.py \
--unit_type ${unit_type} \
--count_threshold=0 \
--vocab_path="${dict_dir}/vocab.txt" \
--manifest_paths="data/manifest.train.raw"
if [ $? -ne 0 ]; then
echo "Build vocabulary failed. Terminated."
exit 1
fi
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# format manifest with tokenids, vocab size
for set in train dev test dev-clean test-clean; do
{
python3 ${MAIN_ROOT}/utils/format_data.py \
--cmvn_path "data/mean_std.json" \
--unit_type ${unit_type} \
--vocab_path="${dict_dir}/vocab.txt" \
--manifest_path="data/manifest.${set}.raw" \
--output_path="data/manifest.${set}"
if [ $? -ne 0 ]; then
echo "Formt manifest.${set} failed. Terminated."
exit 1
fi
}&
done
wait
fi
echo "LibriSpeech Data preparation done."
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
mkdir -p exp/wavlm
echo "Pretrained wavlm model download"
wget -P exp/wavlm https://paddlespeech.bj.bcebos.com/wavlm/wavlm-base-plus.pdparams
fi
exit 0
\ No newline at end of file
#!/bin/bash
set -e
ngpu=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
echo "using $ngpu gpus..."
expdir=exp
datadir=data
recog_set="test-clean test-other dev-clean dev-other"
recog_set="test-clean"
config_path=$1
decode_config_path=$2
ckpt_prefix=$3
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1;
# download language model
#bash local/download_lm_en.sh
#if [ $? -ne 0 ]; then
# exit 1
#fi
python3 format_rsl.py \
--origin_ref data/manifest.test-clean.raw \
--trans_ref data/manifest.test-clean.text
for type in ctc_greedy_search; do
echo "decoding ${type}"
batch_size=16
python3 -u ${BIN_DIR}/test.py \
--ngpu ${ngpu} \
--config ${config_path} \
--decode_cfg ${decode_config_path} \
--result_file ${ckpt_prefix}.${type}.rsl \
--checkpoint_path ${ckpt_prefix} \
--opts decode.decoding_method ${type} \
--opts decode.decode_batch_size ${batch_size}
if [ $? -ne 0 ]; then
echo "Failed in evaluation!"
exit 1
fi
python3 format_rsl.py \
--origin_hyp ${ckpt_prefix}.${type}.rsl \
--trans_hyp ${ckpt_prefix}.${type}.rsl.text
python3 compute_wer.py --char=1 --v=1 \
data/manifest.test-clean.text ${ckpt_prefix}.${type}.rsl.text > ${ckpt_prefix}.${type}.error
echo "decoding ${type} done."
done
for type in ctc_prefix_beam_search; do
echo "decoding ${type}"
batch_size=1
python3 -u ${BIN_DIR}/test.py \
--ngpu ${ngpu} \
--config ${config_path} \
--decode_cfg ${decode_config_path} \
--result_file ${ckpt_prefix}.${type}.rsl \
--checkpoint_path ${ckpt_prefix} \
--opts decode.decoding_method ${type} \
--opts decode.decode_batch_size ${batch_size}
if [ $? -ne 0 ]; then
echo "Failed in evaluation!"
exit 1
fi
python3 format_rsl.py \
--origin_hyp ${ckpt_prefix}.${type}.rsl \
--trans_hyp ${ckpt_prefix}.${type}.rsl.text
python3 compute_wer.py --char=1 --v=1 \
data/manifest.test-clean.text ${ckpt_prefix}.${type}.rsl.text > ${ckpt_prefix}.${type}.error
echo "decoding ${type} done."
done
echo "Finished"
exit 0
#!/bin/bash
if [ $# != 4 ];then
echo "usage: ${0} config_path decode_config_path ckpt_path_prefix audio_file"
exit -1
fi
ngpu=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
echo "using $ngpu gpus..."
config_path=$1
decode_config_path=$2
ckpt_prefix=$3
audio_file=$4
mkdir -p data
wget -nc https://paddlespeech.bj.bcebos.com/datasets/single_wav/en/demo_002_en.wav -P data/
if [ $? -ne 0 ]; then
exit 1
fi
if [ ! -f ${audio_file} ]; then
echo "Plase input the right audio_file path"
exit 1
fi
chunk_mode=false
if [[ ${config_path} =~ ^.*chunk_.*yaml$ ]];then
chunk_mode=true
fi
# download language model
#bash local/download_lm_ch.sh
#if [ $? -ne 0 ]; then
# exit 1
#fi
for type in ctc_greedy_search; do
echo "decoding ${type}"
batch_size=1
output_dir=${ckpt_prefix}
mkdir -p ${output_dir}
python3 -u ${BIN_DIR}/test_wav.py \
--ngpu ${ngpu} \
--config ${config_path} \
--decode_cfg ${decode_config_path} \
--result_file ${output_dir}/${type}.rsl \
--checkpoint_path ${ckpt_prefix} \
--opts decode.decoding_method ${type} \
--opts decode.decode_batch_size ${batch_size} \
--audio_file ${audio_file}
if [ $? -ne 0 ]; then
echo "Failed in evaluation!"
exit 1
fi
done
exit 0
#!/bin/bash
if [ $# -lt 2 ] && [ $# -gt 3 ];then
echo "usage: CUDA_VISIBLE_DEVICES=0 ${0} config_path ckpt_name ips(optional)"
exit -1
fi
ngpu=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
echo "using $ngpu gpus..."
config_path=$1
ckpt_name=$2
resume=$3
ips=$4
if [ ! $ips ];then
ips_config=
else
ips_config="--ips="${ips}
fi
mkdir -p exp
# seed may break model convergence
seed=1988
if [ ${seed} != 0 ]; then
export FLAGS_cudnn_deterministic=True
fi
# export FLAGS_cudnn_exhaustive_search=true
# export FLAGS_conv_workspace_size_limit=4000
export FLAGS_allocator_strategy=naive_best_fit
if [ ${ngpu} == 0 ]; then
python3 -u ${BIN_DIR}/train.py \
--ngpu ${ngpu} \
--config ${config_path} \
--output exp/${ckpt_name} \
--seed ${seed} \
--resume ${resume}
else
python3 -m paddle.distributed.launch --gpus=${CUDA_VISIBLE_DEVICES} ${ips_config} ${BIN_DIR}/train.py \
--ngpu ${ngpu} \
--config ${config_path} \
--output exp/${ckpt_name} \
--seed ${seed} \
--resume ${resume}
fi
# capture the training exit status before `[` and `unset` overwrite $?
train_status=$?
if [ ${seed} != 0 ]; then
unset FLAGS_cudnn_deterministic
fi
if [ ${train_status} -ne 0 ]; then
echo "Failed in training!"
exit 1
fi
exit 0
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/tools/sctk/bin:${PWD}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
# export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/lib/
export BIN_DIR=${MAIN_ROOT}/paddlespeech/s2t/exps/wavlm/bin
#!/bin/bash
set -e
. ./path.sh || exit 1;
. ./cmd.sh || exit 1;
gpus=0,1,2
stage=0
stop_stage=3
conf_path=conf/wavlmASR.yaml
ips= #xx.xx.xx.xx,xx.xx.xx.xx
decode_conf_path=conf/tuning/decode.yaml
avg_num=3
resume= # xx e.g. 30
. ${MAIN_ROOT}/utils/parse_options.sh || exit 1;
audio_file=data/demo_002_en.wav
# avg_ckpt=avg_${avg_num}
avg_ckpt=4
ckpt=$(basename ${conf_path} | awk -F'.' '{print $1}')
echo "checkpoint name ${ckpt}"
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
bash ./local/data.sh || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `exp` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${ckpt} ${resume} ${ips}
fi
# if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# # avg n best model
# ./avg.sh best exp/${ckpt}/checkpoints ${avg_num}
# fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# greedy search decoder
CUDA_VISIBLE_DEVICES=0 ./local/test.sh ${conf_path} ${decode_conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} || exit -1
fi
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# test a single .wav file
CUDA_VISIBLE_DEVICES=0 ./local/test_wav.sh ${conf_path} ${decode_conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} ${audio_file} || exit -1
fi
../../../utils
\ No newline at end of file
#!/usr/bin/env python3
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
......
File mode changed from 100644 to 100755
#!/usr/bin/env python3
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
......
File mode changed from 100644 to 100755
#!/usr/bin/env python3
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
......
File mode changed from 100644 to 100755
File mode changed from 100644 to 100755
EXP_DIR=exp
exp=exp
data=data
mkdir -p $exp
mkdir -p $data
mkdir -p $EXP_DIR
LEXICON_NAME='simple'
if [ ! -f "$EXP_DIR/$LEXICON_NAME.lexicon" ]; then
MFA_DOWNLOAD_DIR=local/
if [ ! -f "$exp/$LEXICON_NAME.lexicon" ]; then
echo "generating lexicon..."
python local/generate_lexicon.py "$EXP_DIR/$LEXICON_NAME" --with-r --with-tone
python local/generate_lexicon.py "$exp/$LEXICON_NAME" --with-r --with-tone
echo "lexicon done"
fi
if [ ! -d $EXP_DIR/baker_corpus ]; then
if [ ! -d $exp/baker_corpus ]; then
echo "reorganizing baker corpus..."
python local/reorganize_baker.py --root-dir=~/datasets/BZNSYP --output-dir=$EXP_DIR/baker_corpus --resample-audio
echo "reorganization done. Check output in $EXP_DIR/baker_corpus."
python local/reorganize_baker.py --root-dir=~/datasets/BZNSYP --output-dir=$exp/baker_corpus --resample-audio
echo "reorganization done. Check output in $exp/baker_corpus."
echo "audio files are resampled to 16kHz"
echo "transcription for each audio file is saved with the same namd in $EXP_DIR/baker_corpus "
echo "transcription for each audio file is saved with the same namd in $exp/baker_corpus "
fi
echo "detecting oov..."
python local/detect_oov.py $EXP_DIR/baker_corpus $EXP_DIR/"$LEXICON_NAME.lexicon"
python local/detect_oov.py $exp/baker_corpus $exp/"$LEXICON_NAME.lexicon"
echo "detecting oov done. you may consider regenerate lexicon if there is unexpected OOVs."
MFA_DOWNLOAD_DIR=local/
if [ ! -f "$MFA_DOWNLOAD_DIR/montreal-forced-aligner_linux.tar.gz" ]; then
echo "downloading mfa..."
(cd $MFA_DOWNLOAD_DIR && wget https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz)
......@@ -37,11 +40,15 @@ if [ ! -d "$MFA_DOWNLOAD_DIR/montreal-forced-aligner" ]; then
fi
export PATH="$MFA_DOWNLOAD_DIR/montreal-forced-aligner/bin"
if [ ! -d "$EXP_DIR/baker_alignment" ]; then
if [ ! -d "$exp/baker_alignment" ]; then
echo "Start MFA training..."
mfa_train_and_align $EXP_DIR/baker_corpus "$EXP_DIR/$LEXICON_NAME.lexicon" $EXP_DIR/baker_alignment -o $EXP_DIR/baker_model --clean --verbose --temp_directory $EXP_DIR/.mfa_train_and_align
PATH=$MFA_DOWNLOAD_DIR/montreal-forced-aligner/bin/:$PATH \
LD_LIBRARY_PATH=$MFA_DOWNLOAD_DIR/montreal-forced-aligner/lib/:$LD_LIBRARY_PATH \
./$MFA_DOWNLOAD_DIR/montreal-forced-aligner/bin/mfa_train_and_align \
$exp/baker_corpus "$exp/$LEXICON_NAME.lexicon" $exp/baker_alignment -o $exp/baker_model --clean --verbose -j 10 --temp_directory $exp/.mfa_train_and_align
echo "training done!"
echo "results: $EXP_DIR/baker_alignment"
echo "model: $EXP_DIR/baker_model"
echo "results: $exp/baker_alignment"
echo "model: $exp/baker_model"
fi
EXP_DIR=exp
exp=exp
mkdir -p $EXP_DIR
mkdir -p $exp
LEXICON_NAME='canton'
if [ ! -f "$EXP_DIR/$LEXICON_NAME.lexicon" ]; then
MFA_DOWNLOAD_DIR=local/
if [ ! -f "$exp/$LEXICON_NAME.lexicon" ]; then
echo "generating lexicon and training data..."
python local/generate_canton_lexicon_wavlabs.py --output_lexicon "$EXP_DIR/$LEXICON_NAME.lexicon" --output_wavlabs "$EXP_DIR/$LEXICON_NAME"_wavlabs --inputs ~/datasets/Guangzhou_Cantonese_Scripted_Speech_Corpus_Daily_Use_Sentence ~/datasets/Guangzhou_Cantonese_Scripted_Speech_Corpus_in_Vehicle
python local/generate_canton_lexicon_wavlabs.py --output_lexicon "$exp/$LEXICON_NAME.lexicon" --output_wavlabs "$exp/$LEXICON_NAME"_wavlabs --inputs ~/datasets/Guangzhou_Cantonese_Scripted_Speech_Corpus_Daily_Use_Sentence ~/datasets/Guangzhou_Cantonese_Scripted_Speech_Corpus_in_Vehicle
echo "lexicon and training data done"
fi
MFA_DOWNLOAD_DIR=local/
if [ ! -f "$MFA_DOWNLOAD_DIR/montreal-forced-aligner_linux.tar.gz" ]; then
echo "downloading mfa..."
(cd $MFA_DOWNLOAD_DIR && wget https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz)
......@@ -24,11 +23,14 @@ if [ ! -d "$MFA_DOWNLOAD_DIR/montreal-forced-aligner" ]; then
fi
export PATH="$MFA_DOWNLOAD_DIR/montreal-forced-aligner/bin"
if [ ! -d "$EXP_DIR/canton_alignment" ]; then
if [ ! -d "$exp/canton_alignment" ]; then
echo "Start MFA training..."
mfa_train_and_align "$EXP_DIR/$LEXICON_NAME"_wavlabs "$EXP_DIR/$LEXICON_NAME.lexicon" $EXP_DIR/canton_alignment -o $EXP_DIR/canton_model --clean --verbose --temp_directory $EXP_DIR/.mfa_train_and_align
PATH=$MFA_DOWNLOAD_DIR/montreal-forced-aligner/bin/:$PATH \
LD_LIBRARY_PATH=$MFA_DOWNLOAD_DIR/montreal-forced-aligner/lib/:$LD_LIBRARY_PATH \
./$MFA_DOWNLOAD_DIR/montreal-forced-aligner/bin/mfa_train_and_align \
"$exp/$LEXICON_NAME"_wavlabs "$exp/$LEXICON_NAME.lexicon" $exp/canton_alignment -o $exp/canton_model --clean --verbose -j 10 --temp_directory $exp/.mfa_train_and_align
echo "training done!"
echo "results: $EXP_DIR/canton_alignment"
echo "model: $EXP_DIR/canton_model"
echo "results: $exp/canton_alignment"
echo "model: $exp/canton_model"
fi
#!/usr/bin/env python3
import argparse
import os
import re
......@@ -8,7 +9,7 @@ replace_ = {"#1": "%", "#2": "`", "#3": "~", "#4": "$"}
def replace_rhy_with_punc(line):
# r'[:、,;。?!,.:;"?!”’《》【】<=>{}()()#&@“”^_|…\\]%*$', '', line) # see checkcheck_oov.py,
# r'[:、,;。?!,.:;"?!”’《》【】<=>{}()()#&@“”^_|…\\]%*$', '', line) # see check_oov.py,
line = re.sub(r'[:、,;。?!,.:;"?!’《》【】<=>{}()()#&@“”^_|…\\]%*$', '', line)
for r in replace_.keys():
if r in line:
......
#!/usr/bin/env python3
import argparse
import os
import re
......@@ -6,7 +7,7 @@ replace_ = {"#1": "%", "#2": "`", "#3": "~", "#4": "$"}
def replace_rhy_with_punc(line):
# r'[:、,;。?!,.:;"?!”’《》【】<=>{}()()#&@“”^_|…\\]%*$', '', line) # see checkcheck_oov.py,
# r'[:、,;。?!,.:;"?!”’《》【】<=>{}()()#&@“”^_|…\\]%*$', '', line) # see check_oov.py,
line = re.sub(r'^$\*%', '', line)
for r in replace_.keys():
if r in line:
......
......@@ -6,13 +6,15 @@ gpus=0
stage=0
stop_stage=100
data=data
mkdir -p $data
aishell_data=label_train-set.txt
csmsc_data=000001-010000.txt
processed_path=data
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_2600.pdz
ckpt_name=snapshot_iter_4680.pdz
text=我们城市的复苏有赖于他强有力的政策。
print_eval=false
......@@ -23,7 +25,7 @@ source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/data.sh ${aishell_data} ${csmsc_data} ${processed_path}
./local/data.sh ${aishell_data} ${csmsc_data} ${data}
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
......
......@@ -33,6 +33,18 @@ ec5a9b24acc35469229e41256ceaf77d data/lang_char/input.txt
```
```
==> data/lang_char/input.txt <==
mister quilter is the apostle of the middle classes and we are glad to welcome his gospel
nor is mister quilter's manner less interesting than his matter
he tells us that at this festive season of the year with christmas and roast beef looming before us similes drawn from eating and its results occur most readily to the mind
he has grave doubts whether sir frederick leighton's work is really greek after all and can discover in it but little of rocky ithaca
linnell's pictures are a sort of up guards and at em paintings and mason's exquisite idylls are as national as a jingo poem mister birket foster's landscapes smile at one much in the same way that mister carker used to flash his teeth and mister john collier gives his sitter a cheerful slap on the back before he says like a shampooer in a turkish bath next man
it is obviously unnecessary for us to point out how luminous these criticisms are how delicate in expression
on the general principles of art mister quilter writes with equal lucidity
painting he tells us is of a different quality to mathematics and finish in art is adding more fact
as for etchings they are of two kinds british and foreign
he laments most bitterly the divorce that has been made between decorative art and what we usually call pictures makes the customary appeal to the last judgment and reminds us that in the great days of art michael angelo was the furnishing upholsterer
==> data/lang_char/input.bpe <==
▁mi ster ▁quilter ▁ is ▁the ▁a p ost le ▁o f ▁the ▁mi d d le ▁c las s es ▁ and ▁we ▁ar e ▁g l a d ▁ to ▁we l c om e ▁h is ▁g o s pe l
▁ n or ▁ is ▁mi ster ▁quilter ' s ▁ma nne r ▁ l ess ▁in ter es t ing ▁tha n ▁h is ▁ma t ter
......@@ -58,17 +70,6 @@ painting he tells us is of a different quality to mathematics and finish in art
as for etchings they are of two kinds british and foreign
he laments most bitterly the divorce that has been made between decorative art and what we usually call pictures makes the customary appeal to the last judgment and reminds us that in the great days of art michael angelo was the furnishing upholsterer
==> data/lang_char/input.txt <==
mister quilter is the apostle of the middle classes and we are glad to welcome his gospel
nor is mister quilter's manner less interesting than his matter
he tells us that at this festive season of the year with christmas and roast beef looming before us similes drawn from eating and its results occur most readily to the mind
he has grave doubts whether sir frederick leighton's work is really greek after all and can discover in it but little of rocky ithaca
linnell's pictures are a sort of up guards and at em paintings and mason's exquisite idylls are as national as a jingo poem mister birket foster's landscapes smile at one much in the same way that mister carker used to flash his teeth and mister john collier gives his sitter a cheerful slap on the back before he says like a shampooer in a turkish bath next man
it is obviously unnecessary for us to point out how luminous these criticisms are how delicate in expression
on the general principles of art mister quilter writes with equal lucidity
painting he tells us is of a different quality to mathematics and finish in art is adding more fact
as for etchings they are of two kinds british and foreign
he laments most bitterly the divorce that has been made between decorative art and what we usually call pictures makes the customary appeal to the last judgment and reminds us that in the great days of art michael angelo was the furnishing upholsterer
==> data/lang_char/train_unigram100_units.txt <==
<blank> 0
......
......@@ -52,7 +52,7 @@ class SSLExecutor(BaseExecutor):
'--model',
type=str,
default='wav2vec2',
choices=['wav2vec2', 'hubert'],
choices=['wav2vec2', 'hubert', "wavlm"],
help='Choose model type of asr task.')
self.parser.add_argument(
'--task',
......@@ -157,6 +157,12 @@ class SSLExecutor(BaseExecutor):
elif lang == 'zh':
logger.error("zh hubertASR is not supported yet")
tag = model_prefix + '-' + lang + '-' + sample_rate_str
elif model_type == 'wavlm':
if lang == "en":
model_prefix = "wavlmASR_librispeech"
elif lang == "zh":
logger.error("zh wavlmASR is not supported yet")
tag = model_prefix + '-' + lang + '-' + sample_rate_str
else:
tag = model_type + '-' + lang + '-' + sample_rate_str
self.task_resource.set_task_model(tag, version=None)
......
......@@ -25,6 +25,7 @@ model_alias = {
"wav2vec2": ["paddlespeech.s2t.models.wav2vec2:Wav2vec2Base"],
"hubertASR": ["paddlespeech.s2t.models.hubert:HubertASR"],
"hubert": ["paddlespeech.s2t.models.hubert:HubertBase"],
"wavlmASR": ["paddlespeech.s2t.models.wavlm:WavLMASR"],
# ---------------------------------
# -------------- ASR --------------
......
......@@ -149,6 +149,16 @@ ssl_dynamic_pretrained_models = {
'exp/hubertASR/checkpoints/avg_1.pdparams',
},
},
"wavlmASR_librispeech-en-16k": {
"1.0": {
"url": "https://paddlespeech.bj.bcebos.com/wavlm/wavlm_baseplus_libriclean_100h.tar.gz",
"md5": "f2238e982bb8bcf046e536201f5ea629",
"cfg_path": "model.yaml",
"ckpt_path": "exp/wavlmASR/checkpoints/46",
"model": "exp/wavlmASR/checkpoints/46.pdparams",
"params": "exp/wavlmASR/checkpoints/46.pdparams",
}
}
}
# ---------------------------------
......
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Evaluation for WavLM model."""
import cProfile
from yacs.config import CfgNode
from paddlespeech.s2t.exps.wavlm.model import WavLMASRTester as Tester
from paddlespeech.s2t.training.cli import default_argument_parser
from paddlespeech.utils.argparse import print_arguments, add_arguments
def main_sp(config, args):
exp = Tester(config, args)
with exp.eval():
exp.setup()
exp.run_test()
def main(config, args):
main_sp(config, args)
if __name__ == "__main__":
parser = default_argument_parser()
# save asr result to a file
parser.add_argument(
'--dict-path', type=str, default=None, help='dict path.')
parser.add_argument(
"--result_file", type=str, help="path of save the asr result")
args = parser.parse_args()
print_arguments(args, globals())
# https://yaml.org/type/float.html
config = CfgNode(new_allowed=True)
if args.config:
config.merge_from_file(args.config)
if args.decode_cfg:
decode_confs = CfgNode(new_allowed=True)
decode_confs.merge_from_file(args.decode_cfg)
config.decode = decode_confs
if args.opts:
config.merge_from_list(args.opts)
config.freeze()
print(config)
if args.dump_config:
with open(args.dump_config, 'w') as f:
print(config, file=f)
# Setting for profiling
pr = cProfile.Profile()
pr.runcall(main, config, args)
pr.dump_stats('test.profile')
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Evaluation for wavlm model."""
import os
import sys
from pathlib import Path
import paddle
import soundfile
from paddlenlp.transformers import AutoTokenizer
from yacs.config import CfgNode
from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer
from paddlespeech.s2t.models.wavlm.wavlm_asr import WavLMASR
from paddlespeech.s2t.training.cli import default_argument_parser
from paddlespeech.s2t.utils.log import Log
from paddlespeech.s2t.utils.utility import UpdateConfig
logger = Log(__name__).getlog()
class WavLMInfer():
def __init__(self, config, args):
self.args = args
self.config = config
self.audio_file = args.audio_file
self.tokenizer = config.get("tokenizer", None)
if self.tokenizer:
self.text_feature = AutoTokenizer.from_pretrained(
self.config.tokenizer)
else:
self.text_feature = TextFeaturizer(
unit_type=config.unit_type, vocab=config.vocab_filepath)
paddle.set_device('gpu' if self.args.ngpu > 0 else 'cpu')
# model
model_conf = config
with UpdateConfig(model_conf):
model_conf.output_dim = self.text_feature.vocab_size
model = WavLMASR.from_config(model_conf)
self.model = model
self.model.eval()
# load model
params_path = self.args.checkpoint_path + ".pdparams"
model_dict = paddle.load(params_path)
self.model.set_state_dict(model_dict)
def run(self):
check(args.audio_file)
with paddle.no_grad():
# read
audio, _ = soundfile.read(
self.audio_file, dtype="int16", always_2d=True)
logger.info(f"audio shape: {audio.shape}")
xs = paddle.to_tensor(audio, dtype='float32').unsqueeze(axis=0)
decode_config = self.config.decode
result_transcripts, result_tokenids = self.model.decode(
xs,
text_feature=self.text_feature,
decoding_method=decode_config.decoding_method,
beam_size=decode_config.beam_size,
tokenizer=self.tokenizer, )
rsl = result_transcripts[0]
utt = Path(self.audio_file).name
logger.info(f"hyp: {utt} {rsl}")
return rsl
def check(audio_file):
if not os.path.isfile(audio_file):
print("Please input the right audio file path")
sys.exit(-1)
logger.info("checking the audio file format......")
try:
sig, sample_rate = soundfile.read(audio_file)
except Exception as e:
logger.error(str(e))
logger.error(
"can not open the wav file, please check the audio file format")
sys.exit(-1)
logger.info("The sample rate is %d" % sample_rate)
assert (sample_rate == 16000)
logger.info("The audio file format is right")
def main(config, args):
WavLMInfer(config, args).run()
if __name__ == "__main__":
parser = default_argument_parser()
# save asr result to a file
parser.add_argument(
"--result_file", type=str, help="path of save the asr result")
parser.add_argument(
"--audio_file", type=str, help="path of the input audio file")
args = parser.parse_args()
config = CfgNode(new_allowed=True)
if args.config:
config.merge_from_file(args.config)
if args.decode_cfg:
decode_confs = CfgNode(new_allowed=True)
decode_confs.merge_from_file(args.decode_cfg)
config.decode = decode_confs
if args.opts:
config.merge_from_list(args.opts)
config.freeze()
main(config, args)
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Trainer for wavlm model."""
import cProfile
import os
from yacs.config import CfgNode
from paddlespeech.s2t.exps.wavlm.model import WavLMASRTrainer as Trainer
from paddlespeech.s2t.training.cli import default_argument_parser
from paddlespeech.utils.argparse import print_arguments, add_arguments
def main_sp(config, args):
exp = Trainer(config, args)
exp.setup()
exp.run()
def main(config, args):
main_sp(config, args)
if __name__ == "__main__":
parser = default_argument_parser()
parser.add_argument(
'--resume', type=str, default="", nargs="?", help='resume ckpt path.')
args = parser.parse_args()
print_arguments(args, globals())
# https://yaml.org/type/float.html
config = CfgNode(new_allowed=True)
if args.config:
config.merge_from_file(args.config)
if args.opts:
config.merge_from_list(args.opts)
config.freeze()
if args.dump_config:
with open(args.dump_config, 'w') as f:
print(config, file=f)
# Setting for profiling
pr = cProfile.Profile()
pr.runcall(main, config, args)
pr.dump_stats(os.path.join(args.output, 'train.profile'))
(This diff is collapsed.)
from .wavlm_paddle import WavLM, WavLMConfig
from .wavlm_asr import WavLMASR, WavLMBase
\ No newline at end of file
# Copyright 2020 The HuggingFace Team. All rights reserved.
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import paddle
import paddle.nn.functional as F
def _gelu_python(x):
"""
Original Implementation of the GELU activation function in Google BERT repo when initially created. For
information: OpenAI GPT's GELU is slightly different (and gives slightly different results): 0.5 * x * (1 +
torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3)))) This is now written in C in
torch.nn.functional Also see the Gaussian Error Linear Units paper: https://arxiv.org/abs/1606.08415
"""
return x * 0.5 * (1.0 + paddle.erf(x / math.sqrt(2.0)))
def gelu_new(x):
"""
Implementation of the GELU activation function currently in Google BERT repo (identical to OpenAI GPT). Also see
the Gaussian Error Linear Units paper: https://arxiv.org/abs/1606.08415
"""
return 0.5 * x * (1.0 + paddle.tanh(
math.sqrt(2.0 / math.pi) * (x + 0.044715 * paddle.pow(x, 3.0))))
def gelu_fast(x):
return 0.5 * x * (1.0 + paddle.tanh(x * 0.7978845608 *
(1.0 + 0.044715 * x * x)))
gelu = gelu_fast
def _silu_python(x):
"""
See Gaussian Error Linear Units (Hendrycks et al., https://arxiv.org/abs/1606.08415) where the SiLU (Sigmoid Linear
Unit) was originally introduced and coined, and see Sigmoid-Weighted Linear Units for Neural Network Function
Approximation in Reinforcement Learning (Elfwing et al., https://arxiv.org/abs/1702.03118) and Swish: a Self-Gated
Activation Function (Ramachandran et al., https://arxiv.org/abs/1710.05941v1) where the SiLU was experimented with
later.
"""
return x * paddle.nn.functional.sigmoid(x)
def mish(x):
return x * paddle.tanh(paddle.nn.functional.softplus(x))
def linear_act(x):
return x
ACT2FN = {
"relu": F.relu,
"silu": _silu_python,
"swish": _silu_python,
"gelu": gelu,
"tanh": paddle.tanh,
"gelu_new": gelu_new,
"gelu_fast": gelu_fast,
"mish": mish,
"linear": linear_act,
"sigmoid": paddle.nn.functional.sigmoid,
}
def get_activation(activation_string):
if activation_string in ACT2FN:
return ACT2FN[activation_string]
else:
raise KeyError(
f"function {activation_string} not found in ACT2FN mapping {list(ACT2FN.keys())}"
)
\ No newline at end of file
(This diff is collapsed.)
(This diff is collapsed.)
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from collections import defaultdict
from typing import Dict
from typing import List
from typing import Tuple
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddlespeech.s2t.models.wav2vec2.modules.VanillaNN import VanillaNN
from paddlespeech.s2t.models.wav2vec2.processing.speech_augmentation import SpecAugment
from paddlespeech.s2t.modules.ctc import CTCDecoderBase as CTC
from paddlespeech.s2t.modules.initializer import DefaultInitializerContext
from paddlespeech.s2t.utils.ctc_utils import remove_duplicates_and_blank
from paddlespeech.s2t.utils.utility import log_add
from .wavlm_paddle import WavLM, WavLMConfig
class WavLMASR(nn.Layer):
def __init__(self, config: dict):
super().__init__()
init_type = config.get("init_type", None)
with DefaultInitializerContext(init_type):
self.config = config
wavlm_config = WavLMConfig(config)
wavlm = WavLM(wavlm_config)
self.normalize_wav = config.normalize_wav
self.output_norm = config.output_norm
if hasattr(config, 'spec_augment'):
self.spec_augment = SpecAugment(**config.spec_augment)
if config.freeze_wavlm:
wavlm.eval()
for parm in wavlm.parameters():
parm.trainable = False
self.wavlm = wavlm
self.enc = VanillaNN(**config.enc)
self.ctc = CTC(**config.ctc,
odim=config.output_dim,
batch_average=False,
reduction='mean')
def forward(self, wav, wavs_lens_rate, target, target_lens):
if self.normalize_wav:
wav = F.layer_norm(wav, wav.shape)
# Extract wav2vec output
out = self.wavlm(wav)
# We normalize the output if required
if self.output_norm:
out = F.layer_norm(out, out.shape)
if self.training and hasattr(self.config, 'spec_augment'):
feats = self.spec_augment(out)
else:
feats = out
x = self.enc(feats)
# x = feats
x_lens = (wavs_lens_rate * x.shape[1]).round().astype(paddle.int64)
target_lens = target_lens.astype(paddle.int64)
# target = target.astype(paddle.int32)
ctc_loss = self.ctc(x, x_lens, target, target_lens)
return ctc_loss
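# Shape walk-through (assuming 16 kHz input and WavLM's ~20 ms frame rate):
#   wav:    (B, T_wav)        raw samples
#   out:    (B, T_frame, D)   WavLM hidden states
#   x:      (B, T_frame, H)   VanillaNN encoder output fed to CTC
#   x_lens: (B,)              wavs_lens_rate (a 0-1 length ratio) scaled
#                             to frames, so CTC can mask the padding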
@paddle.no_grad()
def decode(self,
feats: paddle.Tensor,
text_feature: Dict[str, int],
decoding_method: str,
beam_size: int,
tokenizer: str=None,
sb_pipeline=False):
batch_size = feats.shape[0]
if decoding_method == 'ctc_prefix_beam_search' and batch_size > 1:
print(
f"decoding mode {decoding_method} must be running with batch_size == 1"
)
print(f"current batch_size is {batch_size}")
if decoding_method == 'ctc_greedy_search':
if tokenizer is None and sb_pipeline is False:
hyps = self.ctc_greedy_search(feats)
res = [text_feature.defeaturize(hyp) for hyp in hyps]
res_tokenids = [hyp for hyp in hyps]
else:
if sb_pipeline is True:
hyps = self.ctc_greedy_search(feats.unsqueeze(-1))
else:
hyps = self.ctc_greedy_search(feats)
res = []
res_tokenids = []
for sequence in hyps:
# Decode token terms to words
predicted_tokens = text_feature.convert_ids_to_tokens(
sequence)
tmp_res = []
tmp_res_tokenids = []
for c in predicted_tokens:
if c == "[CLS]":
continue
elif c == "[SEP]" or c == "[PAD]":
break
else:
tmp_res.append(c)
tmp_res_tokenids.append(text_feature.vocab[c])
res.append(''.join(tmp_res))
res_tokenids.append(tmp_res_tokenids)
# ctc_prefix_beam_search and attention_rescoring only return one
# result as List[int]; wrap it in List[List[int]] for compatibility
# with the other batch decoding modes
elif decoding_method == 'ctc_prefix_beam_search':
assert feats.shape[0] == 1
if tokenizer is None and sb_pipeline is False:
hyp = self.ctc_prefix_beam_search(feats, beam_size)
res = [text_feature.defeaturize(hyp)]
res_tokenids = [hyp]
else:
if sb_pipeline is True:
hyp = self.ctc_prefix_beam_search(
feats.unsqueeze(-1), beam_size)
else:
hyp = self.ctc_prefix_beam_search(feats, beam_size)
res = []
res_tokenids = []
predicted_tokens = text_feature.convert_ids_to_tokens(hyp)
tmp_res = []
tmp_res_tokenids = []
for c in predicted_tokens:
if c == "[CLS]":
continue
elif c == "[SEP]" or c == "[PAD]":
break
else:
tmp_res.append(c)
tmp_res_tokenids.append(text_feature.vocab[c])
res.append(''.join(tmp_res))
res_tokenids.append(tmp_res_tokenids)
else:
raise ValueError(
f"WavLM not support decoding method: {decoding_method}")
return res, res_tokenids
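# Decode usage sketch (illustrative; `featurizer` stands for any text
# feature object exposing defeaturize()/convert_ids_to_tokens()):
#
#     texts, token_ids = model.decode(
#         feats, text_feature=featurizer,
#         decoding_method='ctc_greedy_search', beam_size=10)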
@classmethod
def from_config(cls, config):
model = cls(config)
return model
def ctc_greedy_search(self, wav) -> List[List[int]]:
""" Apply CTC greedy search
Args:
speech (paddle.Tensor): (batch, max_len)
speech_length (paddle.Tensor): (batch, )
Returns:
List[List[int]]: best path result
"""
batch_size = wav.shape[0]
wav = wav[:, :, 0]
if self.normalize_wav:
wav = F.layer_norm(wav, wav.shape[1:])
# Extract wavlm output
out = self.wavlm(wav)
# We normalize the output if required
if self.output_norm:
out = F.layer_norm(out, out.shape[1:])
feats = out
x = self.enc(feats)
x_lens = x.shape[1]
ctc_probs = self.ctc.log_softmax(x) # (B, maxlen, vocab_size)
topk_prob, topk_index = ctc_probs.topk(1, axis=2) # (B, maxlen, 1)
topk_index = topk_index.view(batch_size, x_lens) # (B, maxlen)
hyps = [hyp.tolist() for hyp in topk_index]
hyps = [remove_duplicates_and_blank(hyp) for hyp in hyps]
return hyps
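# Worked example of the greedy collapse (blank id 0):
#   per-frame argmax:      [3, 3, 0, 3, 4, 4, 0]
#   merge repeated labels: [3, 0, 3, 4, 0]
#   drop blanks:           [3, 3, 4]
# A repeated label survives only when a blank separates the two runs.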
def _ctc_prefix_beam_search(
self,
wav,
beam_size,
blank_id: int=0, ) -> Tuple[List[Tuple[int, float]], paddle.Tensor]:
""" CTC prefix beam search inner implementation
Args:
speech (paddle.Tensor): (batch, max_len, feat_dim)
speech_length (paddle.Tensor): (batch, )
beam_size (int): beam size for beam search
decoding_chunk_size (int): decoding chunk for dynamic chunk
trained model.
<0: for decoding, use full chunk.
>0: for decoding, use fixed chunk size as set.
0: used for training, it's prohibited here
simulate_streaming (bool): whether do encoder forward in a
streaming fashion
Returns:
List[Tuple[int, float]]: nbest results, (N,1), (text, likelihood)
paddle.Tensor: encoder output, (1, max_len, encoder_dim),
it will be used for rescoring in attention rescoring mode
"""
wav = wav[:, :, 0]
if self.normalize_wav:
wav = F.layer_norm(wav, wav.shape[1:])
# Extract wavlm output
out = self.wavlm(wav)
# We normalize the output if required
if self.output_norm:
out = F.layer_norm(out, out.shape[1:])
feats = out
x = self.enc(feats)
maxlen = x.shape[1]
ctc_probs = self.ctc.log_softmax(x) # (1, maxlen, vocab_size)
ctc_probs = ctc_probs.squeeze(0)
# cur_hyps: (prefix, (blank_ending_score, none_blank_ending_score))
# blank_ending_score and none_blank_ending_score in ln domain
cur_hyps = [(tuple(), (0.0, -float('inf')))]
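# Recursion notes (standard CTC prefix beam search):
#   pb  = ln P(prefix, last emitted frame was blank)
#   pnb = ln P(prefix, last emitted frame was non-blank)
# The empty prefix starts with pb = ln 1 = 0 and pnb = ln 0 = -inf;
# log_add below is a numerically stable ln(e^a + e^b + ...).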
# 2. CTC beam search step by step
for t in range(0, maxlen):
logp = ctc_probs[t] # (vocab_size,)
# key: prefix, value (pb, pnb), default value(-inf, -inf)
next_hyps = defaultdict(lambda: (-float('inf'), -float('inf')))
# 2.1 First beam prune: select topk best
top_k_logp, top_k_index = logp.topk(beam_size) # (beam_size,)
for s in top_k_index:
s = s.item()
ps = logp[s].item()
for prefix, (pb, pnb) in cur_hyps:
last = prefix[-1] if len(prefix) > 0 else None
if s == blank_id: # blank
n_pb, n_pnb = next_hyps[prefix]
n_pb = log_add([n_pb, pb + ps, pnb + ps])
next_hyps[prefix] = (n_pb, n_pnb)
elif s == last:
# Update *ss -> *s;
n_pb, n_pnb = next_hyps[prefix]
n_pnb = log_add([n_pnb, pnb + ps])
next_hyps[prefix] = (n_pb, n_pnb)
# Update *s-s -> *ss, - is for blank
n_prefix = prefix + (s, )
n_pb, n_pnb = next_hyps[n_prefix]
n_pnb = log_add([n_pnb, pb + ps])
next_hyps[n_prefix] = (n_pb, n_pnb)
else:
n_prefix = prefix + (s, )
n_pb, n_pnb = next_hyps[n_prefix]
n_pnb = log_add([n_pnb, pb + ps, pnb + ps])
next_hyps[n_prefix] = (n_pb, n_pnb)
# 2.2 Second beam prune
next_hyps = sorted(
next_hyps.items(),
key=lambda x: log_add(list(x[1])),
reverse=True)
cur_hyps = next_hyps[:beam_size]
hyps = [(y[0], log_add([y[1][0], y[1][1]])) for y in cur_hyps]
return hyps
def ctc_prefix_beam_search(self, wav, beam_size) -> List[int]:
""" Apply CTC prefix beam search
Args:
speech (paddle.Tensor): (batch, max_len, feat_dim)
speech_length (paddle.Tensor): (batch, )
beam_size (int): beam size for beam search
decoding_chunk_size (int): decoding chunk for dynamic chunk
trained model.
<0: for decoding, use full chunk.
>0: for decoding, use fixed chunk size as set.
0: used for training, it's prohibited here
simulate_streaming (bool): whether do encoder forward in a
streaming fashion
Returns:
List[int]: CTC prefix beam search nbest results
"""
hyps = self._ctc_prefix_beam_search(wav, beam_size)
return hyps[0][0]
class WavLMBase(nn.Layer):
"""WavLM model"""
def __init__(self, config: dict):
super().__init__()
wavlm_config = WavLMConfig(config)
wavlm = WavLM(wavlm_config)
self.wavlm = wavlm
@classmethod
def from_config(cls, configs: dict):
"""init model.
Args:
configs (dict): config dict.
Raises:
ValueError: raised when an unsupported encoder type is used.
Returns:
nn.Layer: WavLMBase
"""
model = cls(configs)
return model
def forward(self, wav):
out = self.wavlm(wav)
return out
@@ -66,7 +66,7 @@ def train_sp(args, config):
 seed_everything(config.seed)
 print(
-f"rank: {dist.get_rank()}, pid: {os.getpid()}, parent_pid: {os.getppid()}",
+f"rank:{dist.get_rank()}, pid: {os.getpid()}, parent_pid: {os.getppid()}"
 )
 # dataloader has been too verbose
 logging.getLogger("DataLoader").disabled = True
@@ -41,6 +41,7 @@ base = [
 "inflect",
 "jsonlines",
 "librosa==0.8.1",
+"scipy>=1.4.0",
 "loguru",
 "matplotlib",
 "nara_wpe",
@@ -29,11 +29,11 @@ if [[ ${MODE} = "benchmark_train" ]];then
 cd ${curPath}/../..
 echo "------------- install for speech "
 apt-get install libsndfile1 -y
-pip install yacs -i https://pypi.tuna.tsinghua.edu.cn/simple
-pip install pytest-runner -i https://pypi.tuna.tsinghua.edu.cn/simple
-pip install kaldiio -i https://pypi.tuna.tsinghua.edu.cn/simple
-pip install setuptools_scm -i https://pypi.tuna.tsinghua.edu.cn/simple
-pip install . -i https://pypi.tuna.tsinghua.edu.cn/simple
+pip install yacs #-i https://pypi.tuna.tsinghua.edu.cn/simple
+pip install pytest-runner #-i https://pypi.tuna.tsinghua.edu.cn/simple
+pip install kaldiio #-i https://pypi.tuna.tsinghua.edu.cn/simple
+pip install setuptools_scm #-i https://pypi.tuna.tsinghua.edu.cn/simple
+pip install . #-i https://pypi.tuna.tsinghua.edu.cn/simple
 pip install jsonlines
 pip list
 cd -