提交 812d80ab 作者: Hui Zhang

Merge branch 'develop' of https://github.com/PaddlePaddle/DeepSpeech into u2_export

......@@ -20,7 +20,8 @@
</p>
<div align="center">
<h4>
<a href="#快速开始"> 快速开始 </a>
<a href="#安装"> 安装 </a>
| <a href="#快速开始"> 快速开始 </a>
| <a href="#快速使用服务"> 快速使用服务 </a>
| <a href="#快速使用流式服务"> 快速使用流式服务 </a>
| <a href="#教程文档"> 教程文档 </a>
......@@ -38,6 +39,8 @@
**PaddleSpeech** 荣获 [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/), 请访问 [Arxiv](https://arxiv.org/abs/2205.12007) 论文。
### 效果展示
##### 语音识别
<div align = "center">
......@@ -154,7 +157,7 @@
本项目采用了易用、高效、灵活且可扩展的实现,旨在为工业应用与学术研究提供更好的支持。实现的功能包含训练、推断、测试模块以及部署过程,主要包括:
- 📦 **易用性**: 安装门槛低,可使用 [CLI](#quick-start) 快速开始。
- 🏆 **对标 SoTA**: 提供了高速、轻量级模型,且借鉴了最前沿的技术。
- 🏆 **流式 ASR 和 TTS 系统**:工业级的端到端流式识别、流式合成系统。
- 💯 **基于规则的中文前端**: 我们的前端包含文本正则化和字音转换(G2P)。此外,我们使用自定义语言规则来适应中文语境。
- **多种工业界以及学术界主流功能支持**:
- 🛎️ 典型音频任务: 本工具包提供了音频分类、语音翻译、自动语音识别、文本转语音、语音合成、声纹识别、KWS 等任务的实现。
......@@ -182,61 +185,195 @@
<img src="https://user-images.githubusercontent.com/23690325/169763015-cbd8e28d-602c-4723-810d-dbc6da49441e.jpg" width = "200" />
</div>
<a name="安装"></a>
## 安装
我们强烈建议用户在 **Linux** 环境下,使用 *3.7* 以上版本的 *Python* 安装 PaddleSpeech。
目前为止,**Linux** 支持声音分类、语音识别、语音合成和语音翻译四种功能,**Mac OSX、Windows** 下暂不支持语音翻译功能。想了解具体安装细节,可以参考[安装文档](./docs/source/install_cn.md)。
### 相关依赖
+ gcc >= 4.8.5
+ paddlepaddle >= 2.3.1
+ python >= 3.7
+ linux(推荐), mac, windows
PaddleSpeech 依赖于 paddlepaddle,安装可以参考 [paddlepaddle 官网](https://www.paddlepaddle.org.cn/),根据自己机器的情况进行选择。这里给出 CPU 版本示例,其它版本大家可以根据自己机器的情况进行安装。
```shell
pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple
```
PaddleSpeech 快速安装方式有两种,一种是 pip 安装,一种是源码编译(推荐)。
### pip 安装
```shell
pip install pytest-runner
pip install paddlespeech
```
### 源码编译
```shell
git clone https://github.com/PaddlePaddle/PaddleSpeech.git
cd PaddleSpeech
pip install pytest-runner
pip install .
```
更多关于安装问题,如 conda 环境、librosa 依赖的系统库、gcc 环境问题、kaldi 安装等,可以参考这篇[安装文档](docs/source/install_cn.md)。如安装上遇到问题,可以在 [#2150](https://github.com/PaddlePaddle/PaddleSpeech/issues/2150) 上留言以及查找相关问题。
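安装完成后,可以先做一次最小自检,确认 paddlepaddle 与 paddlespeech 都可用(`paddle.utils.run_check` 是 paddlepaddle 自带的安装检查接口):
```python
import paddle

# 检查 paddlepaddle 是否安装成功,成功时会输出
# "PaddlePaddle is installed successfully!" 字样
paddle.utils.run_check()

# 能成功导入 CLI 模块,即说明 paddlespeech 安装正常
from paddlespeech.cli.asr.infer import ASRExecutor
print("paddlespeech is ready")
```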
<a name="快速开始"></a>
## 快速开始
安装完成后,开发者可以通过命令行或者 Python 快速开始,命令行模式下改变 `--input` 可以尝试用自己的音频或文本测试,支持 16k wav 格式音频。
你也可以在 `aistudio` 中快速体验 👉🏻 [PaddleSpeech API Demo](https://aistudio.baidu.com/aistudio/projectdetail/4281335?shared=1)。
测试音频示例下载
```shell
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
```
### 语音识别
<details><summary>&emsp;(点击可展开)开源中文语音识别</summary>
命令行一键体验
```shell
paddlespeech asr --lang zh --input zh.wav
```
Python API 一键预测
```python
>>> from paddlespeech.cli.asr.infer import ASRExecutor
>>> asr = ASRExecutor()
>>> result = asr(audio_file="zh.wav")
>>> print(result)
我认为跑步最重要的就是给我带来了身体健康
```
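命令行与 Python API 均要求 16k 采样率的 wav 音频;若音频采样率不同,可以先重采样再识别。下面是一个最小示意(假设环境中已有 librosa 与 soundfile,二者均为 PaddleSpeech 的依赖;`my_audio.wav` 为假设的文件名):
```python
import librosa
import soundfile as sf

from paddlespeech.cli.asr.infer import ASRExecutor

# 读入任意采样率的音频,重采样为 16k 单声道后另存
wav, _ = librosa.load("my_audio.wav", sr=16000, mono=True)
sf.write("my_audio_16k.wav", wav, 16000)

asr = ASRExecutor()
print(asr(audio_file="my_audio_16k.wav"))
```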
</details>
### 语音合成
<details><summary>&emsp;开源中文语音合成</summary>
输出 24k 采样率 wav 格式音频
命令行一键体验
```shell
paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!" --output output.wav
```
Python API 一键预测
```python
>>> from paddlespeech.cli.tts.infer import TTSExecutor
>>> tts = TTSExecutor()
>>> tts(text="今天天气十分不错。", output="output.wav")
```
- 语音合成的 web demo 已经集成进了 [Huggingface Spaces](https://huggingface.co/spaces),请参考:[TTS Demo](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS)。
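TTSExecutor 初始化一次后可以重复调用,下面的示意用循环批量合成多条文本(句子内容与输出文件名均为示例):
```python
from paddlespeech.cli.tts.infer import TTSExecutor

tts = TTSExecutor()  # 模型只加载一次
sentences = ["欢迎光临。", "谢谢惠顾。"]  # 示例文本
for i, sent in enumerate(sentences, 1):
    # 逐条合成,输出 24k 采样率的 wav
    tts(text=sent, output=f"output_{i}.wav")
```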
</details>
### 声音分类
<details><summary>&emsp;适配多场景的开放领域声音分类工具</summary>
基于 AudioSet 数据集 527 个类别的声音分类模型
命令行一键体验
```shell
paddlespeech cls --input zh.wav
```
Python API 一键预测
```python
>>> from paddlespeech.cli.cls.infer import CLSExecutor
>>> cls = CLSExecutor()
>>> result = cls(audio_file="zh.wav")
>>> print(result)
Speech 0.9027186632156372
```
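分类结果默认只打印得分最高的标签;如需查看多个候选类别,可以尝试 `topk` 参数(此处写法仅为示意,参数名请以当前版本的接口文档为准):
```python
from paddlespeech.cli.cls.infer import CLSExecutor

cls = CLSExecutor()
# 假设当前版本支持 topk 参数,返回得分最高的 3 个标签
print(cls(audio_file="zh.wav", topk=3))
```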
</details>
### 声纹提取
<details><summary>&emsp;工业级声纹提取工具</summary>
命令行一键体验
```shell
paddlespeech vector --task spk --input zh.wav
```
Python API 一键预测
```python
>>> from paddlespeech.cli.vector import VectorExecutor
>>> vec = VectorExecutor()
>>> result = vec(audio_file="zh.wav")
>>> print(result) # 187 维向量
[ -0.19083306 9.474295 -14.122263 -2.0916545 0.04848729
4.9295826 1.4780062 0.3733844 10.695862 3.2697146
-4.48199 -0.6617882 -9.170393 -11.1568775 -1.2358263 ...]
```
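声纹向量通常用于两段音频的相似度比对,下面用 numpy 计算余弦相似度作为最小示意(`zh.wav`、`en.wav` 为上文下载的测试音频,仅用于演示流程):
```python
import numpy as np

from paddlespeech.cli.vector import VectorExecutor

vec = VectorExecutor()
emb1 = np.asarray(vec(audio_file="zh.wav"))
emb2 = np.asarray(vec(audio_file="en.wav"))

# 余弦相似度:得分越接近 1,越可能来自同一说话人
score = float(np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2)))
print(score)
```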
</details>
### 标点恢复
<details><summary>&emsp;一键恢复文本标点,可与 ASR 模型配合使用</summary>
命令行一键体验
```shell
paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭
```
Python API 一键预测
```python
>>> from paddlespeech.cli.text.infer import TextExecutor
>>> text_punc = TextExecutor()
>>> result = text_punc(text="今天的天气真不错啊你下午有空吗我想约你一起去吃饭")
>>> print(result)
今天的天气真不错啊!你下午有空吗?我想约你一起去吃饭?
```
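标点恢复常与语音识别配合使用:把 ASR 的无标点输出直接送入 TextExecutor 即可,示意如下(`zh.wav` 为上文下载的测试音频):
```python
from paddlespeech.cli.asr.infer import ASRExecutor
from paddlespeech.cli.text.infer import TextExecutor

asr = ASRExecutor()
text_punc = TextExecutor()

raw_text = asr(audio_file="zh.wav")  # 识别结果不带标点
print(text_punc(text=raw_text))      # 恢复标点后的文本
```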
</details>
### 语音翻译
<details><summary>&emsp;端到端英译中语音翻译工具</summary>
使用预编译的 kaldi 相关工具,只支持在 Ubuntu 系统中体验
命令行一键体验
```shell
paddlespeech st --input en.wav
```
Python API 一键预测
```python
>>> from paddlespeech.cli.st.infer import STExecutor
>>> st = STExecutor()
>>> result = st(audio_file="en.wav")
['我 在 这栋 建筑 的 古老 门上 敲门 。']
```
更多命令行命令请参考 [demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos)
> Note: 如果需要训练或者微调,请查看[语音识别](./docs/source/asr/quick_start.md), [语音合成](./docs/source/tts/quick_start.md)。
</details>
<a name="快速使用服务"></a>
## 快速使用服务
安装完成后,开发者可以通过命令行一键启动语音识别、语音合成、音频分类三种服务。
**启动服务**
```shell
......@@ -614,6 +751,7 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
语音合成模块最初被称为 [Parakeet](https://github.com/PaddlePaddle/Parakeet),现在与此仓库合并。如果您对该任务的学术研究感兴趣,请参阅 [TTS 研究概述](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/docs/source/tts#overview)。此外,[模型介绍](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/tts/models_introduction.md) 是了解语音合成流程的一个很好的指南。
## ⭐ 应用案例
- **[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo): 使用 PaddleSpeech 的语音合成模块生成虚拟人的声音。**
......
此差异已折叠。
此差异已折叠。
......@@ -8,7 +8,7 @@ http://0.0.0.0:8010/docs
### 【POST】/asr/offline
说明:上传 16k, 16bit wav 文件,返回 offline 语音识别模型识别结果
返回: JSON
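一个最小的调用示意(假设服务运行在本机 8010 端口,接口以 multipart 形式接收名为 `files` 的文件字段;实际字段名与返回结构请以服务端实现为准):
```python
import requests

# 假设:/asr/offline 以 multipart 形式接收 16k, 16bit 的 wav 文件,
# 字段名 "files" 仅为示意,请以服务端实现为准
with open("zh.wav", "rb") as f:
    resp = requests.post("http://0.0.0.0:8010/asr/offline", files={"files": f})
print(resp.json())
```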
......@@ -26,11 +26,11 @@ http://0.0.0.0:8010/docs
### 【POST】/asr/offlinefile
说明:上传 16k, 16bit wav 文件,返回 offline 语音识别模型识别结果 + wav 数据的 base64
返回: JSON
前端接口: 音频文件识别(播放这段 base64 还原后记得添加 wav 头,采样率 16k, int16,添加后才能播放)
示例:
......@@ -48,7 +48,7 @@ http://0.0.0.0:8010/docs
### 【POST】/asr/collectEnv
说明: 通过采集环境噪音,上传 16k, int16 wav 文件,来生成后台 VAD 的能量阈值,返回阈值结果
前端接口:ASR-环境采样
......@@ -64,9 +64,9 @@ http://0.0.0.0:8010/docs
### 【GET】/asr/stopRecord
说明:通过 GET 请求 /asr/stopRecord, 后台停止接收 offlineStream 中通过 WS 协议上传的数据
前端接口:语音聊天-暂停录音(获取 NLP,播放 TTS 时暂停)
返回: JSON
......@@ -80,9 +80,9 @@ http://0.0.0.0:8010/docs
### 【GET】/asr/resumeRecord
说明:通过 GET 请求 /asr/resumeRecord, 后台恢复接收 offlineStream 中通过 WS 协议上传的数据
前端接口:语音聊天-恢复录音(TTS 播放完毕时,告诉后台恢复录音)
返回: JSON
......@@ -100,16 +100,16 @@ http://0.0.0.0:8010/docs
前端接口:语音聊天-开始录音,持续将麦克风语音传给后端,后端推送语音识别结果
返回:后端返回识别结果,offline 模型识别结果,由 WS 推送
### 【Websocket】/ws/asr/onlineStream
说明:通过 WS 协议,将前端音频持续上传到后台,前端采集 16k,Int16 类型的 PCM 片段,持续上传到后端
前端接口:ASR-流式识别开始录音,持续将麦克风语音传给后端,后端推送语音识别结果
返回:后端返回识别结果,online 模型识别结果,由 WS 推送
## NLP
......@@ -202,7 +202,7 @@ http://0.0.0.0:8010/docs
### 【POST】/tts/offline
说明:获取 TTS 离线模型音频
前端接口:TTS-端到端合成
......@@ -272,7 +272,7 @@ curl -X 'POST' \
### 【POST】/vpr/recog
说明:声纹识别,识别文件,提取文件的声纹信息做比对;音频为 16k, int16 wav 格式
前端接口:声纹识别-上传音频,返回声纹识别结果
......@@ -383,9 +383,9 @@ curl -X 'GET' \
### 【GET】/vpr/database64
说明: 根据 vpr_id 获取用户注册 vpr 时使用的音频,转换成 16k, int16 类型的数组,返回 base64 编码
前端接口:声纹识别-获取 vpr 对应的音频(注意:播放时需要添加 wav 头,16k, int16,可参考 tts 播放时添加 wav 的方式,注意更改采样率)
访问示例:
......@@ -401,6 +401,4 @@ curl -X 'GET' \
"code": 0,
"result":"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
"message": "ok"
```
# Paddle Speech Demo
PaddleSpeechDemo 是一个以 PaddleSpeech 的语音交互功能为主体开发的 Demo 展示项目,用于帮助大家更好地上手 PaddleSpeech 以及使用 PaddleSpeech 构建自己的应用。
智能语音交互部分使用 PaddleSpeech,对话以及信息抽取部分使用 PaddleNLP,网页前端展示部分基于 Vue3 进行开发。
主要功能:
+ 语音聊天:PaddleSpeech 的语音识别能力+语音合成能力,对话部分基于 PaddleNLP 的闲聊功能
+ 声纹识别:PaddleSpeech 的声纹识别功能展示
+ 语音识别:支持【实时语音识别】,【端到端识别】,【音频文件识别】三种模式
+ 语音合成:支持【流式合成】与【端到端合成】两种方式
+ 语音指令:基于 PaddleSpeech 的语音识别能力与 PaddleNLP 的信息抽取,实现交通费的智能报销
运行效果:
......@@ -32,23 +32,21 @@ cd model
wget https://bj.bcebos.com/paddlenlp/applications/speech-cmd-analysis/finetune/model_state.pdparams
```
### 前端环境安装
前端依赖 `node.js`,需要提前安装,确保 `npm` 可用,`npm` 测试版本 `8.3.1`,建议下载[官网](https://nodejs.org/en/)稳定版的 `node.js`。
```
# 进入前端目录
cd web_client
# 安装 `yarn`,已经安装可跳过
npm install -g yarn
# 使用 yarn 安装前端依赖
yarn install
```
## 启动服务
### 开启后端服务
......@@ -66,18 +64,18 @@ cd web_client
yarn dev --port 8011
```
默认配置下,前端中配置的后台地址信息是 localhost,请确保后端服务器和打开页面的浏览器在同一台机器上;不在同一台机器的配置方式见下方 FAQ:【后端如果部署在其它机器或者别的端口如何修改】。
## FAQ
#### Q: 如何安装 node.js
A: node.js 的安装可以参考[【菜鸟教程】](https://www.runoob.com/nodejs/nodejs-install-setup.html),确保 npm 可用。
#### Q:后端如果部署在其它机器或者别的端口如何修改
A:后端的配置地址分散在两个文件中。
修改第一个文件 `PaddleSpeechWebClient/vite.config.js`
```
server: {
......@@ -92,7 +90,7 @@ server: {
}
```
修改第二个文件 `PaddleSpeechWebClient/src/api/API.js`(Websocket 代理配置失败,所以需要在这个文件中修改)
```
// websocket (这里改成后端所在的接口)
......@@ -107,9 +105,6 @@
A:这里主要是浏览器安全策略的限制,需要配置浏览器后重启。
chrome 设置地址: chrome://flags/#unsafely-treat-insecure-origin-as-secure
## 参考资料
vue 实现录音参考资料:https://blog.csdn.net/qq_41619796/article/details/107865602#t1
......
......@@ -5,15 +5,19 @@
## Introduction
This demo is an implementation of starting the streaming speech synthesis service and accessing the service. It can be achieved with a single command using `paddlespeech_server` and `paddlespeech_client` or a few lines of code in python.
For service interface definition, please check:
- [PaddleSpeech Server RESTful API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-RESTful-API)
- [PaddleSpeech Streaming Server WebSocket API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-WebSocket-API)
## Usage
### 1. Installation
see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.3.1** or above.
You can choose one way from easy, medium and hard to install paddlespeech.
**If you install in easy mode, you need to prepare the yaml file by yourself, you can refer to the yaml file in the conf directory.**
### 2. Prepare config File
The configuration file can be found in `conf/tts_online_application.yaml`.
......@@ -29,11 +33,10 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
- Both hifigan and mb_melgan support streaming voc inference.
- When the voc model is mb_melgan and voc_pad=14, the streaming synthetic audio is consistent with the non-streaming synthetic audio; voc_pad can be lowered to 7 with no audible artifacts, and below 7 the synthetic audio sounds abnormal (see the timing sketch after this list).
- When the voc model is hifigan and voc_pad=19, the streaming synthetic audio is consistent with the non-streaming synthetic audio; with voc_pad=14, there are no audible artifacts.
- Pad calculation method of streaming vocoder in PaddleSpeech: [AIStudio tutorial](https://aistudio.baidu.com/aistudio/projectdetail/4151335)
- Inference speed: mb_melgan > hifigan; Audio quality: mb_melgan < hifigan
- **Note:** If the service can be started normally in the container, but the client access IP is unreachable, you can try to replace the `host` address in the configuration file with the local IP address.
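As a rough illustration of how the block/pad settings translate into audio time, the sketch below assumes the default 24 kHz sample rate and 300-sample hop (12.5 ms per frame) from the feature settings; the `voc_block`/`voc_pad` values are placeholders and should be taken from your own config:
```python
# Rough sketch: audio time covered by one streaming voc chunk,
# assuming fs=24000 and n_shift=300 from the default feature settings.
fs, n_shift = 24000, 300
frame_ms = 1000 * n_shift / fs           # 12.5 ms per frame

voc_block, voc_pad = 36, 14              # placeholder values, read them from your yaml
emitted_ms = voc_block * frame_ms        # audio actually emitted per chunk
computed_ms = (voc_block + 2 * voc_pad) * frame_ms  # frames synthesized incl. pad

print(f"emitted {emitted_ms:.1f} ms, computed {computed_ms:.1f} ms per chunk")
```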
### 3. Streaming speech synthesis server and client using http protocol
#### 3.1 Server Usage
- Command Line (Recommended)
......@@ -53,7 +56,7 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
- `log_file`: log file. Default: ./log/paddlespeech.log
Output:
```text
[2022-04-24 20:05:27,887] [ INFO] - The first response time of the 0 warm up: 1.0123658180236816 s
[2022-04-24 20:05:28,038] [ INFO] - The first response time of the 1 warm up: 0.15108466148376465 s
[2022-04-24 20:05:28,191] [ INFO] - The first response time of the 2 warm up: 0.15317344665527344 s
......@@ -79,8 +82,8 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
log_file="./log/paddlespeech.log")
```
Output:
```text
[2022-04-24 21:00:16,934] [ INFO] - The first response time of the 0 warm up: 1.268730878829956 s
[2022-04-24 21:00:17,046] [ INFO] - The first response time of the 1 warm up: 0.11168622970581055 s
[2022-04-24 21:00:17,151] [ INFO] - The first response time of the 2 warm up: 0.10413002967834473 s
......@@ -93,8 +96,6 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
[2022-04-24 21:00:17] [INFO] [on.py:59] Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
[2022-04-24 21:00:17] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
```
#### 3.2 Streaming TTS client Usage
......@@ -125,7 +126,7 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
- Currently, only the single-speaker model is supported in the code, so `spk_id` does not take effect. Streaming TTS does not support changing sample rate, variable speed and volume.
Output:
```text
[2022-04-24 21:08:18,559] [ INFO] - tts http client start
[2022-04-24 21:08:21,702] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
[2022-04-24 21:08:21,703] [ INFO] - 首包响应:0.18863153457641602 s
......@@ -154,7 +155,7 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
```
Output:
```text
[2022-04-24 21:11:13,798] [ INFO] - tts http client start
[2022-04-24 21:11:16,800] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
[2022-04-24 21:11:16,801] [ INFO] - 首包响应:0.18234872817993164 s
......@@ -164,7 +165,6 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
[2022-04-24 21:11:16,837] [ INFO] - 音频保存至:./output.wav
```
### 4. Streaming speech synthesis server and client using websocket protocol
#### 4.1 Server Usage
- Command Line (Recommended)
......@@ -184,21 +184,19 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
- `log_file`: log file. Default: ./log/paddlespeech.log
Output:
```text
[2022-04-27 10:18:09,107] [ INFO] - The first response time of the 0 warm up: 1.1551103591918945 s
[2022-04-27 10:18:09,219] [ INFO] - The first response time of the 1 warm up: 0.11204338073730469 s
[2022-04-27 10:18:09,324] [ INFO] - The first response time of the 2 warm up: 0.1051797866821289 s
[2022-04-27 10:18:09,325] [ INFO] - **********************************************************************
INFO: Started server process [17600]
[2022-04-27 10:18:09] [INFO] [server.py:75] Started server process [17600]
INFO: Waiting for application startup.
[2022-04-27 10:18:09] [INFO] [on.py:45] Waiting for application startup.
INFO: Application startup complete.
[2022-04-27 10:18:09] [INFO] [on.py:59] Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
[2022-04-27 10:18:09] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
```
- Python API
......@@ -212,20 +210,19 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
```
Output:
```text
[2022-04-27 10:20:16,660] [ INFO] - The first response time of the 0 warm up: 1.0945196151733398 s
[2022-04-27 10:20:16,773] [ INFO] - The first response time of the 1 warm up: 0.11222052574157715 s
[2022-04-27 10:20:16,878] [ INFO] - The first response time of the 2 warm up: 0.10494542121887207 s
[2022-04-27 10:20:16,878] [ INFO] - **********************************************************************
INFO: Started server process [23466]
[2022-04-27 10:20:16] [INFO] [server.py:75] Started server process [23466]
INFO: Waiting for application startup.
[2022-04-27 10:20:16] [INFO] [on.py:45] Waiting for application startup.
INFO: Application startup complete.
[2022-04-27 10:20:16] [INFO] [on.py:59] Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
[2022-04-27 10:20:16] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
```
#### 4.2 Streaming TTS client Usage
......@@ -258,7 +255,7 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
Output:
```text
[2022-04-27 10:21:04,262] [ INFO] - tts websocket client start
[2022-04-27 10:21:04,496] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
[2022-04-27 10:21:04,496] [ INFO] - 首包响应:0.2124948501586914 s
......@@ -266,7 +263,6 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
[2022-04-27 10:21:07,484] [ INFO] - 音频时长:3.825 s
[2022-04-27 10:21:07,484] [ INFO] - RTF: 0.8363677006141812
[2022-04-27 10:21:07,516] [ INFO] - 音频保存至:output.wav
```
- Python API
......@@ -283,21 +279,15 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
spk_id=0,
output="./output.wav",
play=False)
```
Output:
```text
[2022-04-27 10:22:48,852] [ INFO] - tts websocket client start
[2022-04-27 10:22:49,080] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
[2022-04-27 10:22:49,080] [ INFO] - 首包响应:0.21017956733703613 s
[2022-04-27 10:22:52,100] [ INFO] - 尾包响应:3.2304444313049316 s
[2022-04-27 10:22:52,101] [ INFO] - 音频时长:3.825 s
[2022-04-27 10:22:52,101] [ INFO] - RTF: 0.8445606356352762
[2022-04-27 10:22:52,134] [ INFO] - 音频保存至:./output.wav
```
......@@ -3,15 +3,19 @@
# 流式语音合成服务
## 介绍
这个 demo 是一个启动流式语音合成服务和访问该服务的实现。它可以通过使用 `paddlespeech_server` 和 `paddlespeech_client` 的单个命令或 python 的几行代码来实现。
服务接口定义请参考:
- [PaddleSpeech Server RESTful API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-RESTful-API)
- [PaddleSpeech Streaming Server WebSocket API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-WebSocket-API)
## 使用方法
### 1. 安装
请看 [安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
推荐使用 **paddlepaddle 2.3.1** 或以上版本。
你可以从简单、中等、困难几种方式中选择一种方式安装 PaddleSpeech。
**如果使用简单模式安装,需要自行准备 yaml 文件,可参考 conf 目录下的 yaml 文件。**
### 2. 准备配置文件
......@@ -20,19 +24,20 @@
- `engine_list` 表示即将启动的服务将会包含的语音引擎,格式为 <语音任务>_<引擎类型>
- 该 demo 主要介绍流式语音合成服务,因此语音任务应设置为 tts。
- 目前引擎类型支持两种形式:**online** 表示使用 python 进行动态图推理的引擎;**online-onnx** 表示使用 onnxruntime 进行推理的引擎。其中,online-onnx 的推理速度更快。
- 流式 TTS 引擎的 AM 模型支持:**fastspeech2 以及 fastspeech2_cnndecoder**; Voc 模型支持:**hifigan, mb_melgan**
- 流式 am 推理中,每次会对一个 chunk 的数据进行推理以达到流式的效果。其中 `am_block` 表示 chunk 中的有效帧数,`am_pad` 表示一个 chunk 中 am_block 前后各加的帧数。am_pad 的存在用于消除流式推理产生的误差,避免由流式推理对合成音频质量的影响。
- fastspeech2 不支持流式 am 推理,因此 am_pad 与 am_block 对它无效
- fastspeech2_cnndecoder 支持流式推理,当 am_pad=12 时,流式推理合成音频与非流式合成音频一致
- 流式 voc 推理中,每次会对一个 chunk 的数据进行推理以达到流式的效果。其中 `voc_block` 表示 chunk 中的有效帧数,`voc_pad` 表示一个 chunk 中 voc_block 前后各加的帧数。voc_pad 的存在用于消除流式推理产生的误差,避免由流式推理对合成音频质量的影响。
- hifigan, mb_melgan 均支持流式 voc 推理
- 当 voc 模型为 mb_melgan,当 voc_pad=14 时,流式推理合成音频与非流式合成音频一致;voc_pad 最小可以设置为 7,合成音频听感上没有异常,若 voc_pad 小于 7,合成音频听感上存在异常(可参考本列表后的时长估算示意)。
- 当 voc 模型为 hifigan,当 voc_pad=19 时,流式推理合成音频与非流式合成音频一致;当 voc_pad=14 时,合成音频听感上没有异常。
- PaddleSpeech 中流式声码器 Pad 计算方法: [AIStudio 教程](https://aistudio.baidu.com/aistudio/projectdetail/4151335)
- 推理速度:mb_melgan > hifigan; 音频质量:mb_melgan < hifigan
- **注意:** 如果在容器里可正常启动服务,但客户端访问 ip 不可达,可尝试将配置文件中 `host` 地址换成本地 ip 地址。
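下面用几行 Python 粗略估算一个 voc chunk 对应的音频时长,以及为消除误差额外计算的部分(假设采用默认特征配置中的 24k 采样率、300 点帧移;voc_block/voc_pad 的取值仅为示意,请以实际配置文件为准):
```python
fs, n_shift = 24000, 300                 # 默认特征配置:24k 采样率,帧移 300 点
frame_ms = 1000 * n_shift / fs           # 每帧 12.5 ms

voc_block, voc_pad = 36, 14              # 示意取值,请以实际配置文件为准
print(f"每个 chunk 输出音频 {voc_block * frame_ms:.1f} ms,"
      f"实际计算 {(voc_block + 2 * voc_pad) * frame_ms:.1f} ms")
```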
### 3. 使用 http 协议的流式语音合成服务端及客户端使用方法
#### 3.1 服务端使用方法
- 命令行 (推荐使用)
......@@ -51,7 +56,7 @@
- `log_file`: log 文件. 默认:./log/paddlespeech.log
输出:
```text
[2022-04-24 20:05:27,887] [ INFO] - The first response time of the 0 warm up: 1.0123658180236816 s
[2022-04-24 20:05:28,038] [ INFO] - The first response time of the 1 warm up: 0.15108466148376465 s
[2022-04-24 20:05:28,191] [ INFO] - The first response time of the 2 warm up: 0.15317344665527344 s
......@@ -64,7 +69,6 @@
[2022-04-24 20:05:28] [INFO] [on.py:59] Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
[2022-04-24 20:05:28] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
```
- Python API
......@@ -77,8 +81,8 @@
log_file="./log/paddlespeech.log")
```
输出:
```text
[2022-04-24 21:00:16,934] [ INFO] - The first response time of the 0 warm up: 1.268730878829956 s
[2022-04-24 21:00:17,046] [ INFO] - The first response time of the 1 warm up: 0.11168622970581055 s
[2022-04-24 21:00:17,151] [ INFO] - The first response time of the 2 warm up: 0.10413002967834473 s
......@@ -91,8 +95,6 @@
[2022-04-24 21:00:17] [INFO] [on.py:59] Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
[2022-04-24 21:00:17] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
```
#### 3.2 客户端使用方法
......@@ -124,7 +126,7 @@
输出:
```text
[2022-04-24 21:08:18,559] [ INFO] - tts http client start
[2022-04-24 21:08:21,702] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
[2022-04-24 21:08:21,703] [ INFO] - 首包响应:0.18863153457641602 s
......@@ -162,9 +164,8 @@
[2022-04-24 21:11:16,802] [ INFO] - RTF: 0.7846773683635238
[2022-04-24 21:11:16,837] [ INFO] - 音频保存至:./output.wav
```
### 4. 使用 websocket 协议的流式语音合成服务端及客户端使用方法
#### 4.1 服务端使用方法
- 命令行 (推荐使用)
首先修改配置文件 `conf/tts_online_application.yaml`,**将 `protocol` 设置为 `websocket`**。
......@@ -183,21 +184,19 @@
- `log_file`: log 文件. 默认:./log/paddlespeech.log
输出:
```text
[2022-04-27 10:18:09,107] [ INFO] - The first response time of the 0 warm up: 1.1551103591918945 s
[2022-04-27 10:18:09,219] [ INFO] - The first response time of the 1 warm up: 0.11204338073730469 s
[2022-04-27 10:18:09,324] [ INFO] - The first response time of the 2 warm up: 0.1051797866821289 s
[2022-04-27 10:18:09,325] [ INFO] - **********************************************************************
INFO: Started server process [17600]
[2022-04-27 10:18:09] [INFO] [server.py:75] Started server process [17600]
INFO: Waiting for application startup.
[2022-04-27 10:18:09] [INFO] [on.py:45] Waiting for application startup.
INFO: Application startup complete.
[2022-04-27 10:18:09] [INFO] [on.py:59] Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
[2022-04-27 10:18:09] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
```
- Python API
......@@ -210,27 +209,26 @@
log_file="./log/paddlespeech.log")
```
输出:
```text
[2022-04-27 10:20:16,660] [ INFO] - The first response time of the 0 warm up: 1.0945196151733398 s
[2022-04-27 10:20:16,773] [ INFO] - The first response time of the 1 warm up: 0.11222052574157715 s
[2022-04-27 10:20:16,878] [ INFO] - The first response time of the 2 warm up: 0.10494542121887207 s
[2022-04-27 10:20:16,878] [ INFO] - **********************************************************************
INFO: Started server process [23466]
[2022-04-27 10:20:16] [INFO] [server.py:75] Started server process [23466]
INFO: Waiting for application startup.
[2022-04-27 10:20:16] [INFO] [on.py:45] Waiting for application startup.
INFO: Application startup complete.
[2022-04-27 10:20:16] [INFO] [on.py:59] Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
[2022-04-27 10:20:16] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
```
#### 4.2 客户端使用方法
- 命令行 (推荐使用)
访问 websocket 流式 TTS 服务:
若 `127.0.0.1` 不能访问,则需要使用实际服务 IP 地址
......@@ -255,9 +253,8 @@
- 目前代码中只支持单说话人的模型,因此 spk_id 的选择并不生效。流式 TTS 不支持更换采样率,变速和变音量等功能。
输出:
```text
[2022-04-27 10:21:04,262] [ INFO] - tts websocket client start
[2022-04-27 10:21:04,496] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
[2022-04-27 10:21:04,496] [ INFO] - 首包响应:0.2124948501586914 s
......@@ -265,7 +262,6 @@
[2022-04-27 10:21:07,484] [ INFO] - 音频时长:3.825 s
[2022-04-27 10:21:07,484] [ INFO] - RTF: 0.8363677006141812
[2022-04-27 10:21:07,516] [ INFO] - 音频保存至:output.wav
```
- Python API
......@@ -282,11 +278,10 @@
spk_id=0,
output="./output.wav",
play=False)
```
输出:
```text
[2022-04-27 10:22:48,852] [ INFO] - tts websocket client start
[2022-04-27 10:22:49,080] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
[2022-04-27 10:22:49,080] [ INFO] - 首包响应:0.21017956733703613 s
......@@ -294,8 +289,4 @@
[2022-04-27 10:22:52,101] [ INFO] - 音频时长:3.825 s
[2022-04-27 10:22:52,101] [ INFO] - RTF: 0.8445606356352762
[2022-04-27 10:22:52,134] [ INFO] - 音频保存至:./output.wav
```
FROM registry.baidubce.com/paddlepaddle/paddle:2.2.2
LABEL maintainer="paddlesl@baidu.com"
RUN apt-get update \
&& apt-get install -y libsndfile-dev \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
RUN git clone --depth 1 https://github.com/PaddlePaddle/PaddleSpeech.git /home/PaddleSpeech
RUN pip3 uninstall mccabe -y ; exit 0;
RUN pip3 install multiprocess==0.70.12 importlib-metadata==4.2.0 dill==0.3.4
WORKDIR /home/PaddleSpeech/audio
RUN python setup.py bdist_wheel
WORKDIR /home/PaddleSpeech/
RUN python setup.py bdist_wheel
RUN pip install audio/dist/*.whl dist/*.whl
CMD ["bash"]
......@@ -48,4 +48,5 @@ fastapi
websockets
keyboard
uvicorn
pattern_singleton
braceexpand
\ No newline at end of file
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # sr
n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
# Only used for feats_type != raw
fmin: 80 # Minimum frequency of Mel basis.
fmax: 7600 # Maximum frequency of Mel basis.
n_mels: 80 # The number of mel basis.
mean_phn_span: 8
mlm_prob: 0.8
###########################################################
# DATA SETTING #
###########################################################
batch_size: 20
num_workers: 2
###########################################################
# MODEL SETTING #
###########################################################
model:
text_masking: false
postnet_layers: 5
postnet_filts: 5
postnet_chans: 256
encoder_type: conformer
decoder_type: conformer
enc_input_layer: sega_mlm
enc_pre_speech_layer: 0
enc_cnn_module_kernel: 7
enc_attention_dim: 384
enc_attention_heads: 2
enc_linear_units: 1536
enc_num_blocks: 4
enc_dropout_rate: 0.2
enc_positional_dropout_rate: 0.2
enc_attention_dropout_rate: 0.2
enc_normalize_before: true
enc_macaron_style: true
enc_use_cnn_module: true
enc_selfattention_layer_type: legacy_rel_selfattn
enc_activation_type: swish
enc_pos_enc_layer_type: legacy_rel_pos
enc_positionwise_layer_type: conv1d
enc_positionwise_conv_kernel_size: 3
dec_cnn_module_kernel: 31
dec_attention_dim: 384
dec_attention_heads: 2
dec_linear_units: 1536
dec_num_blocks: 4
dec_dropout_rate: 0.2
dec_positional_dropout_rate: 0.2
dec_attention_dropout_rate: 0.2
dec_macaron_style: true
dec_use_cnn_module: true
dec_selfattention_layer_type: legacy_rel_selfattn
dec_activation_type: swish
dec_pos_enc_layer_type: legacy_rel_pos
dec_positionwise_layer_type: conv1d
dec_positionwise_conv_kernel_size: 3
###########################################################
# OPTIMIZER SETTING #
###########################################################
scheduler_params:
d_model: 384
warmup_steps: 4000
grad_clip: 1.0
###########################################################
# TRAINING SETTING #
###########################################################
max_epoch: 1500
num_snapshots: 50
###########################################################
# OTHER SETTING #
###########################################################
seed: 0
token_list:
- <blank>
- <unk>
- d
- sp
- sh
- ii
- j
- zh
- l
- x
- b
- g
- uu
- e5
- h
- q
- m
- i1
- t
- z
- ch
- f
- s
- u4
- ix4
- i4
- n
- i3
- iu3
- vv
- ian4
- ix2
- r
- e4
- ai4
- k
- ing2
- a1
- en2
- ui4
- ong1
- uo3
- u2
- u3
- ao4
- ee
- p
- an1
- eng2
- i2
- in1
- c
- ai2
- ian2
- e2
- an4
- ing4
- v4
- ai3
- a5
- ian3
- eng1
- ong4
- ang4
- ian1
- ing1
- iy4
- ao3
- ang1
- uo4
- u1
- iao4
- iu4
- a4
- van2
- ie4
- ang2
- ou4
- iang4
- ix1
- er4
- iy1
- e1
- en1
- ui2
- an3
- ei4
- ong2
- uo1
- ou3
- uo2
- iao1
- ou1
- an2
- uan4
- ia4
- ia1
- ang3
- v3
- iu2
- iao3
- in4
- a3
- ei3
- iang3
- v2
- eng4
- en3
- aa
- uan1
- v1
- ao1
- ve4
- ie3
- ai1
- ing3
- iang1
- a2
- ui1
- en4
- en5
- in3
- uan3
- e3
- ie1
- ve2
- ei2
- in2
- ix3
- uan2
- iang2
- ie2
- ua4
- ou2
- uai4
- er2
- eng3
- uang3
- un1
- ong3
- uang4
- vn4
- un2
- iy3
- iz4
- ui3
- iao2
- iong4
- un4
- van4
- ao2
- uang1
- iy5
- o2
- ei1
- ua1
- iu1
- uang2
- er5
- o1
- un3
- vn1
- vn2
- o4
- ve1
- van3
- ua2
- er3
- iong3
- van1
- ia2
- iy2
- ia3
- iong1
- uo5
- oo
- ve3
- ou5
- uai3
- ian5
- iong2
- uai2
- uai1
- ua3
- vn3
- ia5
- ie5
- ueng1
- o5
- o3
- iang5
- ei5
- <sos/eos>
\ No newline at end of file
#!/bin/bash
stage=0
stop_stage=100
config_path=$1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# get durations from MFA's result
echo "Generate durations.txt from MFA results ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./aishell3_alignment_tone \
--output durations.txt \
--config=${config_path}
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/preprocess.py \
--dataset=aishell3 \
--rootdir=~/datasets/data_aishell3/ \
--dumpdir=dump \
--dur-file=durations.txt \
--config=${config_path} \
--num-cpu=20 \
--cut-sil=True
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# get features' stats(mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="speech"
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize and covert phone/speaker to id, dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
fi
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
stage=1
stop_stage=1
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize.py \
--erniesat_config=${config_path} \
--erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--erniesat_stat=dump/train/speech_stats.npy \
--voc=pwgan_aishell3 \
--voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
--voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
--voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
fi
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize.py \
--erniesat_config=${config_path} \
--erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--erniesat_stat=dump/train/speech_stats.npy \
--voc=hifigan_aishell3 \
--voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \
--voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
fi
#!/bin/bash
config_path=$1
train_output_path=$2
python3 ${BIN_DIR}/train.py \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=${config_path} \
--output-dir=${train_output_path} \
--ngpu=2 \
--phones-dict=dump/phone_id_map.txt
\ No newline at end of file
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=ernie_sat
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}
\ No newline at end of file
#!/bin/bash
set -e
source path.sh
gpus=0,1
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_153.pdz
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this can not be mixed use with `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# synthesize, vocoder is pwgan
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
......@@ -94,8 +94,8 @@ updater:
# OPTIMIZER SETTING #
###########################################################
optimizer:
optim: adam # optimizer type
learning_rate: 0.001 # learning rate
###########################################################
# TRAINING SETTING #
......
......@@ -88,8 +88,8 @@ updater:
# OPTIMIZER SETTING #
###########################################################
optimizer:
optim: adam # optimizer type
learning_rate: 0.001 # learning rate
###########################################################
# TRAINING SETTING #
......
......@@ -37,7 +37,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
--am_stat=dump/train/speech_stats.npy \
--voc=hifigan_aishell3 \
--voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \
--voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
......
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # sr
n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
# Only used for feats_type != raw
fmin: 80 # Minimum frequency of Mel basis.
fmax: 7600 # Maximum frequency of Mel basis.
n_mels: 80 # The number of mel basis.
mean_phn_span: 8
mlm_prob: 0.8
###########################################################
# DATA SETTING #
###########################################################
batch_size: 20
num_workers: 2
###########################################################
# MODEL SETTING #
###########################################################
model:
text_masking: true
postnet_layers: 5
postnet_filts: 5
postnet_chans: 256
encoder_type: conformer
decoder_type: conformer
enc_input_layer: sega_mlm
enc_pre_speech_layer: 0
enc_cnn_module_kernel: 7
enc_attention_dim: 384
enc_attention_heads: 2
enc_linear_units: 1536
enc_num_blocks: 4
enc_dropout_rate: 0.2
enc_positional_dropout_rate: 0.2
enc_attention_dropout_rate: 0.2
enc_normalize_before: true
enc_macaron_style: true
enc_use_cnn_module: true
enc_selfattention_layer_type: legacy_rel_selfattn
enc_activation_type: swish
enc_pos_enc_layer_type: legacy_rel_pos
enc_positionwise_layer_type: conv1d
enc_positionwise_conv_kernel_size: 3
dec_cnn_module_kernel: 31
dec_attention_dim: 384
dec_attention_heads: 2
dec_linear_units: 1536
dec_num_blocks: 4
dec_dropout_rate: 0.2
dec_positional_dropout_rate: 0.2
dec_attention_dropout_rate: 0.2
dec_macaron_style: true
dec_use_cnn_module: true
dec_selfattention_layer_type: legacy_rel_selfattn
dec_activation_type: swish
dec_pos_enc_layer_type: legacy_rel_pos
dec_positionwise_layer_type: conv1d
dec_positionwise_conv_kernel_size: 3
###########################################################
# OPTIMIZER SETTING #
###########################################################
scheduler_params:
d_model: 384
warmup_steps: 4000
grad_clip: 1.0
###########################################################
# TRAINING SETTING #
###########################################################
max_epoch: 700
num_snapshots: 50
###########################################################
# OTHER SETTING #
###########################################################
seed: 0
token_list:
- <blank>
- <unk>
- AH0
- T
- N
- sp
- S
- R
- D
- L
- Z
- DH
- IH1
- K
- W
- M
- EH1
- AE1
- ER0
- B
- IY1
- P
- V
- IY0
- F
- HH
- AA1
- AY1
- AH1
- EY1
- IH0
- AO1
- OW1
- UW1
- G
- NG
- SH
- Y
- TH
- ER1
- JH
- UH1
- AW1
- CH
- IH2
- OW0
- OW2
- EY2
- EH2
- UW0
- OY1
- ZH
- EH0
- AY2
- AW2
- AA2
- AE2
- IY2
- AH2
- AE0
- AO2
- AY0
- AO0
- UW2
- UH2
- AA0
- EY0
- AW0
- UH0
- ER2
- OY2
- OY0
- d
- sh
- ii
- j
- zh
- l
- x
- b
- g
- uu
- e5
- h
- q
- m
- i1
- t
- z
- ch
- f
- s
- u4
- ix4
- i4
- n
- i3
- iu3
- vv
- ian4
- ix2
- r
- e4
- ai4
- k
- ing2
- a1
- en2
- ui4
- ong1
- uo3
- u2
- u3
- ao4
- ee
- p
- an1
- eng2
- i2
- in1
- c
- ai2
- ian2
- e2
- an4
- ing4
- v4
- ai3
- a5
- ian3
- eng1
- ong4
- ang4
- ian1
- ing1
- iy4
- ao3
- ang1
- uo4
- u1
- iao4
- iu4
- a4
- van2
- ie4
- ang2
- ou4
- iang4
- ix1
- er4
- iy1
- e1
- en1
- ui2
- an3
- ei4
- ong2
- uo1
- ou3
- uo2
- iao1
- ou1
- an2
- uan4
- ia4
- ia1
- ang3
- v3
- iu2
- iao3
- in4
- a3
- ei3
- iang3
- v2
- eng4
- en3
- aa
- uan1
- v1
- ao1
- ve4
- ie3
- ai1
- ing3
- iang1
- a2
- ui1
- en4
- en5
- in3
- uan3
- e3
- ie1
- ve2
- ei2
- in2
- ix3
- uan2
- iang2
- ie2
- ua4
- ou2
- uai4
- er2
- eng3
- uang3
- un1
- ong3
- uang4
- vn4
- un2
- iy3
- iz4
- ui3
- iao2
- iong4
- un4
- van4
- ao2
- uang1
- iy5
- o2
- ei1
- ua1
- iu1
- uang2
- er5
- o1
- un3
- vn1
- vn2
- o4
- ve1
- van3
- ua2
- er3
- iong3
- van1
- ia2
- iy2
- ia3
- iong1
- uo5
- oo
- ve3
- ou5
- uai3
- ian5
- iong2
- uai2
- uai1
- ua3
- vn3
- ia5
- ie5
- ueng1
- o5
- o3
- iang5
- ei5
- <sos/eos>
#!/bin/bash
stage=0
stop_stage=100
config_path=$1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# get durations from MFA's result
echo "Generate durations.txt from MFA results for aishell3 ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./aishell3_alignment_tone \
--output durations_aishell3.txt \
--config=${config_path}
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# get durations from MFA's result
echo "Generate durations.txt from MFA results for vctk ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./vctk_alignment \
--output durations_vctk.txt \
--config=${config_path}
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# get durations from MFA's result
echo "concat durations_aishell3.txt and durations_vctk.txt to durations.txt"
cat durations_aishell3.txt durations_vctk.txt > durations.txt
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/preprocess.py \
--dataset=aishell3 \
--rootdir=~/datasets/data_aishell3/ \
--dumpdir=dump \
--dur-file=durations.txt \
--config=${config_path} \
--num-cpu=20 \
--cut-sil=True
fi
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/preprocess.py \
--dataset=vctk \
--rootdir=~/datasets/VCTK-Corpus-0.92/ \
--dumpdir=dump \
--dur-file=durations.txt \
--config=${config_path} \
--num-cpu=20 \
--cut-sil=True
fi
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# get features' stats(mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="speech"
fi
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
# normalize and covert phone/speaker to id, dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
fi
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
stage=1
stop_stage=1
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize.py \
--erniesat_config=${config_path} \
--erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--erniesat_stat=dump/train/speech_stats.npy \
--voc=pwgan_aishell3 \
--voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
--voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
--voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
fi
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize.py \
--erniesat_config=${config_path} \
--erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--erniesat_stat=dump/train/speech_stats.npy \
--voc=hifigan_aishell3 \
--voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \
--voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
fi
#!/bin/bash
config_path=$1
train_output_path=$2
python3 ${BIN_DIR}/train.py \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=${config_path} \
--output-dir=${train_output_path} \
--ngpu=2 \
--phones-dict=dump/phone_id_map.txt
\ No newline at end of file
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=ernie_sat
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}
\ No newline at end of file
#!/bin/bash
set -e
source path.sh
gpus=0,1
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_153.pdz
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this can not be mixed use with `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# synthesize, vocoder is pwgan
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
......@@ -21,22 +21,22 @@ num_workers: 4
# MODEL SETTING #
###########################################################
model:
encoder_hidden_size: 128
encoder_kernel_size: 3
encoder_dilations: [1, 3, 9, 27, 1, 3, 9, 27, 1, 1]
duration_predictor_hidden_size: 128
decoder_hidden_size: 128
decoder_output_size: 80
decoder_kernel_size: 3
decoder_dilations: [1, 3, 9, 27, 1, 3, 9, 27, 1, 3, 9, 27, 1, 3, 9, 27, 1, 1]
###########################################################
# OPTIMIZER SETTING #
###########################################################
optimizer:
optim: adam # optimizer type
learning_rate: 0.002 # learning rate
max_grad_norm: 1
###########################################################
# TRAINING SETTING #
......
......@@ -29,7 +29,7 @@ generator_params:
out_channels: 4 # Number of output channels.
kernel_size: 7 # Kernel size of initial and final conv layers.
channels: 384 # Initial number of channels for conv layers.
upsample_scales: [5, 5, 3] # List of Upsampling scales. prod(upsample_scales) x out_channels == n_shift
stack_kernel_size: 3 # Kernel size of dilated conv layers in residual stack.
stacks: 4 # Number of stacks in a single residual stack module.
use_weight_norm: True # Whether to use weight normalization.
......
......@@ -29,7 +29,7 @@ generator_params:
out_channels: 4 # Number of output channels.
kernel_size: 7 # Kernel size of initial and final conv layers.
channels: 384 # Initial number of channels for conv layers.
upsample_scales: [5, 5, 3] # List of Upsampling scales. prod(upsample_scales) x out_channels == n_shift
stack_kernel_size: 3 # Kernel size of dilated conv layers in residual stack.
stacks: 4 # Number of stacks in a single residual stack module.
use_weight_norm: True # Whether to use weight normalization.
......
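The constraint in the `upsample_scales` comment above can be sanity-checked with the values from this config (`upsample_scales=[5, 5, 3]`, `out_channels=4`, `n_shift=300`):
```python
import math  # math.prod requires Python 3.8+

# 5 * 5 * 3 = 75x upsampling, times 4 output channels -> 300 samples per hop
upsample_scales, out_channels, n_shift = [5, 5, 3], 4, 300
assert math.prod(upsample_scales) * out_channels == n_shift
```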
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Usage:
align.py wavfile trsfile outwordfile outphonefile
"""
......
#!/usr/bin/env python3
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import random
from typing import Dict
......@@ -305,7 +317,6 @@ def get_dur_adj_factor(orig_dur: List[int],
def prep_feats_with_dur(wav_path: str,
mlm_model: nn.Layer,
source_lang: str="English",
target_lang: str="English",
old_str: str="",
......@@ -425,8 +436,7 @@ def prep_feats_with_dur(wav_path: str,
return new_wav, new_phns, new_mfa_start, new_mfa_end, old_span_bdy, new_span_bdy
def prep_feats(wav_path: str,
source_lang: str="english",
target_lang: str="english",
old_str: str="",
......@@ -440,7 +450,6 @@ def prep_feats(mlm_model: nn.Layer,
wav, phns, mfa_start, mfa_end, old_span_bdy, new_span_bdy = prep_feats_with_dur(
source_lang=source_lang,
target_lang=target_lang,
old_str=old_str,
new_str=new_str,
wav_path=wav_path,
......@@ -482,7 +491,6 @@ def decode_with_model(mlm_model: nn.Layer,
batch, old_span_bdy, new_span_bdy = prep_feats(
source_lang=source_lang,
target_lang=target_lang,
wav_path=wav_path,
old_str=old_str,
new_str=new_str,
......
此差异已折叠。
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
......
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pathlib import Path
from typing import Dict
from typing import List
......
#!/bin/bash
set -e
source path.sh
# en --> zh speech synthesis
# Use the speech of Prompt_003_new ("This was not the show for me.") as the prompt to synthesize: '今天天气很好'
# Note: the input new_str must be Chinese characters; otherwise preprocessing keeps only the Chinese characters and synthesizes the preprocessed Chinese text.
python local/inference_new.py \
--task_name=cross-lingual_clone \
--model_name=paddle_checkpoint_dual_mask_enzh \
--uid=Prompt_003_new \
--new_str='今天天气很好.' \
--prefix='./prompt/dev/' \
--source_lang=english \
--target_lang=chinese \
--output_name=pred_clone.wav \
--voc=pwgan_aishell3 \
--voc_config=download/pwg_aishell3_ckpt_0.5/default.yaml \
--voc_ckpt=download/pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
--voc_stat=download/pwg_aishell3_ckpt_0.5/feats_stats.npy \
--am=fastspeech2_csmsc \
--am_config=download/fastspeech2_conformer_baker_ckpt_0.5/conformer.yaml \
--am_ckpt=download/fastspeech2_conformer_baker_ckpt_0.5/snapshot_iter_76000.pdz \
--am_stat=download/fastspeech2_conformer_baker_ckpt_0.5/speech_stats.npy \
--phones_dict=download/fastspeech2_conformer_baker_ckpt_0.5/phone_id_map.txt
#!/bin/bash
set -e
source path.sh
# English-only speech synthesis
# Example: use the speech of p299_096 ("This was not the show for me.") as the prompt to synthesize: 'I enjoy my life, do you?'
python local/inference_new.py \
--task_name=synthesize \
--model_name=paddle_checkpoint_en \
--uid=p299_096 \
--new_str='I enjoy my life, do you?' \
--prefix='./prompt/dev/' \
--source_lang=english \
--target_lang=english \
--output_name=pred_gen.wav \
--voc=pwgan_aishell3 \
--voc_config=download/pwg_aishell3_ckpt_0.5/default.yaml \
--voc_ckpt=download/pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
--voc_stat=download/pwg_aishell3_ckpt_0.5/feats_stats.npy \
--am=fastspeech2_ljspeech \
--am_config=download/fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml \
--am_ckpt=download/fastspeech2_nosil_ljspeech_ckpt_0.5/snapshot_iter_100000.pdz \
--am_stat=download/fastspeech2_nosil_ljspeech_ckpt_0.5/speech_stats.npy \
--phones_dict=download/fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt
#!/bin/bash
set -e
source path.sh
# English-only speech editing
# Example: edit the original speech of p243_new ("For that reason cover should not be given.") into the speech for 'for that reason cover is impossible to be given.'
# NOTE: the speech editing task currently supports replacing or inserting text at only one position in a sentence.
python local/inference_new.py \
--task_name=edit \
--model_name=paddle_checkpoint_en \
--uid=p243_new \
--new_str='for that reason cover is impossible to be given.' \
--prefix='./prompt/dev/' \
--source_lang=english \
--target_lang=english \
--output_name=pred_edit.wav \
--voc=pwgan_aishell3 \
--voc_config=download/pwg_aishell3_ckpt_0.5/default.yaml \
--voc_ckpt=download/pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
--voc_stat=download/pwg_aishell3_ckpt_0.5/feats_stats.npy \
--am=fastspeech2_ljspeech \
--am_config=download/fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml \
--am_ckpt=download/fastspeech2_nosil_ljspeech_ckpt_0.5/snapshot_iter_100000.pdz \
--am_stat=download/fastspeech2_nosil_ljspeech_ckpt_0.5/speech_stats.npy \
--phones_dict=download/fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt
#!/bin/bash
rm -rf *.wav
./run_sedit_en_new.sh # speech editing task (English)
./run_gen_en_new.sh # personalized speech synthesis task (English)
./run_clone_en_to_zh_new.sh # cross-lingual speech synthesis task (English-to-Chinese voice cloning)
\ No newline at end of file
......@@ -29,7 +29,7 @@ optimizer_params:
scheduler_params:
learning_rate: 1.0e-5 # learning rate.
gamma: 1.0 # scheduler gamma.
gamma: 0.9999 # scheduler gamma, must be in (0.0, 1.0); closer to 1.0 is better.
###########################################################
# TRAINING SETTING #
......
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # sample rate (Hz)
n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
# Only used for feats_type != raw
fmin: 80 # Minimum frequency of Mel basis.
fmax: 7600 # Maximum frequency of Mel basis.
n_mels: 80 # The number of mel basis.
mean_phn_span: 8
mlm_prob: 0.8
###########################################################
# DATA SETTING #
###########################################################
batch_size: 20
num_workers: 2
###########################################################
# MODEL SETTING #
###########################################################
model:
text_masking: false
postnet_layers: 5
postnet_filts: 5
postnet_chans: 256
encoder_type: conformer
decoder_type: conformer
enc_input_layer: sega_mlm
enc_pre_speech_layer: 0
enc_cnn_module_kernel: 7
enc_attention_dim: 384
enc_attention_heads: 2
enc_linear_units: 1536
enc_num_blocks: 4
enc_dropout_rate: 0.2
enc_positional_dropout_rate: 0.2
enc_attention_dropout_rate: 0.2
enc_normalize_before: true
enc_macaron_style: true
enc_use_cnn_module: true
enc_selfattention_layer_type: legacy_rel_selfattn
enc_activation_type: swish
enc_pos_enc_layer_type: legacy_rel_pos
enc_positionwise_layer_type: conv1d
enc_positionwise_conv_kernel_size: 3
dec_cnn_module_kernel: 31
dec_attention_dim: 384
dec_attention_heads: 2
dec_linear_units: 1536
dec_num_blocks: 4
dec_dropout_rate: 0.2
dec_positional_dropout_rate: 0.2
dec_attention_dropout_rate: 0.2
dec_macaron_style: true
dec_use_cnn_module: true
dec_selfattention_layer_type: legacy_rel_selfattn
dec_activation_type: swish
dec_pos_enc_layer_type: legacy_rel_pos
dec_positionwise_layer_type: conv1d
dec_positionwise_conv_kernel_size: 3
###########################################################
# OPTIMIZER SETTING #
###########################################################
scheduler_params:
d_model: 384
warmup_steps: 4000
grad_clip: 1.0
###########################################################
# TRAINING SETTING #
###########################################################
max_epoch: 1500
num_snapshots: 50
###########################################################
# OTHER SETTING #
###########################################################
seed: 0
token_list:
- <blank>
- <unk>
- AH0
- T
- N
- sp
- D
- S
- R
- L
- IH1
- DH
- AE1
- M
- EH1
- K
- Z
- W
- HH
- ER0
- AH1
- IY1
- P
- V
- F
- B
- AY1
- IY0
- EY1
- AA1
- AO1
- UW1
- IH0
- OW1
- NG
- G
- SH
- ER1
- Y
- TH
- AW1
- CH
- UH1
- IH2
- JH
- OW0
- EH2
- OY1
- AY2
- EH0
- EY2
- UW0
- AE2
- AA2
- OW2
- AH2
- ZH
- AO2
- IY2
- AE0
- UW2
- AY0
- AA0
- AO0
- AW2
- EY0
- UH2
- ER2
- OY2
- UH0
- AW0
- OY0
- <sos/eos>
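As a quick sanity check on the feature-extraction settings above (a minimal sketch; `fs`, `n_shift`, and `win_length` are taken directly from the config):
```python
fs = 24000         # sample rate (Hz)
n_shift = 300      # hop size (samples)
win_length = 1200  # window length (samples)

# frame hop and window duration in milliseconds
print(n_shift / fs * 1000)     # 12.5 -> matches the "12.5ms" comment
print(win_length / fs * 1000)  # 50.0 -> matches the "50ms" comment
```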
#!/bin/bash
stage=0
stop_stage=100
config_path=$1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# get durations from MFA's result
echo "Generate durations.txt from MFA results ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./vctk_alignment \
--output durations.txt \
--config=${config_path}
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/preprocess.py \
--dataset=vctk \
--rootdir=~/datasets/VCTK-Corpus-0.92/ \
--dumpdir=dump \
--dur-file=durations.txt \
--config=${config_path} \
--num-cpu=20 \
--cut-sil=True
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# get features' stats (mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="speech"
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize and convert phone/speaker to id; dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
fi
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
stage=1
stop_stage=1
# use am to predict duration here
# Add am_phones_dict, am_tones_dict, etc.; the am could also be built the new way, which would no longer need this many parameters.
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize.py \
--erniesat_config=${config_path} \
--erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--erniesat_stat=dump/train/speech_stats.npy \
--voc=pwgan_vctk \
--voc_config=pwg_vctk_ckpt_0.1.1/default.yaml \
--voc_ckpt=pwg_vctk_ckpt_0.1.1/snapshot_iter_1500000.pdz \
--voc_stat=pwg_vctk_ckpt_0.1.1/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
fi
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize.py \
--erniesat_config=${config_path} \
--erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--erniesat_stat=dump/train/speech_stats.npy \
--voc=hifigan_vctk \
--voc_config=hifigan_vctk_ckpt_0.2.0/default.yaml \
--voc_ckpt=hifigan_vctk_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_vctk_ckpt_0.2.0/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
fi
#!/bin/bash
config_path=$1
train_output_path=$2
python3 ${BIN_DIR}/train.py \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=${config_path} \
--output-dir=${train_output_path} \
--ngpu=2 \
--phones-dict=dump/phone_id_map.txt
\ No newline at end of file
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=ernie_sat
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}
\ No newline at end of file
#!/bin/bash
set -e
source path.sh
gpus=0,1
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_153.pdz
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# synthesize, vocoder is pwgan
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
......@@ -24,7 +24,7 @@ f0max: 400 # Maximum f0 for pitch extraction.
# DATA SETTING #
###########################################################
batch_size: 64
num_workers: 4
num_workers: 2
###########################################################
......@@ -88,8 +88,8 @@ updater:
# OPTIMIZER SETTING #
###########################################################
optimizer:
optim: adam # optimizer type
learning_rate: 0.001 # learning rate
optim: adam # optimizer type
learning_rate: 0.001 # learning rate
###########################################################
# TRAINING SETTING #
......
# Test
We trained a Chinese-English mixed FastSpeech2 model. The training code is still being sorted out; for now, here is how to use the model.
The sample rate of the synthesized audio is 22050 Hz.
## Download pretrained models
Put pretrained models in a directory named `models`.
- [fastspeech2_csmscljspeech_add-zhen.zip](https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_csmscljspeech_add-zhen.zip)
- [hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip)
```bash
mkdir models
cd models
wget https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_csmscljspeech_add-zhen.zip
unzip fastspeech2_csmscljspeech_add-zhen.zip
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip
unzip hifigan_ljspeech_ckpt_0.2.0.zip
cd ../
```
## Test
You can choose `--spk_id` {0, 1} in `local/synthesize_e2e.sh`.
```bash
bash test.sh
```
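For reference, the same checkpoints can also be driven through the Python API. This is a minimal sketch, assuming `fastspeech2_mix` and `hifigan_ljspeech` are registered in your PaddleSpeech build; the `am`, `voc`, `lang`, and `spk_id` arguments mirror the CLI flags used in `local/synthesize_e2e.sh`.
```python
from paddlespeech.cli.tts.infer import TTSExecutor

tts = TTSExecutor()
tts(
    text="我们要去云南 team building, 非常非常 happy.",  # mixed zh-en input
    am="fastspeech2_mix",    # mixed-language acoustic model
    voc="hifigan_ljspeech",  # the mix am must use ljspeech's vocoder for now
    lang="mix",
    spk_id=0,                # choose speaker 0 or 1
    output="output_mix.wav",
)
```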
#!/bin/bash
model_dir=$1
output=$2
am_name=fastspeech2_csmscljspeech_add-zhen
am_model_dir=${model_dir}/${am_name}/
stage=1
stop_stage=1
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_mix \
--am_config=${am_model_dir}/default.yaml \
--am_ckpt=${am_model_dir}/snapshot_iter_94000.pdz \
--am_stat=${am_model_dir}/speech_stats.npy \
--voc=hifigan_ljspeech \
--voc_config=${model_dir}/hifigan_ljspeech_ckpt_0.2.0/default.yaml \
--voc_ckpt=${model_dir}/hifigan_ljspeech_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=${model_dir}/hifigan_ljspeech_ckpt_0.2.0/feats_stats.npy \
--lang=mix \
--text=${BIN_DIR}/../sentences_mix.txt \
--output_dir=${output}/test_e2e \
--phones_dict=${am_model_dir}/phone_id_map.txt \
--speaker_dict=${am_model_dir}/speaker_id_map.txt \
--spk_id 0
fi
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=fastspeech2
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}
#!/bin/bash
set -e
source path.sh
gpus=0,1
stage=3
stop_stage=100
model_dir=models
output_dir=output
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# synthesize_e2e, vocoder is hifigan by default
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${model_dir} ${output_dir} || exit -1
fi
......@@ -29,8 +29,7 @@ from yacs.config import CfgNode
from ..executor import BaseExecutor
from ..log import logger
from ..utils import stats_wrapper
from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.exps.syn_utils import get_frontend
from paddlespeech.t2s.modules.normalizer import ZScore
__all__ = ['TTSExecutor']
......@@ -54,6 +53,7 @@ class TTSExecutor(BaseExecutor):
'fastspeech2_ljspeech',
'fastspeech2_aishell3',
'fastspeech2_vctk',
'fastspeech2_mix',
'tacotron2_csmsc',
'tacotron2_ljspeech',
],
......@@ -98,7 +98,7 @@ class TTSExecutor(BaseExecutor):
self.parser.add_argument(
'--voc',
type=str,
default='pwgan_csmsc',
default='hifigan_csmsc',
choices=[
'pwgan_csmsc',
'pwgan_ljspeech',
......@@ -135,7 +135,7 @@ class TTSExecutor(BaseExecutor):
'--lang',
type=str,
default='zh',
help='Choose model language. zh or en')
help='Choose model language. zh or en or mix')
self.parser.add_argument(
'--device',
type=str,
......@@ -231,8 +231,11 @@ class TTSExecutor(BaseExecutor):
use_pretrained_voc = True
else:
use_pretrained_voc = False
voc_tag = voc + '-' + lang
voc_lang = lang
# for now, the mix am must use ljspeech's vocoder!
if lang == 'mix':
voc_lang = 'en'
voc_tag = voc + '-' + voc_lang
self.task_resource.set_task_model(
model_tag=voc_tag,
model_type=1, # vocoder
......@@ -281,13 +284,8 @@ class TTSExecutor(BaseExecutor):
spk_num = len(spk_id)
# frontend
if lang == 'zh':
self.frontend = Frontend(
phone_vocab_path=self.phones_dict,
tone_vocab_path=self.tones_dict)
elif lang == 'en':
self.frontend = English(phone_vocab_path=self.phones_dict)
self.frontend = get_frontend(
lang=lang, phones_dict=self.phones_dict, tones_dict=self.tones_dict)
# acoustic model
odim = self.am_config.n_mels
......@@ -381,8 +379,12 @@ class TTSExecutor(BaseExecutor):
input_ids = self.frontend.get_input_ids(
text, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"]
elif lang == 'mix':
input_ids = self.frontend.get_input_ids(
text, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"]
else:
logger.error("lang should in {'zh', 'en'}!")
logger.error("lang should in {'zh', 'en', 'mix'}!")
self.frontend_time = time.time() - frontend_st
self.am_time = 0
......@@ -398,7 +400,7 @@ class TTSExecutor(BaseExecutor):
# fastspeech2
else:
# multi speaker
if am_dataset in {"aishell3", "vctk"}:
if am_dataset in {'aishell3', 'vctk', 'mix'}:
mel = self.am_inference(
part_phone_ids, spk_id=paddle.to_tensor(spk_id))
else:
......
......@@ -655,6 +655,24 @@ tts_dynamic_pretrained_models = {
'phone_id_map.txt',
},
},
"fastspeech2_mix-mix": {
'1.0': {
'url':
'https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_csmscljspeech_add-zhen.zip',
'md5':
'77d9d4b5a79ed6203339ead7ef6c74f9',
'config':
'default.yaml',
'ckpt':
'snapshot_iter_94000.pdz',
'speech_stats':
'speech_stats.npy',
'phones_dict':
'phone_id_map.txt',
'speaker_dict':
'speaker_id_map.txt',
},
},
# tacotron2
"tacotron2_csmsc-zh": {
'1.0': {
......
......@@ -630,13 +630,11 @@ class U2BaseModel(ASRInterface, nn.Layer):
(elayers, head, cache_t1, d_k * 2), where
`head * d_k == hidden-dim` and
`cache_t1 == chunk_size * num_decoding_left_chunks`.
`d_k * 2` for att key & value. Default is 0-dims Tensor,
it is used for dy2st.
`d_k * 2` for att key & value.
cnn_cache (paddle.Tensor): cache tensor for cnn_module in conformer,
(elayers, b=1, hidden-dim, cache_t2), where
`cache_t2 == cnn.lorder - 1`. Default is 0-dims Tensor,
it is used for dy2st.
`cache_t2 == cnn.lorder - 1`.
Returns:
paddle.Tensor: output of current input xs,
with shape (b=1, chunk_size, hidden-dim).
......
......@@ -76,9 +76,9 @@ class TransformerEncoderLayer(nn.Layer):
x: paddle.Tensor,
mask: paddle.Tensor,
pos_emb: paddle.Tensor,
mask_pad: paddle.Tensor= paddle.ones([0,0,0], dtype=paddle.bool),
att_cache: paddle.Tensor=paddle.zeros([0,0,0,0]),
cnn_cache: paddle.Tensor=paddle.zeros([0,0,0,0]),
mask_pad: paddle.Tensor=paddle.ones([0, 0, 0], dtype=paddle.bool),
att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Compute encoded features.
Args:
......@@ -105,9 +105,7 @@ class TransformerEncoderLayer(nn.Layer):
if self.normalize_before:
x = self.norm1(x)
x_att, new_att_cache = self.self_attn(
x, x, x, mask, cache=att_cache
)
x_att, new_att_cache = self.self_attn(x, x, x, mask, cache=att_cache)
if self.concat_after:
x_concat = paddle.concat((x, x_att), axis=-1)
......@@ -124,7 +122,7 @@ class TransformerEncoderLayer(nn.Layer):
if not self.normalize_before:
x = self.norm2(x)
fake_cnn_cache = paddle.zeros([0,0,0], dtype=x.dtype)
fake_cnn_cache = paddle.zeros([0, 0, 0], dtype=x.dtype)
return x, mask, new_att_cache, fake_cnn_cache
......@@ -195,9 +193,9 @@ class ConformerEncoderLayer(nn.Layer):
x: paddle.Tensor,
mask: paddle.Tensor,
pos_emb: paddle.Tensor,
mask_pad: paddle.Tensor= paddle.ones([0,0,0], dtype=paddle.bool),
att_cache: paddle.Tensor=paddle.zeros([0,0,0,0]),
cnn_cache: paddle.Tensor=paddle.zeros([0,0,0,0]),
mask_pad: paddle.Tensor=paddle.ones([0, 0, 0], dtype=paddle.bool),
att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Compute encoded features.
Args:
......@@ -211,7 +209,8 @@ class ConformerEncoderLayer(nn.Layer):
att_cache (paddle.Tensor): Cache tensor of the KEY & VALUE
(#batch=1, head, cache_t1, d_k * 2), head * d_k == size.
cnn_cache (paddle.Tensor): Convolution cache in conformer layer
(#batch=1, size, cache_t2)
(1, #batch=1, size, cache_t2). First dim will not be used, just
for dy2st.
Returns:
paddle.Tensor: Output tensor (#batch, time, size).
paddle.Tensor: Mask tensor (#batch, time, time).
......@@ -219,6 +218,8 @@ class ConformerEncoderLayer(nn.Layer):
(#batch=1, head, cache_t1 + time, d_k * 2).
paddle.Tensor: cnn_cache tensor (#batch, size, cache_t2).
"""
# (1, #batch=1, size, cache_t2) -> (#batch=1, size, cache_t2)
cnn_cache = paddle.squeeze(cnn_cache, axis=0)
# whether to use macaron style FFN
if self.feed_forward_macaron is not None:
......@@ -249,8 +250,7 @@ class ConformerEncoderLayer(nn.Layer):
# convolution module
# Fake new cnn cache here, and then change it in conv_module
new_cnn_cache = paddle.zeros([0,0,0], dtype=x.dtype)
cnn_cache = paddle.squeeze(cnn_cache, axis=0)
new_cnn_cache = paddle.zeros([0, 0, 0], dtype=x.dtype)
if self.conv_module is not None:
residual = x
if self.normalize_before:
......@@ -275,4 +275,4 @@ class ConformerEncoderLayer(nn.Layer):
if self.conv_module is not None:
x = self.norm_final(x)
return x, mask, new_att_cache, new_cnn_cache
\ No newline at end of file
return x, mask, new_att_cache, new_cnn_cache
......@@ -60,7 +60,10 @@ def warm_up(engine_and_type: str, warm_up_time: int=3) -> bool:
else:
st = time.time()
connection_handler.infer(text=sentence)
connection_handler.infer(
text=sentence,
lang=tts_engine.lang,
am=tts_engine.config.am)
et = time.time()
logger.debug(
f"The response time of the {i} warm up: {et - st} s")
......
......@@ -28,6 +28,150 @@ from paddlespeech.t2s.modules.nets_utils import phones_masking
from paddlespeech.t2s.modules.nets_utils import phones_text_masking
# Parameters need to be passed in, so an extra builder function is required.
def build_erniesat_collate_fn(mlm_prob: float=0.8,
mean_phn_span: int=8,
seg_emb: bool=False,
text_masking: bool=False):
return ErnieSATCollateFn(
mlm_prob=mlm_prob,
mean_phn_span=mean_phn_span,
seg_emb=seg_emb,
text_masking=text_masking)
class ErnieSATCollateFn:
"""Functor class of common_collate_fn()"""
def __init__(self,
mlm_prob: float=0.8,
mean_phn_span: int=8,
seg_emb: bool=False,
text_masking: bool=False):
self.mlm_prob = mlm_prob
self.mean_phn_span = mean_phn_span
self.seg_emb = seg_emb
self.text_masking = text_masking
def __call__(self, examples):
return erniesat_batch_fn(
examples,
mlm_prob=self.mlm_prob,
mean_phn_span=self.mean_phn_span,
seg_emb=self.seg_emb,
text_masking=self.text_masking)
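# Hedged usage sketch (illustrative; `train_dataset` is a placeholder name):
# the functor built by build_erniesat_collate_fn carries mlm_prob and
# mean_phn_span into erniesat_batch_fn via a paddle.io.DataLoader, e.g.
#
#   from paddle.io import DataLoader
#   collate_fn = build_erniesat_collate_fn(mlm_prob=0.8, mean_phn_span=8)
#   loader = DataLoader(train_dataset, batch_size=20, collate_fn=collate_fn)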
def erniesat_batch_fn(examples,
mlm_prob: float=0.8,
mean_phn_span: int=8,
seg_emb: bool=False,
text_masking: bool=False):
# fields = ["text", "text_lengths", "speech", "speech_lengths", "align_start", "align_end"]
text = [np.array(item["text"], dtype=np.int64) for item in examples]
speech = [np.array(item["speech"], dtype=np.float32) for item in examples]
text_lengths = [
np.array(item["text_lengths"], dtype=np.int64) for item in examples
]
speech_lengths = [
np.array(item["speech_lengths"], dtype=np.int64) for item in examples
]
align_start = [
np.array(item["align_start"], dtype=np.int64) for item in examples
]
align_end = [
np.array(item["align_end"], dtype=np.int64) for item in examples
]
align_start_lengths = [
np.array(len(item["align_start"]), dtype=np.int64) for item in examples
]
# add_pad
text = batch_sequences(text)
speech = batch_sequences(speech)
align_start = batch_sequences(align_start)
align_end = batch_sequences(align_end)
# convert each batch to paddle.Tensor
text = paddle.to_tensor(text)
speech = paddle.to_tensor(speech)
text_lengths = paddle.to_tensor(text_lengths)
speech_lengths = paddle.to_tensor(speech_lengths)
align_start_lengths = paddle.to_tensor(align_start_lengths)
speech_pad = speech
text_pad = text
text_mask = make_non_pad_mask(
text_lengths, text_pad, length_dim=1).unsqueeze(-2)
speech_mask = make_non_pad_mask(
speech_lengths, speech_pad[:, :, 0], length_dim=1).unsqueeze(-2)
# for training
span_bdy = None
# for inference
if 'span_bdy' in examples[0].keys():
span_bdy = [
np.array(item["span_bdy"], dtype=np.int64) for item in examples
]
span_bdy = paddle.to_tensor(span_bdy)
# dual_mask: for mixed Chinese-English, speech and text are masked at the same time
# ERNIE-SAT masks both when doing cross-lingual synthesis
if text_masking:
masked_pos, text_masked_pos = phones_text_masking(
xs_pad=speech_pad,
src_mask=speech_mask,
text_pad=text_pad,
text_mask=text_mask,
align_start=align_start,
align_end=align_end,
align_start_lens=align_start_lengths,
mlm_prob=mlm_prob,
mean_phn_span=mean_phn_span,
span_bdy=span_bdy)
# For training pure-Chinese or pure-English models -> A3T does not mask phonemes, only speech
# The main difference between A3T and ERNIE-SAT lies in how masking is done
else:
masked_pos = phones_masking(
xs_pad=speech_pad,
src_mask=speech_mask,
align_start=align_start,
align_end=align_end,
align_start_lens=align_start_lengths,
mlm_prob=mlm_prob,
mean_phn_span=mean_phn_span,
span_bdy=span_bdy)
text_masked_pos = paddle.zeros(paddle.shape(text_pad))
speech_seg_pos, text_seg_pos = get_seg_pos(
speech_pad=speech_pad,
text_pad=text_pad,
align_start=align_start,
align_end=align_end,
align_start_lens=align_start_lengths,
seg_emb=seg_emb)
batch = {
"text": text,
"speech": speech,
# need to generate
"masked_pos": masked_pos,
"speech_mask": speech_mask,
"text_mask": text_mask,
"speech_seg_pos": speech_seg_pos,
"text_seg_pos": text_seg_pos,
"text_masked_pos": text_masked_pos
}
return batch
def tacotron2_single_spk_batch_fn(examples):
# fields = ["text", "text_lengths", "speech", "speech_lengths"]
text = [np.array(item["text"], dtype=np.int64) for item in examples]
......@@ -378,7 +522,6 @@ class MLMCollateFn:
mean_phn_span=self.mean_phn_span,
seg_emb=self.seg_emb,
text_masking=self.text_masking,
attention_window=self.attention_window,
not_sequence=self.not_sequence)
......@@ -389,7 +532,6 @@ def mlm_collate_fn(
mean_phn_span: int=8,
seg_emb: bool=False,
text_masking: bool=False,
attention_window: int=0,
pad_value: int=0,
not_sequence: Collection[str]=(),
) -> Tuple[List[str], Dict[str, paddle.Tensor]]:
......@@ -420,6 +562,7 @@ def mlm_collate_fn(
feats = feats_extract.get_log_mel_fbank(np.array(output["speech"][0]))
feats = paddle.to_tensor(feats)
print("feats.shape:", feats.shape)
feats_lens = paddle.shape(feats)[0]
feats = paddle.unsqueeze(feats, 0)
......@@ -439,6 +582,7 @@ def mlm_collate_fn(
text_lens, text_pad, length_dim=1).unsqueeze(-2)
speech_mask = make_non_pad_mask(
feats_lens, speech_pad[:, :, 0], length_dim=1).unsqueeze(-2)
span_bdy = None
if 'span_bdy' in output.keys():
span_bdy = output['span_bdy']
......
001 你好,欢迎使用 Paddle Speech 中英文混合 T T S 功能,开始你的合成之旅吧!
002 我们的声学模型使用了 Fast Speech Two, 声码器使用了 Parallel Wave GAN and Hifi GAN.
003 Paddle N L P 发布 ERNIE Tiny 全系列中文预训练小模型,快速提升预训练模型部署效率,通用信息抽取技术 U I E Tiny 系列模型全新升级,支持速度更快效果更好的 U I E 小模型。
004 Paddle Speech 发布 P P A S R 流式语音识别系统、P P T T S 流式语音合成系统、P P V P R 全链路声纹识别系统。
005 Paddle Bo Bo: 使用 Paddle Speech 的语音合成模块生成虚拟人的声音。
006 热烈欢迎您在 Discussions 中提交问题,并在 Issues 中指出发现的 bug。此外,我们非常希望您参与到 Paddle Speech 的开发中!
007 我喜欢 eat apple, 你喜欢 drink milk。
008 我们要去云南 team building, 非常非常 happy.
\ No newline at end of file
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -11,4 +11,6 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .ernie_sat import *
from .ernie_sat_updater import *
from .mlm import *
......@@ -274,9 +274,7 @@ class FastSpeech2(nn.Layer):
super().__init__()
# store hyperparameters
self.idim = idim
self.odim = odim
self.eos = idim - 1
self.reduction_factor = reduction_factor
self.encoder_type = encoder_type
self.decoder_type = decoder_type
......
......@@ -13,7 +13,8 @@
#include "decoder_utils.h"
using namespace lm::ngram;
// If your platform is Windows, you need to add this define.
#define F_OK 0
Scorer::Scorer(double alpha,
double beta,
const std::string& lm_path,
......