Add simultaneous translation models (#1626)

# transformer_nist_wait_1
|Model Name|transformer_nist_wait_1|
| :--- | :---: |
|Category|Simultaneous Translation|
|Network|transformer|
|Dataset|NIST 2008 Chinese-English translation dataset|
|Fine-tuning supported|No|
|Module Size|377MB|
|Latest update date|2021-09-17|
|Data indicators|-|
## I. Basic Information
- ### Module Introduction
- Simultaneous translation means translating before the source sentence is complete; its goal is to automate simultaneous interpretation, producing the translation in parallel with the incoming source language with a latency of only a few seconds.
STACL is a translation architecture for simultaneous translation proposed in the paper [STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework](https://www.aclweb.org/anthology/P19-1289/) and is applicable to all simultaneous-translation scenarios.
- The main advantages of STACL are:
- The prefix-to-prefix architecture has anticipation ability: it can emit a target word before the corresponding source word has been seen, overcoming word-order differences such as SOV→SVO
<p align="center">
<img src="https://user-images.githubusercontent.com/40840292/133761990-13e55d0f-5c3a-476c-8865-5808d13cba97.png"> <br />
</p>
The main difference from a conventional machine translation model is whether the full source sentence is required at translation time. In the figure above, the Seq2Seq model must wait until the entire source sentence (1-5) has been fed into the Encoder before the Decoder starts decoding, whereas the STACL architecture adopts a wait-k policy (wait-2 in the figure): as soon as only two source words (1 and 2) have been fed into the Encoder, the Decoder can start decoding and predict the first word of the target sentence.
- The wait-k policy can predict the target sentence without the full source sentence, enabling an arbitrary word-level latency while maintaining high translation quality.
<p align="center">
<img src="https://user-images.githubusercontent.com/40840292/133762098-6ea6f3ca-0d70-4a0a-981d-0fcc6f3cd96b.png"> <br />
</p>
The wait-k policy first waits for the first k source words and then translates concurrently with the rest of the source sentence, so the output always lags k words behind the input. It is inspired by human simultaneous interpreters, who usually start translating a few seconds into the speaker's speech and finish a few seconds after the speaker ends. For example, with k=2 the first target word is predicted from the first 2 source words, the second target word from the first 3 source words, and so on. In the figure above, (a) simultaneous: our wait-2 starts decoding and predicts "pres." as soon as "布什" and "总统" have been read, while (b) non-simultaneous baseline is a conventional translation model that only starts decoding after the whole sentence "布什 总统 在 莫斯科 与 普京 会晤" has been read. A minimal sketch at the end of this section illustrates this prefix schedule.
- This PaddleHub Module is based on the transformer network and uses the wait-1 policy for Chinese-to-English translation.
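- The following minimal sketch (illustrative only, not part of this module's API) shows how many source words a wait-k model may read before emitting each target word; with k=1, as used by this module, the first target word can be emitted after a single source word has been read.
- ```python
def waitk_source_prefix_lengths(k, src_len, tgt_len):
    """A wait-k model may read at most min(k + t, src_len) source tokens
    before emitting the t-th (0-based) target token."""
    return [min(k + t, src_len) for t in range(tgt_len)]

print(waitk_source_prefix_lengths(1, 7, 7))  # [1, 2, 3, 4, 5, 6, 7]
print(waitk_source_prefix_lengths(2, 7, 7))  # [2, 3, 4, 5, 6, 7, 7]
```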
## II. Installation
- ### 1. Environment Dependencies
- paddlepaddle >= 2.1.0
- paddlehub >= 2.1.0 | [How to install PaddleHub](../../../../../docs/docs_ch/get_start/installation.rst)
- ### 2. Installation
- ```shell
$ hub install transformer_nist_wait_1
```
- If you have trouble installing, please refer to: [Windows quickstart](../../../../../docs/docs_ch/get_start/windows_quickstart.md)
| [Linux quickstart](../../../../../docs/docs_ch/get_start/linux_quickstart.md) | [MacOS quickstart](../../../../../docs/docs_ch/get_start/mac_quickstart.md)
## III. Module API Prediction
- ### 1. Prediction Code Example
- ```python
import paddlehub as hub
model = hub.Module(name="transformer_nist_wait_1")
# Input data (simulating streaming input for simultaneous interpretation)
text = [
"他",
"他还",
"他还说",
"他还说现在",
"他还说现在正在",
"他还说现在正在为",
"他还说现在正在为这",
"他还说现在正在为这一",
"他还说现在正在为这一会议",
"他还说现在正在为这一会议作出",
"他还说现在正在为这一会议作出安排",
"他还说现在正在为这一会议作出安排。",
]
for t in text:
print("input: {}".format(t))
result = model.translate(t)
print("model output: {}\n".format(result))
# input: 他
# model output: he
#
# input: 他还
# model output: he also
#
# input: 他还说
# model output: he also said
#
# input: 他还说现在
# model output: he also said that
#
# input: 他还说现在正在
# model output: he also said that he
#
# input: 他还说现在正在为
# model output: he also said that he is
#
# input: 他还说现在正在为这
# model output: he also said that he is making
#
# input: 他还说现在正在为这一
# model output: he also said that he is making preparations
#
# input: 他还说现在正在为这一会议
# model output: he also said that he is making preparations for
#
# input: 他还说现在正在为这一会议作出
# model output: he also said that he is making preparations for this
#
# input: 他还说现在正在为这一会议作出安排
# model output: he also said that he is making preparations for this meeting
#
# input: 他还说现在正在为这一会议作出安排。
# model output: he also said that he is making preparations for this meeting .
```
- ### 2. API
- ```python
__init__(max_length=256, max_out_len=256)
```
- Initializes the module; the maximum length of the input text can be configured.
- **Parameters**
- max_length(int): the maximum length of the input text, default 256.
- max_out_len(int): the maximum decoding length of the output text; content beyond this length is truncated, default 256.
- ```python
translate(text, use_gpu=False)
```
- Prediction API: takes source-language text (simulating streaming speech input in simultaneous interpretation) and returns the decoded target-language translation.
- **Parameters**
- text(str): the source-language input text, of type str
- use_gpu(bool): whether to use the GPU for prediction, default False
- **Return**
- result(str): the translated target-language text.
## IV. Server Deployment
- PaddleHub Serving can deploy an online simultaneous translation service, and the API can be used in online web applications.
- ### Step 1: Start the PaddleHub Serving service
- Run the start command:
- ```shell
$ hub serving start -m transformer_nist_wait_1
```
- The model loading process is shown during startup; once the service has started successfully, it prints:
- ```shell
Loading transformer_nist_wait_1 successful.
```
- This completes the deployment of the serving API; the default port is 8866.
- **NOTE:** To run prediction on GPU, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise it does not need to be set.
- ### Step 2: Send a prediction request
- With the server configured, the few lines of code below send a prediction request and fetch the result
- ```python
import requests
import json
# Input data (simulating streaming input for simultaneous interpretation)
text = [
"他",
"他还",
"他还说",
"他还说现在",
"他还说现在正在",
"他还说现在正在为",
"他还说现在正在为这",
"他还说现在正在为这一",
"他还说现在正在为这一会议",
"他还说现在正在为这一会议作出",
"他还说现在正在为这一会议作出安排",
"他还说现在正在为这一会议作出安排。",
]
# Specify the prediction method as transformer_nist_wait_1 and send a POST request; the content-type should be json
# HOST_IP is the server's IP address
url = "http://HOST_IP:8866/predict/transformer_nist_wait_1"
headers = {"Content-Type": "application/json"}
for t in text:
    print("input: {}".format(t))
    r = requests.post(url=url, headers=headers, data=json.dumps(t))
    # Print the prediction result returned by the service
    print("model output: {}\n".format(r.json()))
```
- For more information on PaddleHub Serving, please refer to: [Serving Deployment](../../../../../docs/docs_ch/tutorial/serving.md)
## V. Release Note
* 1.0.0
  First release
```shell
hub install transformer_nist_wait_1==1.0.0
```
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddlenlp.transformers import WordEmbedding, PositionalEmbedding
class DecoderLayer(nn.TransformerDecoderLayer):
def __init__(self, *args, **kwargs):
super(DecoderLayer, self).__init__(*args, **kwargs)
def forward(self, tgt, memory, tgt_mask=None, memory_mask=None, cache=None):
residual = tgt
if self.normalize_before:
tgt = self.norm1(tgt)
if cache is None:
tgt = self.self_attn(tgt, tgt, tgt, tgt_mask, None)
else:
tgt, incremental_cache = self.self_attn(tgt, tgt, tgt, tgt_mask,
cache[0])
tgt = residual + self.dropout1(tgt)
if not self.normalize_before:
tgt = self.norm1(tgt)
residual = tgt
if self.normalize_before:
tgt = self.norm2(tgt)
if len(memory) == 1:
# Full sent
tgt = self.cross_attn(tgt, memory[0], memory[0], memory_mask, None)
else:
# Wait-k policy
cross_attn_outputs = []
for i in range(tgt.shape[1]):
q = tgt[:, i:i + 1, :]
if i >= len(memory):
e = memory[-1]
else:
e = memory[i]
cross_attn_outputs.append(
self.cross_attn(q, e, e, memory_mask[:, :, i:i + 1, :
e.shape[1]], None))
tgt = paddle.concat(cross_attn_outputs, axis=1)
tgt = residual + self.dropout2(tgt)
if not self.normalize_before:
tgt = self.norm2(tgt)
residual = tgt
if self.normalize_before:
tgt = self.norm3(tgt)
tgt = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
tgt = residual + self.dropout3(tgt)
if not self.normalize_before:
tgt = self.norm3(tgt)
return tgt if cache is None else (tgt, (incremental_cache, ))
class Decoder(nn.TransformerDecoder):
"""
    PaddlePaddle 2.1 casts memory_mask.dtype to memory.dtype, but in STACL
    memory is a list of encoder outputs (one per source prefix) and has no
    dtype attribute, so forward is overridden here.
"""
def forward(self, tgt, memory, tgt_mask=None, memory_mask=None, cache=None):
output = tgt
new_caches = []
for i, mod in enumerate(self.layers):
if cache is None:
output = mod(output,
memory,
tgt_mask=tgt_mask,
memory_mask=memory_mask,
cache=None)
else:
output, new_cache = mod(output,
memory,
tgt_mask=tgt_mask,
memory_mask=memory_mask,
cache=cache[i])
new_caches.append(new_cache)
if self.norm is not None:
output = self.norm(output)
return output if cache is None else (output, new_caches)
class SimultaneousTransformer(nn.Layer):
"""
    Simultaneous Transformer with the wait-k policy.
"""
def __init__(self,
src_vocab_size,
trg_vocab_size,
max_length=256,
n_layer=6,
n_head=8,
d_model=512,
d_inner_hid=2048,
dropout=0.1,
weight_sharing=False,
bos_id=0,
eos_id=1,
waitk=-1):
super(SimultaneousTransformer, self).__init__()
self.trg_vocab_size = trg_vocab_size
self.emb_dim = d_model
self.bos_id = bos_id
self.eos_id = eos_id
self.dropout = dropout
self.waitk = waitk
self.n_layer = n_layer
self.n_head = n_head
self.d_model = d_model
self.src_word_embedding = WordEmbedding(
vocab_size=src_vocab_size, emb_dim=d_model, bos_id=self.bos_id)
self.src_pos_embedding = PositionalEmbedding(
emb_dim=d_model, max_length=max_length+1)
if weight_sharing:
assert src_vocab_size == trg_vocab_size, (
"Vocabularies in source and target should be same for weight sharing."
)
self.trg_word_embedding = self.src_word_embedding
self.trg_pos_embedding = self.src_pos_embedding
else:
self.trg_word_embedding = WordEmbedding(
vocab_size=trg_vocab_size, emb_dim=d_model, bos_id=self.bos_id)
self.trg_pos_embedding = PositionalEmbedding(
emb_dim=d_model, max_length=max_length+1)
encoder_layer = nn.TransformerEncoderLayer(
d_model=d_model,
nhead=n_head,
dim_feedforward=d_inner_hid,
dropout=dropout,
activation='relu',
normalize_before=True,
bias_attr=[False, True])
encoder_norm = nn.LayerNorm(d_model)
self.encoder = nn.TransformerEncoder(
encoder_layer=encoder_layer, num_layers=n_layer, norm=encoder_norm)
decoder_layer = DecoderLayer(
d_model=d_model,
nhead=n_head,
dim_feedforward=d_inner_hid,
dropout=dropout,
activation='relu',
normalize_before=True,
bias_attr=[False, False, True])
decoder_norm = nn.LayerNorm(d_model)
self.decoder = Decoder(
decoder_layer=decoder_layer, num_layers=n_layer, norm=decoder_norm)
if weight_sharing:
self.linear = lambda x: paddle.matmul(
x=x, y=self.trg_word_embedding.word_embedding.weight, transpose_y=True)
else:
self.linear = nn.Linear(
in_features=d_model,
out_features=trg_vocab_size,
bias_attr=False)
def forward(self, src_word, trg_word):
src_max_len = paddle.shape(src_word)[-1]
trg_max_len = paddle.shape(trg_word)[-1]
base_attn_bias = paddle.cast(
src_word == self.bos_id,
dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e9
src_slf_attn_bias = base_attn_bias
src_slf_attn_bias.stop_gradient = True
trg_slf_attn_bias = paddle.tensor.triu(
(paddle.ones(
(trg_max_len, trg_max_len),
dtype=paddle.get_default_dtype()) * -np.inf),
1)
trg_slf_attn_bias.stop_gradient = True
trg_src_attn_bias = paddle.tile(base_attn_bias, [1, 1, trg_max_len, 1])
src_pos = paddle.cast(
src_word != self.bos_id, dtype="int64") * paddle.arange(
start=0, end=src_max_len)
trg_pos = paddle.cast(
trg_word != self.bos_id, dtype="int64") * paddle.arange(
start=0, end=trg_max_len)
src_emb = self.src_word_embedding(src_word)
src_pos_emb = self.src_pos_embedding(src_pos)
src_emb = src_emb + src_pos_emb
enc_input = F.dropout(
src_emb, p=self.dropout,
training=self.training) if self.dropout else src_emb
with paddle.static.amp.fp16_guard():
if self.waitk >= src_max_len or self.waitk == -1:
# Full sentence
enc_outputs = [
self.encoder(
enc_input, src_mask=src_slf_attn_bias)
]
else:
# Wait-k policy
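                # enc_outputs[j] holds the encoding of the first (waitk + j) source
                # tokens; the decoder's j-th step cross-attends to exactly that prefix.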
enc_outputs = []
for i in range(self.waitk, src_max_len + 1):
enc_output = self.encoder(
enc_input[:, :i, :],
src_mask=src_slf_attn_bias[:, :, :, :i])
enc_outputs.append(enc_output)
trg_emb = self.trg_word_embedding(trg_word)
trg_pos_emb = self.trg_pos_embedding(trg_pos)
trg_emb = trg_emb + trg_pos_emb
dec_input = F.dropout(
trg_emb, p=self.dropout,
training=self.training) if self.dropout else trg_emb
dec_output = self.decoder(
dec_input,
enc_outputs,
tgt_mask=trg_slf_attn_bias,
memory_mask=trg_src_attn_bias)
predict = self.linear(dec_output)
return predict
def beam_search(self, src_word, beam_size=4, max_len=256, waitk=-1):
# TODO: "Speculative Beam Search for Simultaneous Translation"
raise NotImplementedError
def greedy_search(self,
src_word,
max_len=256,
waitk=-1,
caches=None,
bos_id=None):
"""
        greedy_search uses a streaming reader. The encoder does not have to be
        run repeatedly: each incoming sub-sentence only needs one additional
        encoder call. It therefore takes the previous decoder states (caches)
        and the id of the last token generated in the previous step.
"""
src_max_len = paddle.shape(src_word)[-1]
base_attn_bias = paddle.cast(
src_word == self.bos_id,
dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e9
src_slf_attn_bias = base_attn_bias
src_slf_attn_bias.stop_gradient = True
trg_src_attn_bias = paddle.tile(base_attn_bias, [1, 1, 1, 1])
src_pos = paddle.cast(
src_word != self.bos_id, dtype="int64") * paddle.arange(
start=0, end=src_max_len)
src_emb = self.src_word_embedding(src_word)
src_pos_emb = self.src_pos_embedding(src_pos)
src_emb = src_emb + src_pos_emb
enc_input = F.dropout(
src_emb, p=self.dropout,
training=self.training) if self.dropout else src_emb
enc_outputs = [self.encoder(enc_input, src_mask=src_slf_attn_bias)]
# constant number
batch_size = enc_outputs[-1].shape[0]
max_len = (
enc_outputs[-1].shape[1] + 20) if max_len is None else max_len
end_token_tensor = paddle.full(
shape=[batch_size, 1], fill_value=self.eos_id, dtype="int64")
predict_ids = []
log_probs = paddle.full(
shape=[batch_size, 1], fill_value=0, dtype="float32")
if not bos_id:
trg_word = paddle.full(
shape=[batch_size, 1], fill_value=self.bos_id, dtype="int64")
else:
trg_word = paddle.full(
shape=[batch_size, 1], fill_value=bos_id, dtype="int64")
# init states (caches) for transformer
if not caches:
caches = self.decoder.gen_cache(enc_outputs[-1], do_zip=False)
for i in range(max_len):
trg_pos = paddle.full(
shape=trg_word.shape, fill_value=i, dtype="int64")
trg_emb = self.trg_word_embedding(trg_word)
trg_pos_emb = self.trg_pos_embedding(trg_pos)
trg_emb = trg_emb + trg_pos_emb
dec_input = F.dropout(
trg_emb, p=self.dropout,
training=self.training) if self.dropout else trg_emb
if waitk < 0 or i >= len(enc_outputs):
                # If decoding with the full sentence (waitk < 0), or the decoder step
                # exceeds the number of available source prefixes, attend to the whole
                # source encoding
_e = enc_outputs[-1]
dec_output, caches = self.decoder(
dec_input, [_e], None,
trg_src_attn_bias[:, :, :, :_e.shape[1]], caches)
else:
_e = enc_outputs[i]
dec_output, caches = self.decoder(
dec_input, [_e], None,
trg_src_attn_bias[:, :, :, :_e.shape[1]], caches)
dec_output = paddle.reshape(
dec_output, shape=[-1, dec_output.shape[-1]])
logits = self.linear(dec_output)
step_log_probs = paddle.log(F.softmax(logits, axis=-1))
log_probs = paddle.add(x=step_log_probs, y=log_probs)
scores = log_probs
topk_scores, topk_indices = paddle.topk(x=scores, k=1)
finished = paddle.equal(topk_indices, end_token_tensor)
trg_word = topk_indices
log_probs = topk_scores
predict_ids.append(topk_indices)
if paddle.all(finished).numpy():
break
predict_ids = paddle.stack(predict_ids, axis=0)
finished_seq = paddle.transpose(predict_ids, [1, 2, 0])
finished_scores = topk_scores
return finished_seq, finished_scores, caches
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import jieba
import paddle
from paddlenlp.transformers import position_encoding_init
from paddlenlp.transformers import WordEmbedding, PositionalEmbedding
from paddlehub.env import MODULE_HOME
from paddlehub.module.module import moduleinfo, serving
from transformer_nist_wait_1.model import SimultaneousTransformer
from transformer_nist_wait_1.processor import STACLTokenizer, predict
@moduleinfo(
name="transformer_nist_wait_1",
version="1.0.0",
summary="",
author="PaddlePaddle",
author_email="",
type="nlp/simultaneous_translation",
)
class STTransformer():
"""
Transformer model for simultaneous translation.
"""
# Model config
model_config = {
# Number of head used in multi-head attention.
"n_head": 8,
# Number of sub-layers to be stacked in the encoder and decoder.
"n_layer": 6,
# The dimension for word embeddings, which is also the last dimension of
# the input and output of multi-head attention, position-wise feed-forward
# networks, encoder and decoder.
"d_model": 512,
}
def __init__(self,
max_length=256,
max_out_len=256,
):
super(STTransformer, self).__init__()
bpe_codes_fpath = os.path.join(MODULE_HOME, "transformer_nist_wait_1", "assets", "2M.zh2en.dict4bpe.zh")
src_vocab_fpath = os.path.join(MODULE_HOME, "transformer_nist_wait_1", "assets", "nist.20k.zh.vocab")
trg_vocab_fpath = os.path.join(MODULE_HOME, "transformer_nist_wait_1", "assets", "nist.10k.en.vocab")
params_fpath = os.path.join(MODULE_HOME, "transformer_nist_wait_1", "assets", "transformer.pdparams")
self.max_length = max_length
self.max_out_len = max_out_len
self.tokenizer = STACLTokenizer(
bpe_codes_fpath,
src_vocab_fpath,
trg_vocab_fpath,
)
src_vocab_size = self.tokenizer.src_vocab_size
trg_vocab_size = self.tokenizer.trg_vocab_size
self.transformer = SimultaneousTransformer(
src_vocab_size,
trg_vocab_size,
max_length=self.max_length,
n_layer=self.model_config['n_layer'],
n_head=self.model_config['n_head'],
d_model=self.model_config['d_model'],
)
model_dict = paddle.load(params_fpath)
# To avoid a longer length than training, reset the size of position
# encoding to max_length
model_dict["src_pos_embedding.pos_encoder.weight"] = position_encoding_init(
self.max_length + 1, self.model_config['d_model'])
model_dict["trg_pos_embedding.pos_encoder.weight"] = position_encoding_init(
self.max_length + 1, self.model_config['d_model'])
self.transformer.load_dict(model_dict)
@serving
def translate(self, text, use_gpu=False):
paddle.set_device('gpu') if use_gpu else paddle.set_device('cpu')
# Word segmentation
text = ' '.join(jieba.cut(text))
# For decoding max length
decoder_max_length = 1
# For decoding cache
cache = None
# For decoding start token id
bos_id = None
# Current source word index
i = 0
        # Whether the current source token ends the sentence; if True, decode up to max_out_len
is_last = False
# Tokenized id
user_input_tokenized = []
# Store the translation
result = []
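        # bpe_str: the BPE sub-word tokens after jieba segmentation;
        # tokenized_src: their ids in the source vocabulary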
bpe_str, tokenized_src = self.tokenizer.tokenize(text)
while i < len(tokenized_src):
user_input_tokenized.append(tokenized_src[i])
if bpe_str[i] in ['。', '?', '!']:
is_last = True
result, cache, bos_id = predict(
user_input_tokenized,
decoder_max_length,
is_last,
cache,
bos_id,
result,
self.tokenizer,
self.transformer,
max_out_len=self.max_out_len)
i += 1
return " ".join(result)
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle
from paddlenlp.data import Vocab
from subword_nmt import subword_nmt
class STACLTokenizer:
"""
Jieba+BPE, and convert tokens to ids.
"""
def __init__(self,
bpe_codes_fpath,
src_vocab_fpath,
trg_vocab_fpath,
special_token=["<s>", "<e>", "<unk>"]):
bpe_parser = subword_nmt.create_apply_bpe_parser()
bpe_args = bpe_parser.parse_args(args=['-c', bpe_codes_fpath])
self.bpe = subword_nmt.BPE(bpe_args.codes, bpe_args.merges,
bpe_args.separator, None,
bpe_args.glossaries)
self.src_vocab = Vocab.load_vocabulary(
src_vocab_fpath,
bos_token=special_token[0],
eos_token=special_token[1],
unk_token=special_token[2])
self.trg_vocab = Vocab.load_vocabulary(
trg_vocab_fpath,
bos_token=special_token[0],
eos_token=special_token[1],
unk_token=special_token[2])
self.src_vocab_size = len(self.src_vocab)
self.trg_vocab_size = len(self.trg_vocab)
def tokenize(self, text):
bpe_str = self.bpe.process_line(text)
ids = self.src_vocab.to_indices(bpe_str.split())
return bpe_str.split(), ids
def post_process_seq(seq,
bos_idx=0,
eos_idx=1,
output_bos=False,
output_eos=False):
"""
Post-process the decoded sequence.
"""
eos_pos = len(seq) - 1
for i, idx in enumerate(seq):
if idx == eos_idx:
eos_pos = i
break
seq = [
idx for idx in seq[:eos_pos + 1]
if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)
]
return seq
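# Illustrative example: post_process_seq([5, 7, 1, 9], bos_idx=0, eos_idx=1)
# keeps tokens up to the first eos and drops bos/eos ids, returning [5, 7].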
def predict(tokenized_src,
decoder_max_length,
is_last,
cache,
bos_id,
result,
tokenizer,
transformer,
n_best=1,
max_out_len=256,
eos_idx=1,
waitk=1,
):
# Set evaluate mode
transformer.eval()
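    # Wait-k gate: until at least `waitk` source tokens have arrived, emit nothing
    # and leave the decoder state (cache, bos_id) untouched.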
if len(tokenized_src) < waitk:
return result, cache, bos_id
with paddle.no_grad():
paddle.disable_static()
input_src = tokenized_src
if is_last:
decoder_max_length = max_out_len
input_src += [eos_idx]
src_word = paddle.to_tensor(input_src).unsqueeze(axis=0)
finished_seq, finished_scores, cache = transformer.greedy_search(
src_word,
max_len=decoder_max_length,
waitk=waitk,
caches=cache,
bos_id=bos_id)
finished_seq = finished_seq.numpy()
for beam_idx, beam in enumerate(finished_seq[0]):
if beam_idx >= n_best:
break
id_list = post_process_seq(beam)
if len(id_list) == 0:
continue
bos_id = id_list[-1]
word_list = tokenizer.trg_vocab.to_tokens(id_list)
for word in word_list:
result.append(word)
res = ' '.join(word_list).replace('@@ ', '')
paddle.enable_static()
return result, cache, bos_id
# transformer_nist_wait_3
|Model Name|transformer_nist_wait_3|
| :--- | :---: |
|Category|Simultaneous Translation|
|Network|transformer|
|Dataset|NIST 2008 Chinese-English translation dataset|
|Fine-tuning supported|No|
|Module Size|377MB|
|Latest update date|2021-09-17|
|Data indicators|-|
## I. Basic Information
- ### Module Introduction
- Simultaneous translation means translating before the source sentence is complete; its goal is to automate simultaneous interpretation, producing the translation in parallel with the incoming source language with a latency of only a few seconds.
STACL is a translation architecture for simultaneous translation proposed in the paper [STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework](https://www.aclweb.org/anthology/P19-1289/) and is applicable to all simultaneous-translation scenarios.
- The main advantages of STACL are:
- The prefix-to-prefix architecture has anticipation ability: it can emit a target word before the corresponding source word has been seen, overcoming word-order differences such as SOV→SVO
<p align="center">
<img src="https://user-images.githubusercontent.com/40840292/133761990-13e55d0f-5c3a-476c-8865-5808d13cba97.png"> <br />
</p>
The main difference from a conventional machine translation model is whether the full source sentence is required at translation time. In the figure above, the Seq2Seq model must wait until the entire source sentence (1-5) has been fed into the Encoder before the Decoder starts decoding, whereas the STACL architecture adopts a wait-k policy (wait-2 in the figure): as soon as only two source words (1 and 2) have been fed into the Encoder, the Decoder can start decoding and predict the first word of the target sentence.
- The wait-k policy can predict the target sentence without the full source sentence, enabling an arbitrary word-level latency while maintaining high translation quality.
<p align="center">
<img src="https://user-images.githubusercontent.com/40840292/133762098-6ea6f3ca-0d70-4a0a-981d-0fcc6f3cd96b.png"> <br />
</p>
The wait-k policy first waits for the first k source words and then translates concurrently with the rest of the source sentence, so the output always lags k words behind the input. It is inspired by human simultaneous interpreters, who usually start translating a few seconds into the speaker's speech and finish a few seconds after the speaker ends. For example, with k=2 the first target word is predicted from the first 2 source words, the second target word from the first 3 source words, and so on. In the figure above, (a) simultaneous: our wait-2 starts decoding and predicts "pres." as soon as "布什" and "总统" have been read, while (b) non-simultaneous baseline is a conventional translation model that only starts decoding after the whole sentence "布什 总统 在 莫斯科 与 普京 会晤" has been read. A minimal sketch at the end of this section shows how the number of emitted target words grows as more source words are read.
- This PaddleHub Module is based on the transformer network and uses the wait-3 policy for Chinese-to-English translation.
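- The following minimal sketch (illustrative only, not part of this module's API) shows the maximum number of target words a wait-k model may have emitted after reading a given number of source words; with k=3, the module emits nothing until three source words have been read, which is why the first inputs in the prediction example of Section III return empty output.
- ```python
def waitk_emitted_targets(k, num_src_read):
    """Upper bound on the number of target words a wait-k model has emitted
    after reading num_src_read source words (before the sentence ends)."""
    return max(0, num_src_read - k + 1)

for n in range(1, 8):
    print(n, "source words read ->", waitk_emitted_targets(3, n), "target words emitted")
# 1 -> 0, 2 -> 0, 3 -> 1, 4 -> 2, 5 -> 3, 6 -> 4, 7 -> 5
```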
## II. Installation
- ### 1. Environment Dependencies
- paddlepaddle >= 2.1.0
- paddlehub >= 2.1.0 | [How to install PaddleHub](../../../../docs/docs_ch/get_start/installation.rst)
- ### 2. Installation
- ```shell
$ hub install transformer_nist_wait_3
```
- If you have trouble installing, please refer to: [Windows quickstart](../../../../docs/docs_ch/get_start/windows_quickstart.md)
| [Linux quickstart](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [MacOS quickstart](../../../../docs/docs_ch/get_start/mac_quickstart.md)
## III. Module API Prediction
- ### 1. Prediction Code Example
- ```python
import paddlehub as hub
model = hub.Module(name="transformer_nist_wait_3")
# Input data (simulating streaming input for simultaneous interpretation)
text = [
"他",
"他还",
"他还说",
"他还说现在",
"他还说现在正在",
"他还说现在正在为",
"他还说现在正在为这",
"他还说现在正在为这一",
"他还说现在正在为这一会议",
"他还说现在正在为这一会议作出",
"他还说现在正在为这一会议作出安排",
"他还说现在正在为这一会议作出安排。",
]
for t in text:
print("input: {}".format(t))
result = model.translate(t)
print("model output: {}\n".format(result))
# input: 他
# model output:
#
# input: 他还
# model output:
#
# input: 他还说
# model output: he
#
# input: 他还说现在
# model output: he also
#
# input: 他还说现在正在
# model output: he also said
#
# input: 他还说现在正在为
# model output: he also said that
#
# input: 他还说现在正在为这
# model output: he also said that he
#
# input: 他还说现在正在为这一
# model output: he also said that he is
#
# input: 他还说现在正在为这一会议
# model output: he also said that he is making
#
# input: 他还说现在正在为这一会议作出
# model output: he also said that he is making preparations
#
# input: 他还说现在正在为这一会议作出安排
# model output: he also said that he is making preparations for
#
# input: 他还说现在正在为这一会议作出安排。
# model output: he also said that he is making preparations for this meeting .
```
- ### 2. API
- ```python
__init__(max_length=256, max_out_len=256)
```
- Initializes the module; the maximum length of the input text can be configured.
- **Parameters**
- max_length(int): the maximum length of the input text, default 256.
- max_out_len(int): the maximum decoding length of the output text; content beyond this length is truncated, default 256.
- ```python
translate(text, use_gpu=False)
```
- Prediction API: takes source-language text (simulating streaming speech input in simultaneous interpretation) and returns the decoded target-language translation.
- **Parameters**
- text(str): the source-language input text, of type str
- use_gpu(bool): whether to use the GPU for prediction, default False
- **Return**
- result(str): the translated target-language text.
## IV. Server Deployment
- PaddleHub Serving can deploy an online simultaneous translation service, and the API can be used in online web applications.
- ### Step 1: Start the PaddleHub Serving service
- Run the start command:
- ```shell
$ hub serving start -m transformer_nist_wait_3
```
- The model loading process is shown during startup; once the service has started successfully, it prints:
- ```shell
Loading transformer_nist_wait_3 successful.
```
- This completes the deployment of the serving API; the default port is 8866.
- **NOTE:** To run prediction on GPU, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise it does not need to be set.
- ### Step 2: Send a prediction request
- With the server configured, the few lines of code below send a prediction request and fetch the result
- ```python
import requests
import json
# Input data (simulating streaming input for simultaneous interpretation)
text = [
"他",
"他还",
"他还说",
"他还说现在",
"他还说现在正在",
"他还说现在正在为",
"他还说现在正在为这",
"他还说现在正在为这一",
"他还说现在正在为这一会议",
"他还说现在正在为这一会议作出",
"他还说现在正在为这一会议作出安排",
"他还说现在正在为这一会议作出安排。",
]
# Specify the prediction method as transformer_nist_wait_3 and send a POST request; the content-type should be json
# HOST_IP is the server's IP address
url = "http://HOST_IP:8866/predict/transformer_nist_wait_3"
headers = {"Content-Type": "application/json"}
for t in text:
    print("input: {}".format(t))
    r = requests.post(url=url, headers=headers, data=json.dumps(t))
    # Print the prediction result returned by the service
    print("model output: {}\n".format(r.json()))
```
- For more information on PaddleHub Serving, please refer to: [Serving Deployment](../../../../docs/docs_ch/tutorial/serving.md)
## V. Release Note
* 1.0.0
  First release
```shell
hub install transformer_nist_wait_3==1.0.0
```
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddlenlp.transformers import WordEmbedding, PositionalEmbedding
class DecoderLayer(nn.TransformerDecoderLayer):
def __init__(self, *args, **kwargs):
super(DecoderLayer, self).__init__(*args, **kwargs)
def forward(self, tgt, memory, tgt_mask=None, memory_mask=None, cache=None):
residual = tgt
if self.normalize_before:
tgt = self.norm1(tgt)
if cache is None:
tgt = self.self_attn(tgt, tgt, tgt, tgt_mask, None)
else:
tgt, incremental_cache = self.self_attn(tgt, tgt, tgt, tgt_mask,
cache[0])
tgt = residual + self.dropout1(tgt)
if not self.normalize_before:
tgt = self.norm1(tgt)
residual = tgt
if self.normalize_before:
tgt = self.norm2(tgt)
if len(memory) == 1:
# Full sent
tgt = self.cross_attn(tgt, memory[0], memory[0], memory_mask, None)
else:
# Wait-k policy
cross_attn_outputs = []
for i in range(tgt.shape[1]):
q = tgt[:, i:i + 1, :]
if i >= len(memory):
e = memory[-1]
else:
e = memory[i]
cross_attn_outputs.append(
self.cross_attn(q, e, e, memory_mask[:, :, i:i + 1, :
e.shape[1]], None))
tgt = paddle.concat(cross_attn_outputs, axis=1)
tgt = residual + self.dropout2(tgt)
if not self.normalize_before:
tgt = self.norm2(tgt)
residual = tgt
if self.normalize_before:
tgt = self.norm3(tgt)
tgt = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
tgt = residual + self.dropout3(tgt)
if not self.normalize_before:
tgt = self.norm3(tgt)
return tgt if cache is None else (tgt, (incremental_cache, ))
class Decoder(nn.TransformerDecoder):
"""
    PaddlePaddle 2.1 casts memory_mask.dtype to memory.dtype, but in STACL
    memory is a list of encoder outputs (one per source prefix) and has no
    dtype attribute, so forward is overridden here.
"""
def forward(self, tgt, memory, tgt_mask=None, memory_mask=None, cache=None):
output = tgt
new_caches = []
for i, mod in enumerate(self.layers):
if cache is None:
output = mod(output,
memory,
tgt_mask=tgt_mask,
memory_mask=memory_mask,
cache=None)
else:
output, new_cache = mod(output,
memory,
tgt_mask=tgt_mask,
memory_mask=memory_mask,
cache=cache[i])
new_caches.append(new_cache)
if self.norm is not None:
output = self.norm(output)
return output if cache is None else (output, new_caches)
class SimultaneousTransformer(nn.Layer):
"""
    Simultaneous Transformer with the wait-k policy.
"""
def __init__(self,
src_vocab_size,
trg_vocab_size,
max_length=256,
n_layer=6,
n_head=8,
d_model=512,
d_inner_hid=2048,
dropout=0.1,
weight_sharing=False,
bos_id=0,
eos_id=1,
waitk=-1):
super(SimultaneousTransformer, self).__init__()
self.trg_vocab_size = trg_vocab_size
self.emb_dim = d_model
self.bos_id = bos_id
self.eos_id = eos_id
self.dropout = dropout
self.waitk = waitk
self.n_layer = n_layer
self.n_head = n_head
self.d_model = d_model
self.src_word_embedding = WordEmbedding(
vocab_size=src_vocab_size, emb_dim=d_model, bos_id=self.bos_id)
self.src_pos_embedding = PositionalEmbedding(
emb_dim=d_model, max_length=max_length+1)
if weight_sharing:
assert src_vocab_size == trg_vocab_size, (
"Vocabularies in source and target should be same for weight sharing."
)
self.trg_word_embedding = self.src_word_embedding
self.trg_pos_embedding = self.src_pos_embedding
else:
self.trg_word_embedding = WordEmbedding(
vocab_size=trg_vocab_size, emb_dim=d_model, bos_id=self.bos_id)
self.trg_pos_embedding = PositionalEmbedding(
emb_dim=d_model, max_length=max_length+1)
encoder_layer = nn.TransformerEncoderLayer(
d_model=d_model,
nhead=n_head,
dim_feedforward=d_inner_hid,
dropout=dropout,
activation='relu',
normalize_before=True,
bias_attr=[False, True])
encoder_norm = nn.LayerNorm(d_model)
self.encoder = nn.TransformerEncoder(
encoder_layer=encoder_layer, num_layers=n_layer, norm=encoder_norm)
decoder_layer = DecoderLayer(
d_model=d_model,
nhead=n_head,
dim_feedforward=d_inner_hid,
dropout=dropout,
activation='relu',
normalize_before=True,
bias_attr=[False, False, True])
decoder_norm = nn.LayerNorm(d_model)
self.decoder = Decoder(
decoder_layer=decoder_layer, num_layers=n_layer, norm=decoder_norm)
if weight_sharing:
self.linear = lambda x: paddle.matmul(
x=x, y=self.trg_word_embedding.word_embedding.weight, transpose_y=True)
else:
self.linear = nn.Linear(
in_features=d_model,
out_features=trg_vocab_size,
bias_attr=False)
def forward(self, src_word, trg_word):
src_max_len = paddle.shape(src_word)[-1]
trg_max_len = paddle.shape(trg_word)[-1]
base_attn_bias = paddle.cast(
src_word == self.bos_id,
dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e9
src_slf_attn_bias = base_attn_bias
src_slf_attn_bias.stop_gradient = True
trg_slf_attn_bias = paddle.tensor.triu(
(paddle.ones(
(trg_max_len, trg_max_len),
dtype=paddle.get_default_dtype()) * -np.inf),
1)
trg_slf_attn_bias.stop_gradient = True
trg_src_attn_bias = paddle.tile(base_attn_bias, [1, 1, trg_max_len, 1])
src_pos = paddle.cast(
src_word != self.bos_id, dtype="int64") * paddle.arange(
start=0, end=src_max_len)
trg_pos = paddle.cast(
trg_word != self.bos_id, dtype="int64") * paddle.arange(
start=0, end=trg_max_len)
src_emb = self.src_word_embedding(src_word)
src_pos_emb = self.src_pos_embedding(src_pos)
src_emb = src_emb + src_pos_emb
enc_input = F.dropout(
src_emb, p=self.dropout,
training=self.training) if self.dropout else src_emb
with paddle.static.amp.fp16_guard():
if self.waitk >= src_max_len or self.waitk == -1:
# Full sentence
enc_outputs = [
self.encoder(
enc_input, src_mask=src_slf_attn_bias)
]
else:
# Wait-k policy
enc_outputs = []
for i in range(self.waitk, src_max_len + 1):
enc_output = self.encoder(
enc_input[:, :i, :],
src_mask=src_slf_attn_bias[:, :, :, :i])
enc_outputs.append(enc_output)
trg_emb = self.trg_word_embedding(trg_word)
trg_pos_emb = self.trg_pos_embedding(trg_pos)
trg_emb = trg_emb + trg_pos_emb
dec_input = F.dropout(
trg_emb, p=self.dropout,
training=self.training) if self.dropout else trg_emb
dec_output = self.decoder(
dec_input,
enc_outputs,
tgt_mask=trg_slf_attn_bias,
memory_mask=trg_src_attn_bias)
predict = self.linear(dec_output)
return predict
def beam_search(self, src_word, beam_size=4, max_len=256, waitk=-1):
# TODO: "Speculative Beam Search for Simultaneous Translation"
raise NotImplementedError
def greedy_search(self,
src_word,
max_len=256,
waitk=-1,
caches=None,
bos_id=None):
"""
        greedy_search uses a streaming reader. The encoder does not have to be
        run repeatedly: each incoming sub-sentence only needs one additional
        encoder call. It therefore takes the previous decoder states (caches)
        and the id of the last token generated in the previous step.
"""
src_max_len = paddle.shape(src_word)[-1]
base_attn_bias = paddle.cast(
src_word == self.bos_id,
dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e9
src_slf_attn_bias = base_attn_bias
src_slf_attn_bias.stop_gradient = True
trg_src_attn_bias = paddle.tile(base_attn_bias, [1, 1, 1, 1])
src_pos = paddle.cast(
src_word != self.bos_id, dtype="int64") * paddle.arange(
start=0, end=src_max_len)
src_emb = self.src_word_embedding(src_word)
src_pos_emb = self.src_pos_embedding(src_pos)
src_emb = src_emb + src_pos_emb
enc_input = F.dropout(
src_emb, p=self.dropout,
training=self.training) if self.dropout else src_emb
enc_outputs = [self.encoder(enc_input, src_mask=src_slf_attn_bias)]
# constant number
batch_size = enc_outputs[-1].shape[0]
max_len = (
enc_outputs[-1].shape[1] + 20) if max_len is None else max_len
end_token_tensor = paddle.full(
shape=[batch_size, 1], fill_value=self.eos_id, dtype="int64")
predict_ids = []
log_probs = paddle.full(
shape=[batch_size, 1], fill_value=0, dtype="float32")
if not bos_id:
trg_word = paddle.full(
shape=[batch_size, 1], fill_value=self.bos_id, dtype="int64")
else:
trg_word = paddle.full(
shape=[batch_size, 1], fill_value=bos_id, dtype="int64")
# init states (caches) for transformer
if not caches:
caches = self.decoder.gen_cache(enc_outputs[-1], do_zip=False)
for i in range(max_len):
trg_pos = paddle.full(
shape=trg_word.shape, fill_value=i, dtype="int64")
trg_emb = self.trg_word_embedding(trg_word)
trg_pos_emb = self.trg_pos_embedding(trg_pos)
trg_emb = trg_emb + trg_pos_emb
dec_input = F.dropout(
trg_emb, p=self.dropout,
training=self.training) if self.dropout else trg_emb
if waitk < 0 or i >= len(enc_outputs):
                # If decoding with the full sentence (waitk < 0), or the decoder step
                # exceeds the number of available source prefixes, attend to the whole
                # source encoding
_e = enc_outputs[-1]
dec_output, caches = self.decoder(
dec_input, [_e], None,
trg_src_attn_bias[:, :, :, :_e.shape[1]], caches)
else:
_e = enc_outputs[i]
dec_output, caches = self.decoder(
dec_input, [_e], None,
trg_src_attn_bias[:, :, :, :_e.shape[1]], caches)
dec_output = paddle.reshape(
dec_output, shape=[-1, dec_output.shape[-1]])
logits = self.linear(dec_output)
step_log_probs = paddle.log(F.softmax(logits, axis=-1))
log_probs = paddle.add(x=step_log_probs, y=log_probs)
scores = log_probs
topk_scores, topk_indices = paddle.topk(x=scores, k=1)
finished = paddle.equal(topk_indices, end_token_tensor)
trg_word = topk_indices
log_probs = topk_scores
predict_ids.append(topk_indices)
if paddle.all(finished).numpy():
break
predict_ids = paddle.stack(predict_ids, axis=0)
finished_seq = paddle.transpose(predict_ids, [1, 2, 0])
finished_scores = topk_scores
return finished_seq, finished_scores, caches
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import jieba
import paddle
from paddlenlp.transformers import position_encoding_init
from paddlenlp.transformers import WordEmbedding, PositionalEmbedding
from paddlehub.env import MODULE_HOME
from paddlehub.module.module import moduleinfo, serving
from transformer_nist_wait_3.model import SimultaneousTransformer
from transformer_nist_wait_3.processor import STACLTokenizer, predict
@moduleinfo(
name="transformer_nist_wait_3",
version="1.0.0",
summary="",
author="PaddlePaddle",
author_email="",
type="nlp/simultaneous_translation",
)
class STTransformer():
"""
Transformer model for simultaneous translation.
"""
# Model config
model_config = {
# Number of head used in multi-head attention.
"n_head": 8,
# Number of sub-layers to be stacked in the encoder and decoder.
"n_layer": 6,
# The dimension for word embeddings, which is also the last dimension of
# the input and output of multi-head attention, position-wise feed-forward
# networks, encoder and decoder.
"d_model": 512,
}
def __init__(self,
max_length=256,
max_out_len=256,
):
super(STTransformer, self).__init__()
bpe_codes_fpath = os.path.join(MODULE_HOME, "transformer_nist_wait_3", "assets", "2M.zh2en.dict4bpe.zh")
src_vocab_fpath = os.path.join(MODULE_HOME, "transformer_nist_wait_3", "assets", "nist.20k.zh.vocab")
trg_vocab_fpath = os.path.join(MODULE_HOME, "transformer_nist_wait_3", "assets", "nist.10k.en.vocab")
params_fpath = os.path.join(MODULE_HOME, "transformer_nist_wait_3", "assets", "transformer.pdparams")
self.max_length = max_length
self.max_out_len = max_out_len
self.tokenizer = STACLTokenizer(
bpe_codes_fpath,
src_vocab_fpath,
trg_vocab_fpath,
)
src_vocab_size = self.tokenizer.src_vocab_size
trg_vocab_size = self.tokenizer.trg_vocab_size
self.transformer = SimultaneousTransformer(
src_vocab_size,
trg_vocab_size,
max_length=self.max_length,
n_layer=self.model_config['n_layer'],
n_head=self.model_config['n_head'],
d_model=self.model_config['d_model'],
)
model_dict = paddle.load(params_fpath)
# To avoid a longer length than training, reset the size of position
# encoding to max_length
model_dict["src_pos_embedding.pos_encoder.weight"] = position_encoding_init(
self.max_length + 1, self.model_config['d_model'])
model_dict["trg_pos_embedding.pos_encoder.weight"] = position_encoding_init(
self.max_length + 1, self.model_config['d_model'])
self.transformer.load_dict(model_dict)
@serving
def translate(self, text, use_gpu=False):
paddle.set_device('gpu') if use_gpu else paddle.set_device('cpu')
# Word segmentation
text = ' '.join(jieba.cut(text))
# For decoding max length
decoder_max_length = 1
# For decoding cache
cache = None
# For decoding start token id
bos_id = None
# Current source word index
i = 0
        # Whether the current source token ends the sentence; if True, decode up to max_out_len
is_last = False
# Tokenized id
user_input_tokenized = []
# Store the translation
result = []
bpe_str, tokenized_src = self.tokenizer.tokenize(text)
while i < len(tokenized_src):
user_input_tokenized.append(tokenized_src[i])
if bpe_str[i] in ['。', '?', '!']:
is_last = True
result, cache, bos_id = predict(
user_input_tokenized,
decoder_max_length,
is_last,
cache,
bos_id,
result,
self.tokenizer,
self.transformer,
max_out_len=self.max_out_len)
i += 1
return " ".join(result)
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle
from paddlenlp.data import Vocab
from subword_nmt import subword_nmt
class STACLTokenizer:
"""
Jieba+BPE, and convert tokens to ids.
"""
def __init__(self,
bpe_codes_fpath,
src_vocab_fpath,
trg_vocab_fpath,
special_token=["<s>", "<e>", "<unk>"]):
bpe_parser = subword_nmt.create_apply_bpe_parser()
bpe_args = bpe_parser.parse_args(args=['-c', bpe_codes_fpath])
self.bpe = subword_nmt.BPE(bpe_args.codes, bpe_args.merges,
bpe_args.separator, None,
bpe_args.glossaries)
self.src_vocab = Vocab.load_vocabulary(
src_vocab_fpath,
bos_token=special_token[0],
eos_token=special_token[1],
unk_token=special_token[2])
self.trg_vocab = Vocab.load_vocabulary(
trg_vocab_fpath,
bos_token=special_token[0],
eos_token=special_token[1],
unk_token=special_token[2])
self.src_vocab_size = len(self.src_vocab)
self.trg_vocab_size = len(self.trg_vocab)
def tokenize(self, text):
bpe_str = self.bpe.process_line(text)
ids = self.src_vocab.to_indices(bpe_str.split())
return bpe_str.split(), ids
def post_process_seq(seq,
bos_idx=0,
eos_idx=1,
output_bos=False,
output_eos=False):
"""
Post-process the decoded sequence.
"""
eos_pos = len(seq) - 1
for i, idx in enumerate(seq):
if idx == eos_idx:
eos_pos = i
break
seq = [
idx for idx in seq[:eos_pos + 1]
if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)
]
return seq
def predict(tokenized_src,
decoder_max_length,
is_last,
cache,
bos_id,
result,
tokenizer,
transformer,
n_best=1,
max_out_len=256,
eos_idx=1,
waitk=3,
):
# Set evaluate mode
transformer.eval()
if len(tokenized_src) < waitk:
return result, cache, bos_id
with paddle.no_grad():
paddle.disable_static()
input_src = tokenized_src
if is_last:
decoder_max_length = max_out_len
input_src += [eos_idx]
src_word = paddle.to_tensor(input_src).unsqueeze(axis=0)
finished_seq, finished_scores, cache = transformer.greedy_search(
src_word,
max_len=decoder_max_length,
waitk=waitk,
caches=cache,
bos_id=bos_id)
finished_seq = finished_seq.numpy()
for beam_idx, beam in enumerate(finished_seq[0]):
if beam_idx >= n_best:
break
id_list = post_process_seq(beam)
if len(id_list) == 0:
continue
bos_id = id_list[-1]
word_list = tokenizer.trg_vocab.to_tokens(id_list)
for word in word_list:
result.append(word)
res = ' '.join(word_list).replace('@@ ', '')
paddle.enable_static()
return result, cache, bos_id
# transformer_nist_wait_5
|Model Name|transformer_nist_wait_5|
| :--- | :---: |
|Category|Simultaneous Translation|
|Network|transformer|
|Dataset|NIST 2008 Chinese-English translation dataset|
|Fine-tuning supported|No|
|Module Size|377MB|
|Latest update date|2021-09-17|
|Data indicators|-|
## I. Basic Information
- ### Module Introduction
- Simultaneous translation means translating before the source sentence is complete; its goal is to automate simultaneous interpretation, producing the translation in parallel with the incoming source language with a latency of only a few seconds.
STACL is a translation architecture for simultaneous translation proposed in the paper [STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework](https://www.aclweb.org/anthology/P19-1289/) and is applicable to all simultaneous-translation scenarios.
- The main advantages of STACL are:
- The prefix-to-prefix architecture has anticipation ability: it can emit a target word before the corresponding source word has been seen, overcoming word-order differences such as SOV→SVO
<p align="center">
<img src="https://user-images.githubusercontent.com/40840292/133761990-13e55d0f-5c3a-476c-8865-5808d13cba97.png"> <br />
</p>
The main difference from a conventional machine translation model is whether the full source sentence is required at translation time. In the figure above, the Seq2Seq model must wait until the entire source sentence (1-5) has been fed into the Encoder before the Decoder starts decoding, whereas the STACL architecture adopts a wait-k policy (wait-2 in the figure): as soon as only two source words (1 and 2) have been fed into the Encoder, the Decoder can start decoding and predict the first word of the target sentence.
- The wait-k policy can predict the target sentence without the full source sentence, enabling an arbitrary word-level latency while maintaining high translation quality.
<p align="center">
<img src="https://user-images.githubusercontent.com/40840292/133762098-6ea6f3ca-0d70-4a0a-981d-0fcc6f3cd96b.png"> <br />
</p>
The wait-k policy first waits for the first k source words and then translates concurrently with the rest of the source sentence, so the output always lags k words behind the input. It is inspired by human simultaneous interpreters, who usually start translating a few seconds into the speaker's speech and finish a few seconds after the speaker ends. For example, with k=2 the first target word is predicted from the first 2 source words, the second target word from the first 3 source words, and so on. In the figure above, (a) simultaneous: our wait-2 starts decoding and predicts "pres." as soon as "布什" and "总统" have been read, while (b) non-simultaneous baseline is a conventional translation model that only starts decoding after the whole sentence "布什 总统 在 莫斯科 与 普京 会晤" has been read.
- This PaddleHub Module is based on the transformer network and uses the wait-5 policy for Chinese-to-English translation.
## II. Installation
- ### 1. Environment Dependencies
- paddlepaddle >= 2.1.0
- paddlehub >= 2.1.0 | [How to install PaddleHub](../../../../docs/docs_ch/get_start/installation.rst)
- ### 2. Installation
- ```shell
$ hub install transformer_nist_wait_5
```
- If you have trouble installing, please refer to: [Windows quickstart](../../../../docs/docs_ch/get_start/windows_quickstart.md)
| [Linux quickstart](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [MacOS quickstart](../../../../docs/docs_ch/get_start/mac_quickstart.md)
## III. Module API Prediction
- ### 1. Prediction Code Example
- ```python
import paddlehub as hub
model = hub.Module(name="transformer_nist_wait_5")
# Input data (simulating streaming input for simultaneous interpretation)
text = [
"他",
"他还",
"他还说",
"他还说现在",
"他还说现在正在",
"他还说现在正在为",
"他还说现在正在为这",
"他还说现在正在为这一",
"他还说现在正在为这一会议",
"他还说现在正在为这一会议作出",
"他还说现在正在为这一会议作出安排",
"他还说现在正在为这一会议作出安排。",
]
for t in text:
print("input: {}".format(t))
result = model.translate(t)
print("model output: {}\n".format(result))
# input: 他
# model output:
#
# input: 他还
# model output:
#
# input: 他还说
# model output:
#
# input: 他还说现在
# model output:
#
# input: 他还说现在正在
# model output: he
#
# input: 他还说现在正在为
# model output: he also
#
# input: 他还说现在正在为这
# model output: he also said
#
# input: 他还说现在正在为这一
# model output: he also said that
#
# input: 他还说现在正在为这一会议
# model output: he also said that he
#
# input: 他还说现在正在为这一会议作出
# model output: he also said that he was
#
# input: 他还说现在正在为这一会议作出安排
# model output: he also said that he was making
#
# input: 他还说现在正在为这一会议作出安排。
# model output: he also said that he was making arrangements for this meeting .
```
- ### 2. API
- ```python
__init__(max_length=256, max_out_len=256)
```
- Initializes the module; the maximum length of the input text can be configured.
- **Parameters**
- max_length(int): the maximum length of the input text, default 256.
- max_out_len(int): the maximum decoding length of the output text; content beyond this length is truncated, default 256.
- ```python
translate(text, use_gpu=False)
```
- Prediction API: takes source-language text (simulating streaming speech input in simultaneous interpretation) and returns the decoded target-language translation; a minimal one-shot usage sketch follows at the end of this section.
- **Parameters**
- text(str): the source-language input text, of type str
- use_gpu(bool): whether to use the GPU for prediction, default False
- **Return**
- result(str): the translated target-language text.
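- The snippet below is a minimal one-shot usage sketch (illustrative): translate also accepts a complete sentence in a single call, and the module still decodes it internally with the wait-5 policy, so the output equals the final step of the streaming example above.
- ```python
import paddlehub as hub

model = hub.Module(name="transformer_nist_wait_5")
# Single call with the complete sentence.
print(model.translate("他还说现在正在为这一会议作出安排。"))
# he also said that he was making arrangements for this meeting .
```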
## IV. Server Deployment
- PaddleHub Serving can deploy an online simultaneous translation service, and the API can be used in online web applications.
- ### Step 1: Start the PaddleHub Serving service
- Run the start command:
- ```shell
$ hub serving start -m transformer_nist_wait_5
```
- The model loading process is shown during startup; once the service has started successfully, it prints:
- ```shell
Loading transformer_nist_wait_5 successful.
```
- This completes the deployment of the serving API; the default port is 8866.
- **NOTE:** To run prediction on GPU, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise it does not need to be set.
- ### Step 2: Send a prediction request
- With the server configured, the few lines of code below send a prediction request and fetch the result
- ```python
import requests
import json
# Input data (simulating streaming input for simultaneous interpretation)
text = [
"他",
"他还",
"他还说",
"他还说现在",
"他还说现在正在",
"他还说现在正在为",
"他还说现在正在为这",
"他还说现在正在为这一",
"他还说现在正在为这一会议",
"他还说现在正在为这一会议作出",
"他还说现在正在为这一会议作出安排",
"他还说现在正在为这一会议作出安排。",
]
# Specify the prediction method as transformer_nist_wait_5 and send a POST request; the content-type should be json
# HOST_IP is the server's IP address
url = "http://HOST_IP:8866/predict/transformer_nist_wait_5"
headers = {"Content-Type": "application/json"}
for t in text:
    print("input: {}".format(t))
    r = requests.post(url=url, headers=headers, data=json.dumps(t))
    # Print the prediction result returned by the service
    print("model output: {}\n".format(r.json()))
```
- For more information on PaddleHub Serving, please refer to: [Serving Deployment](../../../../docs/docs_ch/tutorial/serving.md)
## V. Release Note
* 1.0.0
  First release
```shell
hub install transformer_nist_wait_5==1.0.0
```
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddlenlp.transformers import WordEmbedding, PositionalEmbedding
class DecoderLayer(nn.TransformerDecoderLayer):
def __init__(self, *args, **kwargs):
super(DecoderLayer, self).__init__(*args, **kwargs)
def forward(self, tgt, memory, tgt_mask=None, memory_mask=None, cache=None):
residual = tgt
if self.normalize_before:
tgt = self.norm1(tgt)
if cache is None:
tgt = self.self_attn(tgt, tgt, tgt, tgt_mask, None)
else:
tgt, incremental_cache = self.self_attn(tgt, tgt, tgt, tgt_mask,
cache[0])
tgt = residual + self.dropout1(tgt)
if not self.normalize_before:
tgt = self.norm1(tgt)
residual = tgt
if self.normalize_before:
tgt = self.norm2(tgt)
if len(memory) == 1:
# Full sent
tgt = self.cross_attn(tgt, memory[0], memory[0], memory_mask, None)
else:
# Wait-k policy
cross_attn_outputs = []
for i in range(tgt.shape[1]):
q = tgt[:, i:i + 1, :]
if i >= len(memory):
e = memory[-1]
else:
e = memory[i]
cross_attn_outputs.append(
self.cross_attn(q, e, e, memory_mask[:, :, i:i + 1, :
e.shape[1]], None))
tgt = paddle.concat(cross_attn_outputs, axis=1)
tgt = residual + self.dropout2(tgt)
if not self.normalize_before:
tgt = self.norm2(tgt)
residual = tgt
if self.normalize_before:
tgt = self.norm3(tgt)
tgt = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
tgt = residual + self.dropout3(tgt)
if not self.normalize_before:
tgt = self.norm3(tgt)
return tgt if cache is None else (tgt, (incremental_cache, ))
class Decoder(nn.TransformerDecoder):
"""
    PaddlePaddle 2.1 casts memory_mask.dtype to memory.dtype, but in STACL
    memory is a list of encoder outputs (one per source prefix) and has no
    dtype attribute, so forward is overridden here.
"""
def forward(self, tgt, memory, tgt_mask=None, memory_mask=None, cache=None):
output = tgt
new_caches = []
for i, mod in enumerate(self.layers):
if cache is None:
output = mod(output,
memory,
tgt_mask=tgt_mask,
memory_mask=memory_mask,
cache=None)
else:
output, new_cache = mod(output,
memory,
tgt_mask=tgt_mask,
memory_mask=memory_mask,
cache=cache[i])
new_caches.append(new_cache)
if self.norm is not None:
output = self.norm(output)
return output if cache is None else (output, new_caches)
class SimultaneousTransformer(nn.Layer):
"""
    Simultaneous Transformer with the wait-k policy.
"""
def __init__(self,
src_vocab_size,
trg_vocab_size,
max_length=256,
n_layer=6,
n_head=8,
d_model=512,
d_inner_hid=2048,
dropout=0.1,
weight_sharing=False,
bos_id=0,
eos_id=1,
waitk=-1):
super(SimultaneousTransformer, self).__init__()
self.trg_vocab_size = trg_vocab_size
self.emb_dim = d_model
self.bos_id = bos_id
self.eos_id = eos_id
self.dropout = dropout
self.waitk = waitk
self.n_layer = n_layer
self.n_head = n_head
self.d_model = d_model
self.src_word_embedding = WordEmbedding(
vocab_size=src_vocab_size, emb_dim=d_model, bos_id=self.bos_id)
self.src_pos_embedding = PositionalEmbedding(
emb_dim=d_model, max_length=max_length+1)
if weight_sharing:
assert src_vocab_size == trg_vocab_size, (
"Vocabularies in source and target should be same for weight sharing."
)
self.trg_word_embedding = self.src_word_embedding
self.trg_pos_embedding = self.src_pos_embedding
else:
self.trg_word_embedding = WordEmbedding(
vocab_size=trg_vocab_size, emb_dim=d_model, bos_id=self.bos_id)
self.trg_pos_embedding = PositionalEmbedding(
emb_dim=d_model, max_length=max_length+1)
encoder_layer = nn.TransformerEncoderLayer(
d_model=d_model,
nhead=n_head,
dim_feedforward=d_inner_hid,
dropout=dropout,
activation='relu',
normalize_before=True,
bias_attr=[False, True])
encoder_norm = nn.LayerNorm(d_model)
self.encoder = nn.TransformerEncoder(
encoder_layer=encoder_layer, num_layers=n_layer, norm=encoder_norm)
decoder_layer = DecoderLayer(
d_model=d_model,
nhead=n_head,
dim_feedforward=d_inner_hid,
dropout=dropout,
activation='relu',
normalize_before=True,
bias_attr=[False, False, True])
decoder_norm = nn.LayerNorm(d_model)
self.decoder = Decoder(
decoder_layer=decoder_layer, num_layers=n_layer, norm=decoder_norm)
if weight_sharing:
self.linear = lambda x: paddle.matmul(
x=x, y=self.trg_word_embedding.word_embedding.weight, transpose_y=True)
else:
self.linear = nn.Linear(
in_features=d_model,
out_features=trg_vocab_size,
bias_attr=False)
def forward(self, src_word, trg_word):
src_max_len = paddle.shape(src_word)[-1]
trg_max_len = paddle.shape(trg_word)[-1]
base_attn_bias = paddle.cast(
src_word == self.bos_id,
dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e9
src_slf_attn_bias = base_attn_bias
src_slf_attn_bias.stop_gradient = True
trg_slf_attn_bias = paddle.tensor.triu(
(paddle.ones(
(trg_max_len, trg_max_len),
dtype=paddle.get_default_dtype()) * -np.inf),
1)
trg_slf_attn_bias.stop_gradient = True
trg_src_attn_bias = paddle.tile(base_attn_bias, [1, 1, trg_max_len, 1])
src_pos = paddle.cast(
src_word != self.bos_id, dtype="int64") * paddle.arange(
start=0, end=src_max_len)
trg_pos = paddle.cast(
trg_word != self.bos_id, dtype="int64") * paddle.arange(
start=0, end=trg_max_len)
src_emb = self.src_word_embedding(src_word)
src_pos_emb = self.src_pos_embedding(src_pos)
src_emb = src_emb + src_pos_emb
enc_input = F.dropout(
src_emb, p=self.dropout,
training=self.training) if self.dropout else src_emb
with paddle.static.amp.fp16_guard():
if self.waitk >= src_max_len or self.waitk == -1:
# Full sentence
enc_outputs = [
self.encoder(
enc_input, src_mask=src_slf_attn_bias)
]
else:
# Wait-k policy
enc_outputs = []
for i in range(self.waitk, src_max_len + 1):
enc_output = self.encoder(
enc_input[:, :i, :],
src_mask=src_slf_attn_bias[:, :, :, :i])
enc_outputs.append(enc_output)
trg_emb = self.trg_word_embedding(trg_word)
trg_pos_emb = self.trg_pos_embedding(trg_pos)
trg_emb = trg_emb + trg_pos_emb
dec_input = F.dropout(
trg_emb, p=self.dropout,
training=self.training) if self.dropout else trg_emb
dec_output = self.decoder(
dec_input,
enc_outputs,
tgt_mask=trg_slf_attn_bias,
memory_mask=trg_src_attn_bias)
predict = self.linear(dec_output)
return predict
def beam_search(self, src_word, beam_size=4, max_len=256, waitk=-1):
# TODO: "Speculative Beam Search for Simultaneous Translation"
raise NotImplementedError
def greedy_search(self,
src_word,
max_len=256,
waitk=-1,
caches=None,
bos_id=None):
"""
        greedy_search uses a streaming reader. The encoder does not have to be
        run repeatedly: each incoming sub-sentence only needs one additional
        encoder call. It therefore takes the previous decoder states (caches)
        and the id of the last token generated in the previous step.
"""
src_max_len = paddle.shape(src_word)[-1]
base_attn_bias = paddle.cast(
src_word == self.bos_id,
dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e9
src_slf_attn_bias = base_attn_bias
src_slf_attn_bias.stop_gradient = True
trg_src_attn_bias = paddle.tile(base_attn_bias, [1, 1, 1, 1])
src_pos = paddle.cast(
src_word != self.bos_id, dtype="int64") * paddle.arange(
start=0, end=src_max_len)
src_emb = self.src_word_embedding(src_word)
src_pos_emb = self.src_pos_embedding(src_pos)
src_emb = src_emb + src_pos_emb
enc_input = F.dropout(
src_emb, p=self.dropout,
training=self.training) if self.dropout else src_emb
enc_outputs = [self.encoder(enc_input, src_mask=src_slf_attn_bias)]
# constant number
batch_size = enc_outputs[-1].shape[0]
max_len = (
enc_outputs[-1].shape[1] + 20) if max_len is None else max_len
end_token_tensor = paddle.full(
shape=[batch_size, 1], fill_value=self.eos_id, dtype="int64")
predict_ids = []
log_probs = paddle.full(
shape=[batch_size, 1], fill_value=0, dtype="float32")
if not bos_id:
trg_word = paddle.full(
shape=[batch_size, 1], fill_value=self.bos_id, dtype="int64")
else:
trg_word = paddle.full(
shape=[batch_size, 1], fill_value=bos_id, dtype="int64")
# init states (caches) for transformer
if not caches:
caches = self.decoder.gen_cache(enc_outputs[-1], do_zip=False)
for i in range(max_len):
trg_pos = paddle.full(
shape=trg_word.shape, fill_value=i, dtype="int64")
trg_emb = self.trg_word_embedding(trg_word)
trg_pos_emb = self.trg_pos_embedding(trg_pos)
trg_emb = trg_emb + trg_pos_emb
dec_input = F.dropout(
trg_emb, p=self.dropout,
training=self.training) if self.dropout else trg_emb
if waitk < 0 or i >= len(enc_outputs):
                # If decoding with the full sentence (waitk < 0), or the decoder step
                # exceeds the number of available source prefixes, attend to the whole
                # source encoding
_e = enc_outputs[-1]
dec_output, caches = self.decoder(
dec_input, [_e], None,
trg_src_attn_bias[:, :, :, :_e.shape[1]], caches)
else:
_e = enc_outputs[i]
dec_output, caches = self.decoder(
dec_input, [_e], None,
trg_src_attn_bias[:, :, :, :_e.shape[1]], caches)
dec_output = paddle.reshape(
dec_output, shape=[-1, dec_output.shape[-1]])
logits = self.linear(dec_output)
step_log_probs = paddle.log(F.softmax(logits, axis=-1))
log_probs = paddle.add(x=step_log_probs, y=log_probs)
scores = log_probs
topk_scores, topk_indices = paddle.topk(x=scores, k=1)
finished = paddle.equal(topk_indices, end_token_tensor)
trg_word = topk_indices
log_probs = topk_scores
predict_ids.append(topk_indices)
if paddle.all(finished).numpy():
break
predict_ids = paddle.stack(predict_ids, axis=0)
finished_seq = paddle.transpose(predict_ids, [1, 2, 0])
finished_scores = topk_scores
return finished_seq, finished_scores, caches
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import jieba
import paddle
from paddlenlp.transformers import position_encoding_init
from paddlenlp.transformers import WordEmbedding, PositionalEmbedding
from paddlehub.env import MODULE_HOME
from paddlehub.module.module import moduleinfo, serving
from transformer_nist_wait_5.model import SimultaneousTransformer
from transformer_nist_wait_5.processor import STACLTokenizer, predict
@moduleinfo(
name="transformer_nist_wait_5",
version="1.0.0",
summary="",
author="PaddlePaddle",
author_email="",
type="nlp/simultaneous_translation",
)
class STTransformer():
"""
Transformer model for simultaneous translation.
"""
# Model config
model_config = {
# Number of heads used in multi-head attention.
"n_head": 8,
# Number of sub-layers to be stacked in the encoder and decoder.
"n_layer": 6,
# The dimension for word embeddings, which is also the last dimension of
# the input and output of multi-head attention, position-wise feed-forward
# networks, encoder and decoder.
"d_model": 512,
}
def __init__(self,
max_length=256,
max_out_len=256,
):
super(STTransformer, self).__init__()
bpe_codes_fpath = os.path.join(MODULE_HOME, "transformer_nist_wait_5", "assets", "2M.zh2en.dict4bpe.zh")
src_vocab_fpath = os.path.join(MODULE_HOME, "transformer_nist_wait_5", "assets", "nist.20k.zh.vocab")
trg_vocab_fpath = os.path.join(MODULE_HOME, "transformer_nist_wait_5", "assets", "nist.10k.en.vocab")
params_fpath = os.path.join(MODULE_HOME, "transformer_nist_wait_5", "assets", "transformer.pdparams")
self.max_length = max_length
self.max_out_len = max_out_len
self.tokenizer = STACLTokenizer(
bpe_codes_fpath,
src_vocab_fpath,
trg_vocab_fpath,
)
src_vocab_size = self.tokenizer.src_vocab_size
trg_vocab_size = self.tokenizer.trg_vocab_size
self.transformer = SimultaneousTransformer(
src_vocab_size,
trg_vocab_size,
max_length=self.max_length,
n_layer=self.model_config['n_layer'],
n_head=self.model_config['n_head'],
d_model=self.model_config['d_model'],
)
model_dict = paddle.load(params_fpath)
# To avoid a longer length than training, reset the size of position
# encoding to max_length
model_dict["src_pos_embedding.pos_encoder.weight"] = position_encoding_init(
self.max_length + 1, self.model_config['d_model'])
model_dict["trg_pos_embedding.pos_encoder.weight"] = position_encoding_init(
self.max_length + 1, self.model_config['d_model'])
self.transformer.load_dict(model_dict)
@serving
def translate(self, text, use_gpu=False):
paddle.set_device('gpu') if use_gpu else paddle.set_device('cpu')
# Word segmentation
text = ' '.join(jieba.cut(text))
# For decoding max length
decoder_max_length = 1
# For decoding cache
cache = None
# For decoding start token id
bos_id = None
# Current source word index
i = 0
# Whether the current source token ends the sentence; if so, decode up to max_out_len
is_last = False
# Tokenized id
user_input_tokenized = []
# Store the translation
result = []
bpe_str, tokenized_src = self.tokenizer.tokenize(text)
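# Feed source tokens one at a time to simulate streaming input; predict()
# decides whether enough source context has arrived to emit new target words.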
while i < len(tokenized_src):
user_input_tokenized.append(tokenized_src[i])
if bpe_str[i] in ['。', '?', '!']:
is_last = True
result, cache, bos_id = predict(
user_input_tokenized,
decoder_max_length,
is_last,
cache,
bos_id,
result,
self.tokenizer,
self.transformer,
max_out_len=self.max_out_len)
i += 1
return " ".join(result)
\ No newline at end of file
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle
from paddlenlp.data import Vocab
from subword_nmt import subword_nmt
class STACLTokenizer:
"""
Tokenize with Jieba + BPE, then convert tokens to ids.
"""
def __init__(self,
bpe_codes_fpath,
src_vocab_fpath,
trg_vocab_fpath,
special_token=["<s>", "<e>", "<unk>"]):
bpe_parser = subword_nmt.create_apply_bpe_parser()
bpe_args = bpe_parser.parse_args(args=['-c', bpe_codes_fpath])
self.bpe = subword_nmt.BPE(bpe_args.codes, bpe_args.merges,
bpe_args.separator, None,
bpe_args.glossaries)
self.src_vocab = Vocab.load_vocabulary(
src_vocab_fpath,
bos_token=special_token[0],
eos_token=special_token[1],
unk_token=special_token[2])
self.trg_vocab = Vocab.load_vocabulary(
trg_vocab_fpath,
bos_token=special_token[0],
eos_token=special_token[1],
unk_token=special_token[2])
self.src_vocab_size = len(self.src_vocab)
self.trg_vocab_size = len(self.trg_vocab)
def tokenize(self, text):
bpe_str = self.bpe.process_line(text)
ids = self.src_vocab.to_indices(bpe_str.split())
return bpe_str.split(), ids
def post_process_seq(seq,
bos_idx=0,
eos_idx=1,
output_bos=False,
output_eos=False):
"""
Post-process the decoded sequence.
"""
eos_pos = len(seq) - 1
for i, idx in enumerate(seq):
if idx == eos_idx:
eos_pos = i
break
seq = [
idx for idx in seq[:eos_pos + 1]
if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)
]
return seq
def predict(tokenized_src,
decoder_max_length,
is_last,
cache,
bos_id,
result,
tokenizer,
transformer,
n_best=1,
max_out_len=256,
eos_idx=1,
waitk=5,
):
# Set evaluate mode
transformer.eval()
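# Wait-k gate: no decoding starts until at least `waitk` source tokens have been read.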
if len(tokenized_src) < waitk:
return result, cache, bos_id
with paddle.no_grad():
paddle.disable_static()
input_src = tokenized_src
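# On the final chunk, append <eos> and allow decoding up to max_out_len to finish the sentence.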
if is_last:
decoder_max_length = max_out_len
input_src += [eos_idx]
src_word = paddle.to_tensor(input_src).unsqueeze(axis=0)
finished_seq, finished_scores, cache = transformer.greedy_search(
src_word,
max_len=decoder_max_length,
waitk=waitk,
caches=cache,
bos_id=bos_id)
finished_seq = finished_seq.numpy()
for beam_idx, beam in enumerate(finished_seq[0]):
if beam_idx >= n_best:
break
id_list = post_process_seq(beam)
if len(id_list) == 0:
continue
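# The last generated token seeds the next incremental decoding call.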
bos_id = id_list[-1]
word_list = tokenizer.trg_vocab.to_tokens(id_list)
for word in word_list:
result.append(word)
res = ' '.join(word_list).replace('@@ ', '')
paddle.enable_static()
return result, cache, bos_id
\ No newline at end of file
# transformer_nist_wait_7
|模型名称|transformer_nist_wait_7|
| :--- | :---: |
|类别|同声传译|
|网络|transformer|
|数据集|NIST 2008-中英翻译数据集|
|是否支持Fine-tuning|否|
|模型大小|377MB|
|最新更新日期|2021-09-17|
|数据指标|-|
## 一、模型基本信息
- ### 模型介绍
- 同声传译(Simultaneous Translation),即在句子完成之前进行翻译,同声传译的目标是实现同声传译的自动化,它可以与源语言同时翻译,延迟时间只有几秒钟。
STACL 是论文 [STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework](https://www.aclweb.org/anthology/P19-1289/) 中针对同传提出的适用于所有同传场景的翻译架构。
- STACL 主要具有以下优势:
- Prefix-to-Prefix架构拥有预测能力,即在未看到源词的情况下仍然可以翻译出对应的目标词,克服了SOV→SVO等词序差异
<p align="center">
<img src="https://user-images.githubusercontent.com/40840292/133761990-13e55d0f-5c3a-476c-8865-5808d13cba97.png"> <br />
</p>
和传统的机器翻译模型主要的区别在于翻译时是否需要利用全句的源句。上图中,Seq2Seq模型需要等到全句的源句(1-5)全部输入Encoder后,Decoder才开始解码进行翻译;而STACL架构采用了Wait-k(图中Wait-2)的策略,当源句只有两个词(1和2)输入到Encoder后,Decoder即可开始解码预测目标句的第一个词。
- Wait-k策略可以不需要全句的源句,直接预测目标句,可以实现任意的字级延迟,同时保持较高的翻译质量。
<p align="center">
<img src="https://user-images.githubusercontent.com/40840292/133762098-6ea6f3ca-0d70-4a0a-981d-0fcc6f3cd96b.png"> <br />
</p>
Wait-k策略首先等待源句单词,然后与源句的其余部分同时翻译,即输出总是隐藏在输入后面。这是受到同声传译人员的启发,同声传译人员通常会在几秒钟内开始翻译演讲者的演讲,在演讲者结束几秒钟后完成。例如,如果k=2,第一个目标词使用前2个源词预测,第二个目标词使用前3个源词预测,以此类推。上图中,(a)simultaneous: our wait-2 等到"布什"和"总统"输入后就开始解码预测"pres.",而(b) non-simultaneous baseline 为传统的翻译模型,需要等到整句"布什 总统 在 莫斯科 与 普京 会晤"才开始解码预测。
- 该PaddleHub Module基于transformer网络结构,采用wait-7策略进行中文到英文的翻译。
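- 下面给出一个仅作示意的最小代码片段(假设性示例,并非本模块的实现),用来说明 wait-k 策略下解码第 t 个目标词时可以利用的源端词数:
- ```python
# 示意用的辅助函数(非模块实现)
def visible_source_len(t, k, src_len):
    """解码第 t 个目标词(t 从 1 开始)时,wait-k 策略可见的源端词数。"""
    if k < 0:  # k = -1 表示等待全句(wait-all)
        return src_len
    return min(k + t - 1, src_len)

# 以 k=7、源句共 11 个词为例:第 1 个目标词可见前 7 个源词,
# 第 2 个目标词可见前 8 个源词,依此类推,直到看完整个源句。
for t in range(1, 6):
    print(t, visible_source_len(t, k=7, src_len=11))
```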
## 二、安装
- ### 1、环境依赖
- paddlepaddle >= 2.1.0
- paddlehub >= 2.1.0 | [如何安装PaddleHub](../../../../docs/docs_ch/get_start/installation.rst)
- ### 2、安装
- ```shell
$ hub install transformer_nist_wait_7
```
- 如您安装时遇到问题,可参考:[零基础windows安装](../../../../docs/docs_ch/get_start/windows_quickstart.md)
| [零基础Linux安装](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [零基础MacOS安装](../../../../docs/docs_ch/get_start/mac_quickstart.md)
## 三、模型API预测
- ### 1、预测代码示例
- ```python
import paddlehub as hub
model = hub.Module(name="transformer_nist_wait_7")
# 待预测数据(模拟同声传译实时输入)
text = [
"他",
"他还",
"他还说",
"他还说现在",
"他还说现在正在",
"他还说现在正在为",
"他还说现在正在为这",
"他还说现在正在为这一",
"他还说现在正在为这一会议",
"他还说现在正在为这一会议作出",
"他还说现在正在为这一会议作出安排",
"他还说现在正在为这一会议作出安排。",
]
for t in text:
print("input: {}".format(t))
result = model.translate(t)
print("model output: {}\n".format(result))
# input: 他
# model output:
#
# input: 他还
# model output:
#
# input: 他还说
# model output:
#
# input: 他还说现在
# model output:
#
# input: 他还说现在正在
# model output:
#
# input: 他还说现在正在为
# model output:
#
# input: 他还说现在正在为这
# model output: he
#
# input: 他还说现在正在为这一
# model output: he also
#
# input: 他还说现在正在为这一会议
# model output: he also said
#
# input: 他还说现在正在为这一会议作出
# model output: he also said that
#
# input: 他还说现在正在为这一会议作出安排
# model output: he also said that arrangements
#
# input: 他还说现在正在为这一会议作出安排。
# model output: he also said that arrangements are now being made for this meeting .
```
- ### 2、 API
- ```python
__init__(max_length=256, max_out_len=256)
```
- 初始化module,可配置模型输入文本的最大长度以及输出文本的最大解码长度
- **参数**
- max_length(int): 输入文本的最大长度,默认值为256。
- max_out_len(int): 输出文本的最大解码长度,超过最大解码长度时会截断句子的后半部分,默认值为256。
- ```python
translate(text, use_gpu=False)
```
- 预测API,输入源语言的文本(模拟同传语音输入),解码后输出翻译后的目标语言文本。
- **参数**
- text(str): 输入源语言的文本,数据类型为str
- use_gpu(bool): 是否使用gpu进行预测,默认为False
- **返回**
- result(str): 翻译后的目标语言文本。
## 四、服务部署
- PaddleHub Serving可以部署一个在线同声传译服务,可以将此接口用于在线web应用。
- ### 第一步:启动PaddleHub Serving
- 运行启动命令:
- ```shell
$ hub serving start -m transformer_nist_wait_7
```
- 启动时会显示加载模型过程,启动成功后显示
- ```shell
Loading transformer_nist_wait_7 successful.
```
- 这样就完成了服务化API的部署,默认端口号为8866。
- **NOTE:** 如使用GPU预测,则需要在启动服务之前设置CUDA_VISIBLE_DEVICES环境变量;否则无需设置。
- ### 第二步:发送预测请求
- 配置好服务端后,以下几行代码即可发送预测请求并获取预测结果
- ```python
import requests
import json
# 待预测数据(模拟同声传译实时输入)
text = [
"他",
"他还",
"他还说",
"他还说现在",
"他还说现在正在",
"他还说现在正在为",
"他还说现在正在为这",
"他还说现在正在为这一",
"他还说现在正在为这一会议",
"他还说现在正在为这一会议作出",
"他还说现在正在为这一会议作出安排",
"他还说现在正在为这一会议作出安排。",
]
# 指定预测方法为transformer_nist_wait_7并发送post请求,Content-Type需指定为application/json
# HOST_IP为服务器IP
url = "http://HOST_IP:8866/predict/transformer_nist_wait_7"
headers = {"Content-Type": "application/json"}
for t in text:
print("input: {}".format(t))
r = requests.post(url=url, headers=headers, data=json.dumps({"text": t}))
# 打印预测结果
print("model output: {}\n".format(r.json()["results"]))
```
- 关于PaddleHub Serving更多信息参考:[服务部署](../../../../docs/docs_ch/tutorial/serving.md)
## 五、更新历史
* 1.0.0
初始发布
```shell
hub install transformer_nist_wait_7==1.0.0
```
\ No newline at end of file
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddlenlp.transformers import WordEmbedding, PositionalEmbedding
class DecoderLayer(nn.TransformerDecoderLayer):
def __init__(self, *args, **kwargs):
super(DecoderLayer, self).__init__(*args, **kwargs)
def forward(self, tgt, memory, tgt_mask=None, memory_mask=None, cache=None):
residual = tgt
if self.normalize_before:
tgt = self.norm1(tgt)
if cache is None:
tgt = self.self_attn(tgt, tgt, tgt, tgt_mask, None)
else:
tgt, incremental_cache = self.self_attn(tgt, tgt, tgt, tgt_mask,
cache[0])
tgt = residual + self.dropout1(tgt)
if not self.normalize_before:
tgt = self.norm1(tgt)
residual = tgt
if self.normalize_before:
tgt = self.norm2(tgt)
if len(memory) == 1:
# Full sent
tgt = self.cross_attn(tgt, memory[0], memory[0], memory_mask, None)
else:
# Wait-k policy
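# Each target position i attends only to the encoder output computed from
# its own source prefix (memory[i], or the last one if i exceeds the list).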
cross_attn_outputs = []
for i in range(tgt.shape[1]):
q = tgt[:, i:i + 1, :]
if i >= len(memory):
e = memory[-1]
else:
e = memory[i]
cross_attn_outputs.append(
self.cross_attn(q, e, e, memory_mask[:, :, i:i + 1, :
e.shape[1]], None))
tgt = paddle.concat(cross_attn_outputs, axis=1)
tgt = residual + self.dropout2(tgt)
if not self.normalize_before:
tgt = self.norm2(tgt)
residual = tgt
if self.normalize_before:
tgt = self.norm3(tgt)
tgt = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
tgt = residual + self.dropout3(tgt)
if not self.normalize_before:
tgt = self.norm3(tgt)
return tgt if cache is None else (tgt, (incremental_cache, ))
class Decoder(nn.TransformerDecoder):
"""
PaddlePaddle 2.1 casts memory_mask.dtype to memory.dtype, but in STACL
the memory is a list, which has no dtype attribute.
"""
def forward(self, tgt, memory, tgt_mask=None, memory_mask=None, cache=None):
output = tgt
new_caches = []
for i, mod in enumerate(self.layers):
if cache is None:
output = mod(output,
memory,
tgt_mask=tgt_mask,
memory_mask=memory_mask,
cache=None)
else:
output, new_cache = mod(output,
memory,
tgt_mask=tgt_mask,
memory_mask=memory_mask,
cache=cache[i])
new_caches.append(new_cache)
if self.norm is not None:
output = self.norm(output)
return output if cache is None else (output, new_caches)
class SimultaneousTransformer(nn.Layer):
"""
Simultaneous Transformer supporting the wait-k policy.
"""
def __init__(self,
src_vocab_size,
trg_vocab_size,
max_length=256,
n_layer=6,
n_head=8,
d_model=512,
d_inner_hid=2048,
dropout=0.1,
weight_sharing=False,
bos_id=0,
eos_id=1,
waitk=-1):
super(SimultaneousTransformer, self).__init__()
self.trg_vocab_size = trg_vocab_size
self.emb_dim = d_model
self.bos_id = bos_id
self.eos_id = eos_id
self.dropout = dropout
self.waitk = waitk
self.n_layer = n_layer
self.n_head = n_head
self.d_model = d_model
self.src_word_embedding = WordEmbedding(
vocab_size=src_vocab_size, emb_dim=d_model, bos_id=self.bos_id)
self.src_pos_embedding = PositionalEmbedding(
emb_dim=d_model, max_length=max_length+1)
if weight_sharing:
assert src_vocab_size == trg_vocab_size, (
"Vocabularies in source and target should be same for weight sharing."
)
self.trg_word_embedding = self.src_word_embedding
self.trg_pos_embedding = self.src_pos_embedding
else:
self.trg_word_embedding = WordEmbedding(
vocab_size=trg_vocab_size, emb_dim=d_model, bos_id=self.bos_id)
self.trg_pos_embedding = PositionalEmbedding(
emb_dim=d_model, max_length=max_length+1)
encoder_layer = nn.TransformerEncoderLayer(
d_model=d_model,
nhead=n_head,
dim_feedforward=d_inner_hid,
dropout=dropout,
activation='relu',
normalize_before=True,
bias_attr=[False, True])
encoder_norm = nn.LayerNorm(d_model)
self.encoder = nn.TransformerEncoder(
encoder_layer=encoder_layer, num_layers=n_layer, norm=encoder_norm)
decoder_layer = DecoderLayer(
d_model=d_model,
nhead=n_head,
dim_feedforward=d_inner_hid,
dropout=dropout,
activation='relu',
normalize_before=True,
bias_attr=[False, False, True])
decoder_norm = nn.LayerNorm(d_model)
self.decoder = Decoder(
decoder_layer=decoder_layer, num_layers=n_layer, norm=decoder_norm)
if weight_sharing:
self.linear = lambda x: paddle.matmul(
x=x, y=self.trg_word_embedding.word_embedding.weight, transpose_y=True)
else:
self.linear = nn.Linear(
in_features=d_model,
out_features=trg_vocab_size,
bias_attr=False)
def forward(self, src_word, trg_word):
src_max_len = paddle.shape(src_word)[-1]
trg_max_len = paddle.shape(trg_word)[-1]
base_attn_bias = paddle.cast(
src_word == self.bos_id,
dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e9
src_slf_attn_bias = base_attn_bias
src_slf_attn_bias.stop_gradient = True
trg_slf_attn_bias = paddle.tensor.triu(
(paddle.ones(
(trg_max_len, trg_max_len),
dtype=paddle.get_default_dtype()) * -np.inf),
1)
trg_slf_attn_bias.stop_gradient = True
trg_src_attn_bias = paddle.tile(base_attn_bias, [1, 1, trg_max_len, 1])
src_pos = paddle.cast(
src_word != self.bos_id, dtype="int64") * paddle.arange(
start=0, end=src_max_len)
trg_pos = paddle.cast(
trg_word != self.bos_id, dtype="int64") * paddle.arange(
start=0, end=trg_max_len)
src_emb = self.src_word_embedding(src_word)
src_pos_emb = self.src_pos_embedding(src_pos)
src_emb = src_emb + src_pos_emb
enc_input = F.dropout(
src_emb, p=self.dropout,
training=self.training) if self.dropout else src_emb
with paddle.static.amp.fp16_guard():
if self.waitk >= src_max_len or self.waitk == -1:
# Full sentence
enc_outputs = [
self.encoder(
enc_input, src_mask=src_slf_attn_bias)
]
else:
# Wait-k policy
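# Encode every source prefix of length waitk, waitk+1, ..., src_max_len so
# that each decoding step can attend to exactly the prefix it is allowed to see.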
enc_outputs = []
for i in range(self.waitk, src_max_len + 1):
enc_output = self.encoder(
enc_input[:, :i, :],
src_mask=src_slf_attn_bias[:, :, :, :i])
enc_outputs.append(enc_output)
trg_emb = self.trg_word_embedding(trg_word)
trg_pos_emb = self.trg_pos_embedding(trg_pos)
trg_emb = trg_emb + trg_pos_emb
dec_input = F.dropout(
trg_emb, p=self.dropout,
training=self.training) if self.dropout else trg_emb
dec_output = self.decoder(
dec_input,
enc_outputs,
tgt_mask=trg_slf_attn_bias,
memory_mask=trg_src_attn_bias)
predict = self.linear(dec_output)
return predict
def beam_search(self, src_word, beam_size=4, max_len=256, waitk=-1):
# TODO: "Speculative Beam Search for Simultaneous Translation"
raise NotImplementedError
def greedy_search(self,
src_word,
max_len=256,
waitk=-1,
caches=None,
bos_id=None):
"""
greedy_search uses streaming reader. It doesn't need calling
encoder many times, an a sub-sentence just needs calling encoder once.
So, it needs previous state(caches) and last one of generated
tokens id last time.
"""
src_max_len = paddle.shape(src_word)[-1]
base_attn_bias = paddle.cast(
src_word == self.bos_id,
dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e9
src_slf_attn_bias = base_attn_bias
src_slf_attn_bias.stop_gradient = True
trg_src_attn_bias = paddle.tile(base_attn_bias, [1, 1, 1, 1])
src_pos = paddle.cast(
src_word != self.bos_id, dtype="int64") * paddle.arange(
start=0, end=src_max_len)
src_emb = self.src_word_embedding(src_word)
src_pos_emb = self.src_pos_embedding(src_pos)
src_emb = src_emb + src_pos_emb
enc_input = F.dropout(
src_emb, p=self.dropout,
training=self.training) if self.dropout else src_emb
enc_outputs = [self.encoder(enc_input, src_mask=src_slf_attn_bias)]
# constant number
batch_size = enc_outputs[-1].shape[0]
max_len = (
enc_outputs[-1].shape[1] + 20) if max_len is None else max_len
end_token_tensor = paddle.full(
shape=[batch_size, 1], fill_value=self.eos_id, dtype="int64")
predict_ids = []
log_probs = paddle.full(
shape=[batch_size, 1], fill_value=0, dtype="float32")
if not bos_id:
trg_word = paddle.full(
shape=[batch_size, 1], fill_value=self.bos_id, dtype="int64")
else:
trg_word = paddle.full(
shape=[batch_size, 1], fill_value=bos_id, dtype="int64")
# init states (caches) for transformer
if not caches:
caches = self.decoder.gen_cache(enc_outputs[-1], do_zip=False)
for i in range(max_len):
trg_pos = paddle.full(
shape=trg_word.shape, fill_value=i, dtype="int64")
trg_emb = self.trg_word_embedding(trg_word)
trg_pos_emb = self.trg_pos_embedding(trg_pos)
trg_emb = trg_emb + trg_pos_emb
dec_input = F.dropout(
trg_emb, p=self.dropout,
training=self.training) if self.dropout else trg_emb
if waitk < 0 or i >= len(enc_outputs):
# if the decoder step is full sent or longer than all source
# step, then read the whole src
_e = enc_outputs[-1]
dec_output, caches = self.decoder(
dec_input, [_e], None,
trg_src_attn_bias[:, :, :, :_e.shape[1]], caches)
else:
_e = enc_outputs[i]
dec_output, caches = self.decoder(
dec_input, [_e], None,
trg_src_attn_bias[:, :, :, :_e.shape[1]], caches)
dec_output = paddle.reshape(
dec_output, shape=[-1, dec_output.shape[-1]])
logits = self.linear(dec_output)
step_log_probs = paddle.log(F.softmax(logits, axis=-1))
log_probs = paddle.add(x=step_log_probs, y=log_probs)
scores = log_probs
topk_scores, topk_indices = paddle.topk(x=scores, k=1)
finished = paddle.equal(topk_indices, end_token_tensor)
trg_word = topk_indices
log_probs = topk_scores
predict_ids.append(topk_indices)
if paddle.all(finished).numpy():
break
predict_ids = paddle.stack(predict_ids, axis=0)
finished_seq = paddle.transpose(predict_ids, [1, 2, 0])
finished_scores = topk_scores
return finished_seq, finished_scores, caches
\ No newline at end of file
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import jieba
import paddle
from paddlenlp.transformers import position_encoding_init
from paddlenlp.transformers import WordEmbedding, PositionalEmbedding
from paddlehub.env import MODULE_HOME
from paddlehub.module.module import moduleinfo, serving
from transformer_nist_wait_7.model import SimultaneousTransformer
from transformer_nist_wait_7.processor import STACLTokenizer, predict
@moduleinfo(
name="transformer_nist_wait_7",
version="1.0.0",
summary="",
author="PaddlePaddle",
author_email="",
type="nlp/simultaneous_translation",
)
class STTransformer():
"""
Transformer model for simultaneous translation.
"""
# Model config
model_config = {
# Number of heads used in multi-head attention.
"n_head": 8,
# Number of sub-layers to be stacked in the encoder and decoder.
"n_layer": 6,
# The dimension for word embeddings, which is also the last dimension of
# the input and output of multi-head attention, position-wise feed-forward
# networks, encoder and decoder.
"d_model": 512,
}
def __init__(self,
max_length=256,
max_out_len=256,
):
super(STTransformer, self).__init__()
bpe_codes_fpath = os.path.join(MODULE_HOME, "transformer_nist_wait_7", "assets", "2M.zh2en.dict4bpe.zh")
src_vocab_fpath = os.path.join(MODULE_HOME, "transformer_nist_wait_7", "assets", "nist.20k.zh.vocab")
trg_vocab_fpath = os.path.join(MODULE_HOME, "transformer_nist_wait_7", "assets", "nist.10k.en.vocab")
params_fpath = os.path.join(MODULE_HOME, "transformer_nist_wait_7", "assets", "transformer.pdparams")
self.max_length = max_length
self.max_out_len = max_out_len
self.tokenizer = STACLTokenizer(
bpe_codes_fpath,
src_vocab_fpath,
trg_vocab_fpath,
)
src_vocab_size = self.tokenizer.src_vocab_size
trg_vocab_size = self.tokenizer.trg_vocab_size
self.transformer = SimultaneousTransformer(
src_vocab_size,
trg_vocab_size,
max_length=self.max_length,
n_layer=self.model_config['n_layer'],
n_head=self.model_config['n_head'],
d_model=self.model_config['d_model'],
)
model_dict = paddle.load(params_fpath)
# To avoid a longer length than training, reset the size of position
# encoding to max_length
model_dict["src_pos_embedding.pos_encoder.weight"] = position_encoding_init(
self.max_length + 1, self.model_config['d_model'])
model_dict["trg_pos_embedding.pos_encoder.weight"] = position_encoding_init(
self.max_length + 1, self.model_config['d_model'])
self.transformer.load_dict(model_dict)
@serving
def translate(self, text, use_gpu=False):
paddle.set_device('gpu') if use_gpu else paddle.set_device('cpu')
# Word segmentation
text = ' '.join(jieba.cut(text))
# For decoding max length
decoder_max_length = 1
# For decoding cache
cache = None
# For decoding start token id
bos_id = None
# Current source word index
i = 0
# Whether the current source token ends the sentence; if so, decode up to max_out_len
is_last = False
# Tokenized id
user_input_tokenized = []
# Store the translation
result = []
bpe_str, tokenized_src = self.tokenizer.tokenize(text)
while i < len(tokenized_src):
user_input_tokenized.append(tokenized_src[i])
if bpe_str[i] in ['。', '?', '!']:
is_last = True
result, cache, bos_id = predict(
user_input_tokenized,
decoder_max_length,
is_last,
cache,
bos_id,
result,
self.tokenizer,
self.transformer,
max_out_len=self.max_out_len)
i += 1
return " ".join(result)
\ No newline at end of file
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle
from paddlenlp.data import Vocab
from subword_nmt import subword_nmt
class STACLTokenizer:
"""
Tokenize with Jieba + BPE, then convert tokens to ids.
"""
def __init__(self,
bpe_codes_fpath,
src_vocab_fpath,
trg_vocab_fpath,
special_token=["<s>", "<e>", "<unk>"]):
bpe_parser = subword_nmt.create_apply_bpe_parser()
bpe_args = bpe_parser.parse_args(args=['-c', bpe_codes_fpath])
self.bpe = subword_nmt.BPE(bpe_args.codes, bpe_args.merges,
bpe_args.separator, None,
bpe_args.glossaries)
self.src_vocab = Vocab.load_vocabulary(
src_vocab_fpath,
bos_token=special_token[0],
eos_token=special_token[1],
unk_token=special_token[2])
self.trg_vocab = Vocab.load_vocabulary(
trg_vocab_fpath,
bos_token=special_token[0],
eos_token=special_token[1],
unk_token=special_token[2])
self.src_vocab_size = len(self.src_vocab)
self.trg_vocab_size = len(self.trg_vocab)
def tokenize(self, text):
bpe_str = self.bpe.process_line(text)
ids = self.src_vocab.to_indices(bpe_str.split())
return bpe_str.split(), ids
def post_process_seq(seq,
bos_idx=0,
eos_idx=1,
output_bos=False,
output_eos=False):
"""
Post-process the decoded sequence.
"""
eos_pos = len(seq) - 1
for i, idx in enumerate(seq):
if idx == eos_idx:
eos_pos = i
break
seq = [
idx for idx in seq[:eos_pos + 1]
if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)
]
return seq
def predict(tokenized_src,
decoder_max_length,
is_last,
cache,
bos_id,
result,
tokenizer,
transformer,
n_best=1,
max_out_len=256,
eos_idx=1,
waitk=7,
):
# Set evaluate mode
transformer.eval()
if len(tokenized_src) < waitk:
return result, cache, bos_id
with paddle.no_grad():
paddle.disable_static()
input_src = tokenized_src
if is_last:
decoder_max_length = max_out_len
input_src += [eos_idx]
src_word = paddle.to_tensor(input_src).unsqueeze(axis=0)
finished_seq, finished_scores, cache = transformer.greedy_search(
src_word,
max_len=decoder_max_length,
waitk=waitk,
caches=cache,
bos_id=bos_id)
finished_seq = finished_seq.numpy()
for beam_idx, beam in enumerate(finished_seq[0]):
if beam_idx >= n_best:
break
id_list = post_process_seq(beam)
if len(id_list) == 0:
continue
bos_id = id_list[-1]
word_list = tokenizer.trg_vocab.to_tokens(id_list)
for word in word_list:
result.append(word)
res = ' '.join(word_list).replace('@@ ', '')
paddle.enable_static()
return result, cache, bos_id
\ No newline at end of file
# transformer_nist_wait_all
|模型名称|transformer_nist_wait_all|
| :--- | :---: |
|类别|同声传译|
|网络|transformer|
|数据集|NIST 2008-中英翻译数据集|
|是否支持Fine-tuning|否|
|模型大小|377MB|
|最新更新日期|2021-09-17|
|数据指标|-|
## 一、模型基本信息
- ### 模型介绍
- 同声传译(Simultaneous Translation),即在句子完成之前进行翻译,同声传译的目标是实现同声传译的自动化,它可以与源语言同时翻译,延迟时间只有几秒钟。
STACL 是论文 [STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework](https://www.aclweb.org/anthology/P19-1289/) 中针对同传提出的适用于所有同传场景的翻译架构。
- STACL 主要具有以下优势:
- Prefix-to-Prefix架构拥有预测能力,即在未看到源词的情况下仍然可以翻译出对应的目标词,克服了SOV→SVO等词序差异
<p align="center">
<img src="https://user-images.githubusercontent.com/40840292/133761990-13e55d0f-5c3a-476c-8865-5808d13cba97.png"> <br />
</p>
和传统的机器翻译模型主要的区别在于翻译时是否需要利用全句的源句。上图中,Seq2Seq模型需要等到全句的源句(1-5)全部输入Encoder后,Decoder才开始解码进行翻译;而STACL架构采用了Wait-k(图中Wait-2)的策略,当源句只有两个词(1和2)输入到Encoder后,Decoder即可开始解码预测目标句的第一个词。
- Wait-k策略可以不需要全句的源句,直接预测目标句,可以实现任意的字级延迟,同时保持较高的翻译质量。
<p align="center">
<img src="https://user-images.githubusercontent.com/40840292/133762098-6ea6f3ca-0d70-4a0a-981d-0fcc6f3cd96b.png"> <br />
</p>
Wait-k策略首先等待源句单词,然后与源句的其余部分同时翻译,即输出总是隐藏在输入后面。这是受到同声传译人员的启发,同声传译人员通常会在几秒钟内开始翻译演讲者的演讲,在演讲者结束几秒钟后完成。例如,如果k=2,第一个目标词使用前2个源词预测,第二个目标词使用前3个源词预测,以此类推。上图中,(a)simultaneous: our wait-2 等到"布什"和"总统"输入后就开始解码预测"pres.",而(b) non-simultaneous baseline 为传统的翻译模型,需要等到整句"布什 总统 在 莫斯科 与 普京 会晤"才开始解码预测。
- 该PaddleHub Module基于transformer网络结构,采用的策略是等到全句结束再进行中文到英文的翻译,即waitk=-1。
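- 下面给出一个仅作示意的最小代码片段(假设性示例,并非本模块的实现),用来说明"全句等待"(waitk=-1)策略只有在读到句末标点后才触发一次完整解码:
- ```python
# 示意用的辅助函数(非模块实现)
SENTENCE_END = {"。", "?", "!"}

def should_decode(tokens_so_far):
    """wait-all(waitk=-1)策略:仅当读到句末标点时才开始解码。"""
    return bool(tokens_so_far) and tokens_so_far[-1] in SENTENCE_END

stream = ["他", "还", "说", "现在", "正在", "为", "这", "一", "会议", "作出", "安排", "。"]
prefix = []
for tok in stream:
    prefix.append(tok)
    print("读入: {:<4} 是否开始解码: {}".format(tok, should_decode(prefix)))
```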
## 二、安装
- ### 1、环境依赖
- paddlepaddle >= 2.1.0
- paddlehub >= 2.1.0 | [如何安装PaddleHub](../../../../docs/docs_ch/get_start/installation.rst)
- ### 2、安装
- ```shell
$ hub install transformer_nist_wait_all
```
- 如您安装时遇到问题,可参考:[零基础windows安装](../../../../docs/docs_ch/get_start/windows_quickstart.md)
| [零基础Linux安装](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [零基础MacOS安装](../../../../docs/docs_ch/get_start/mac_quickstart.md)
## 三、模型API预测
- ### 1、预测代码示例
- ```python
import paddlehub as hub
model = hub.Module(name="transformer_nist_wait_all")
# 待预测数据(模拟同声传译实时输入)
text = [
"他",
"他还",
"他还说",
"他还说现在",
"他还说现在正在",
"他还说现在正在为",
"他还说现在正在为这",
"他还说现在正在为这一",
"他还说现在正在为这一会议",
"他还说现在正在为这一会议作出",
"他还说现在正在为这一会议作出安排",
"他还说现在正在为这一会议作出安排。",
]
for t in text:
print("input: {}".format(t))
result = model.translate(t)
print("model output: {}\n".format(result))
# input: 他
# model output:
#
# input: 他还
# model output:
#
# input: 他还说
# model output:
#
# input: 他还说现在
# model output:
#
# input: 他还说现在正在
# model output:
#
# input: 他还说现在正在为
# model output:
#
# input: 他还说现在正在为这
# model output:
#
# input: 他还说现在正在为这一
# model output:
#
# input: 他还说现在正在为这一会议
# model output:
#
# input: 他还说现在正在为这一会议作出
# model output:
#
# input: 他还说现在正在为这一会议作出安排
# model output:
#
# input: 他还说现在正在为这一会议作出安排。
# model output: he also said that arrangements are now being made for this meeting .
```
- ### 2、 API
- ```python
__init__(max_length=256, max_out_len=256)
```
- 初始化module,可配置模型输入文本的最大长度以及输出文本的最大解码长度
- **参数**
- max_length(int): 输入文本的最大长度,默认值为256。
- max_out_len(int): 输出文本的最大解码长度,超过最大解码长度时会截断句子的后半部分,默认值为256。
- ```python
translate(text, use_gpu=False)
```
- 预测API,输入源语言的文本(模拟同传语音输入),解码后输出翻译后的目标语言文本。
- **参数**
- text(str): 输入源语言的文本,数据类型为str
- use_gpu(bool): 是否使用gpu进行预测,默认为False
- **返回**
- result(str): 翻译后的目标语言文本。
## 四、服务部署
- PaddleHub Serving可以部署一个在线同声传译服务,可以将此接口用于在线web应用。
- ### 第一步:启动PaddleHub Serving
- 运行启动命令:
- ```shell
$ hub serving start -m transformer_nist_wait_all
```
- 启动时会显示加载模型过程,启动成功后显示
- ```shell
Loading transformer_nist_wait_all successful.
```
- 这样就完成了服务化API的部署,默认端口号为8866。
- **NOTE:** 如使用GPU预测,则需要在启动服务之前设置CUDA_VISIBLE_DEVICES环境变量;否则无需设置。
- ### 第二步:发送预测请求
- 配置好服务端后,以下几行代码即可发送预测请求并获取预测结果
- ```python
import requests
import json
# 待预测数据(模拟同声传译实时输入)
text = [
"他",
"他还",
"他还说",
"他还说现在",
"他还说现在正在",
"他还说现在正在为",
"他还说现在正在为这",
"他还说现在正在为这一",
"他还说现在正在为这一会议",
"他还说现在正在为这一会议作出",
"他还说现在正在为这一会议作出安排",
"他还说现在正在为这一会议作出安排。",
]
# 指定预测方法为transformer_nist_wait_all并发送post请求,Content-Type需指定为application/json
# HOST_IP为服务器IP
url = "http://HOST_IP:8866/predict/transformer_nist_wait_all"
headers = {"Content-Type": "application/json"}
for t in text:
print("input: {}".format(t))
r = requests.post(url=url, headers=headers, data=json.dumps({"text": t}))
# 打印预测结果
print("model output: {}\n".format(r.json()["results"]))
```
- 关于PaddleHub Serving更多信息参考:[服务部署](../../../../docs/docs_ch/tutorial/serving.md)
## 五、更新历史
* 1.0.0
初始发布
```shell
hub install transformer_nist_wait_all==1.0.0
```
\ No newline at end of file
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddlenlp.transformers import WordEmbedding, PositionalEmbedding
class DecoderLayer(nn.TransformerDecoderLayer):
def __init__(self, *args, **kwargs):
super(DecoderLayer, self).__init__(*args, **kwargs)
def forward(self, tgt, memory, tgt_mask=None, memory_mask=None, cache=None):
residual = tgt
if self.normalize_before:
tgt = self.norm1(tgt)
if cache is None:
tgt = self.self_attn(tgt, tgt, tgt, tgt_mask, None)
else:
tgt, incremental_cache = self.self_attn(tgt, tgt, tgt, tgt_mask,
cache[0])
tgt = residual + self.dropout1(tgt)
if not self.normalize_before:
tgt = self.norm1(tgt)
residual = tgt
if self.normalize_before:
tgt = self.norm2(tgt)
if len(memory) == 1:
# Full sent
tgt = self.cross_attn(tgt, memory[0], memory[0], memory_mask, None)
else:
# Wait-k policy
cross_attn_outputs = []
for i in range(tgt.shape[1]):
q = tgt[:, i:i + 1, :]
if i >= len(memory):
e = memory[-1]
else:
e = memory[i]
cross_attn_outputs.append(
self.cross_attn(q, e, e, memory_mask[:, :, i:i + 1, :
e.shape[1]], None))
tgt = paddle.concat(cross_attn_outputs, axis=1)
tgt = residual + self.dropout2(tgt)
if not self.normalize_before:
tgt = self.norm2(tgt)
residual = tgt
if self.normalize_before:
tgt = self.norm3(tgt)
tgt = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
tgt = residual + self.dropout3(tgt)
if not self.normalize_before:
tgt = self.norm3(tgt)
return tgt if cache is None else (tgt, (incremental_cache, ))
class Decoder(nn.TransformerDecoder):
"""
PaddlePaddle 2.1 casts memory_mask.dtype to memory.dtype, but in STACL
the memory is a list, which has no dtype attribute.
"""
def forward(self, tgt, memory, tgt_mask=None, memory_mask=None, cache=None):
output = tgt
new_caches = []
for i, mod in enumerate(self.layers):
if cache is None:
output = mod(output,
memory,
tgt_mask=tgt_mask,
memory_mask=memory_mask,
cache=None)
else:
output, new_cache = mod(output,
memory,
tgt_mask=tgt_mask,
memory_mask=memory_mask,
cache=cache[i])
new_caches.append(new_cache)
if self.norm is not None:
output = self.norm(output)
return output if cache is None else (output, new_caches)
class SimultaneousTransformer(nn.Layer):
"""
Simultaneous Transformer supporting the wait-k policy.
"""
def __init__(self,
src_vocab_size,
trg_vocab_size,
max_length=256,
n_layer=6,
n_head=8,
d_model=512,
d_inner_hid=2048,
dropout=0.1,
weight_sharing=False,
bos_id=0,
eos_id=1,
waitk=-1):
super(SimultaneousTransformer, self).__init__()
self.trg_vocab_size = trg_vocab_size
self.emb_dim = d_model
self.bos_id = bos_id
self.eos_id = eos_id
self.dropout = dropout
self.waitk = waitk
self.n_layer = n_layer
self.n_head = n_head
self.d_model = d_model
self.src_word_embedding = WordEmbedding(
vocab_size=src_vocab_size, emb_dim=d_model, bos_id=self.bos_id)
self.src_pos_embedding = PositionalEmbedding(
emb_dim=d_model, max_length=max_length+1)
if weight_sharing:
assert src_vocab_size == trg_vocab_size, (
"Vocabularies in source and target should be same for weight sharing."
)
self.trg_word_embedding = self.src_word_embedding
self.trg_pos_embedding = self.src_pos_embedding
else:
self.trg_word_embedding = WordEmbedding(
vocab_size=trg_vocab_size, emb_dim=d_model, bos_id=self.bos_id)
self.trg_pos_embedding = PositionalEmbedding(
emb_dim=d_model, max_length=max_length+1)
encoder_layer = nn.TransformerEncoderLayer(
d_model=d_model,
nhead=n_head,
dim_feedforward=d_inner_hid,
dropout=dropout,
activation='relu',
normalize_before=True,
bias_attr=[False, True])
encoder_norm = nn.LayerNorm(d_model)
self.encoder = nn.TransformerEncoder(
encoder_layer=encoder_layer, num_layers=n_layer, norm=encoder_norm)
decoder_layer = DecoderLayer(
d_model=d_model,
nhead=n_head,
dim_feedforward=d_inner_hid,
dropout=dropout,
activation='relu',
normalize_before=True,
bias_attr=[False, False, True])
decoder_norm = nn.LayerNorm(d_model)
self.decoder = Decoder(
decoder_layer=decoder_layer, num_layers=n_layer, norm=decoder_norm)
if weight_sharing:
self.linear = lambda x: paddle.matmul(
x=x, y=self.trg_word_embedding.word_embedding.weight, transpose_y=True)
else:
self.linear = nn.Linear(
in_features=d_model,
out_features=trg_vocab_size,
bias_attr=False)
def forward(self, src_word, trg_word):
src_max_len = paddle.shape(src_word)[-1]
trg_max_len = paddle.shape(trg_word)[-1]
base_attn_bias = paddle.cast(
src_word == self.bos_id,
dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e9
src_slf_attn_bias = base_attn_bias
src_slf_attn_bias.stop_gradient = True
trg_slf_attn_bias = paddle.tensor.triu(
(paddle.ones(
(trg_max_len, trg_max_len),
dtype=paddle.get_default_dtype()) * -np.inf),
1)
trg_slf_attn_bias.stop_gradient = True
trg_src_attn_bias = paddle.tile(base_attn_bias, [1, 1, trg_max_len, 1])
src_pos = paddle.cast(
src_word != self.bos_id, dtype="int64") * paddle.arange(
start=0, end=src_max_len)
trg_pos = paddle.cast(
trg_word != self.bos_id, dtype="int64") * paddle.arange(
start=0, end=trg_max_len)
src_emb = self.src_word_embedding(src_word)
src_pos_emb = self.src_pos_embedding(src_pos)
src_emb = src_emb + src_pos_emb
enc_input = F.dropout(
src_emb, p=self.dropout,
training=self.training) if self.dropout else src_emb
with paddle.static.amp.fp16_guard():
if self.waitk >= src_max_len or self.waitk == -1:
# Full sentence
enc_outputs = [
self.encoder(
enc_input, src_mask=src_slf_attn_bias)
]
else:
# Wait-k policy
enc_outputs = []
for i in range(self.waitk, src_max_len + 1):
enc_output = self.encoder(
enc_input[:, :i, :],
src_mask=src_slf_attn_bias[:, :, :, :i])
enc_outputs.append(enc_output)
trg_emb = self.trg_word_embedding(trg_word)
trg_pos_emb = self.trg_pos_embedding(trg_pos)
trg_emb = trg_emb + trg_pos_emb
dec_input = F.dropout(
trg_emb, p=self.dropout,
training=self.training) if self.dropout else trg_emb
dec_output = self.decoder(
dec_input,
enc_outputs,
tgt_mask=trg_slf_attn_bias,
memory_mask=trg_src_attn_bias)
predict = self.linear(dec_output)
return predict
def beam_search(self, src_word, beam_size=4, max_len=256, waitk=-1):
# TODO: "Speculative Beam Search for Simultaneous Translation"
raise NotImplementedError
def greedy_search(self,
src_word,
max_len=256,
waitk=-1,
caches=None,
bos_id=None):
"""
greedy_search uses streaming reader. It doesn't need calling
encoder many times, an a sub-sentence just needs calling encoder once.
So, it needs previous state(caches) and last one of generated
tokens id last time.
"""
src_max_len = paddle.shape(src_word)[-1]
base_attn_bias = paddle.cast(
src_word == self.bos_id,
dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e9
src_slf_attn_bias = base_attn_bias
src_slf_attn_bias.stop_gradient = True
trg_src_attn_bias = paddle.tile(base_attn_bias, [1, 1, 1, 1])
src_pos = paddle.cast(
src_word != self.bos_id, dtype="int64") * paddle.arange(
start=0, end=src_max_len)
src_emb = self.src_word_embedding(src_word)
src_pos_emb = self.src_pos_embedding(src_pos)
src_emb = src_emb + src_pos_emb
enc_input = F.dropout(
src_emb, p=self.dropout,
training=self.training) if self.dropout else src_emb
enc_outputs = [self.encoder(enc_input, src_mask=src_slf_attn_bias)]
# constant number
batch_size = enc_outputs[-1].shape[0]
max_len = (
enc_outputs[-1].shape[1] + 20) if max_len is None else max_len
end_token_tensor = paddle.full(
shape=[batch_size, 1], fill_value=self.eos_id, dtype="int64")
predict_ids = []
log_probs = paddle.full(
shape=[batch_size, 1], fill_value=0, dtype="float32")
if not bos_id:
trg_word = paddle.full(
shape=[batch_size, 1], fill_value=self.bos_id, dtype="int64")
else:
trg_word = paddle.full(
shape=[batch_size, 1], fill_value=bos_id, dtype="int64")
# init states (caches) for transformer
if not caches:
caches = self.decoder.gen_cache(enc_outputs[-1], do_zip=False)
for i in range(max_len):
trg_pos = paddle.full(
shape=trg_word.shape, fill_value=i, dtype="int64")
trg_emb = self.trg_word_embedding(trg_word)
trg_pos_emb = self.trg_pos_embedding(trg_pos)
trg_emb = trg_emb + trg_pos_emb
dec_input = F.dropout(
trg_emb, p=self.dropout,
training=self.training) if self.dropout else trg_emb
if waitk < 0 or i >= len(enc_outputs):
# if the decoder step is full sent or longer than all source
# step, then read the whole src
_e = enc_outputs[-1]
dec_output, caches = self.decoder(
dec_input, [_e], None,
trg_src_attn_bias[:, :, :, :_e.shape[1]], caches)
else:
_e = enc_outputs[i]
dec_output, caches = self.decoder(
dec_input, [_e], None,
trg_src_attn_bias[:, :, :, :_e.shape[1]], caches)
dec_output = paddle.reshape(
dec_output, shape=[-1, dec_output.shape[-1]])
logits = self.linear(dec_output)
step_log_probs = paddle.log(F.softmax(logits, axis=-1))
log_probs = paddle.add(x=step_log_probs, y=log_probs)
scores = log_probs
topk_scores, topk_indices = paddle.topk(x=scores, k=1)
finished = paddle.equal(topk_indices, end_token_tensor)
trg_word = topk_indices
log_probs = topk_scores
predict_ids.append(topk_indices)
if paddle.all(finished).numpy():
break
predict_ids = paddle.stack(predict_ids, axis=0)
finished_seq = paddle.transpose(predict_ids, [1, 2, 0])
finished_scores = topk_scores
return finished_seq, finished_scores, caches
\ No newline at end of file
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import jieba
import paddle
from paddlenlp.transformers import position_encoding_init
from paddlenlp.transformers import WordEmbedding, PositionalEmbedding
from paddlehub.env import MODULE_HOME
from paddlehub.module.module import moduleinfo, serving
from transformer_nist_wait_all.model import SimultaneousTransformer
from transformer_nist_wait_all.processor import STACLTokenizer, predict
@moduleinfo(
name="transformer_nist_wait_all",
version="1.0.0",
summary="",
author="PaddlePaddle",
author_email="",
type="nlp/simultaneous_translation",
)
class STTransformer():
"""
Transformer model for simultaneous translation.
"""
# Model config
model_config = {
# Number of heads used in multi-head attention.
"n_head": 8,
# Number of sub-layers to be stacked in the encoder and decoder.
"n_layer": 6,
# The dimension for word embeddings, which is also the last dimension of
# the input and output of multi-head attention, position-wise feed-forward
# networks, encoder and decoder.
"d_model": 512,
}
def __init__(self,
max_length=256,
max_out_len=256,
):
super(STTransformer, self).__init__()
bpe_codes_fpath = os.path.join(MODULE_HOME, "transformer_nist_wait_all", "assets", "2M.zh2en.dict4bpe.zh")
src_vocab_fpath = os.path.join(MODULE_HOME, "transformer_nist_wait_all", "assets", "nist.20k.zh.vocab")
trg_vocab_fpath = os.path.join(MODULE_HOME, "transformer_nist_wait_all", "assets", "nist.10k.en.vocab")
params_fpath = os.path.join(MODULE_HOME, "transformer_nist_wait_all", "assets", "transformer.pdparams")
self.max_length = max_length
self.max_out_len = max_out_len
self.tokenizer = STACLTokenizer(
bpe_codes_fpath,
src_vocab_fpath,
trg_vocab_fpath,
)
src_vocab_size = self.tokenizer.src_vocab_size
trg_vocab_size = self.tokenizer.trg_vocab_size
self.transformer = SimultaneousTransformer(
src_vocab_size,
trg_vocab_size,
max_length=self.max_length,
n_layer=self.model_config['n_layer'],
n_head=self.model_config['n_head'],
d_model=self.model_config['d_model'],
)
model_dict = paddle.load(params_fpath)
# To avoid a longer length than training, reset the size of position
# encoding to max_length
model_dict["src_pos_embedding.pos_encoder.weight"] = position_encoding_init(
self.max_length + 1, self.model_config['d_model'])
model_dict["trg_pos_embedding.pos_encoder.weight"] = position_encoding_init(
self.max_length + 1, self.model_config['d_model'])
self.transformer.load_dict(model_dict)
@serving
def translate(self, text, use_gpu=False):
paddle.set_device('gpu') if use_gpu else paddle.set_device('cpu')
# Word segmentation
text = ' '.join(jieba.cut(text))
# For decoding max length
decoder_max_length = 1
# For decoding cache
cache = None
# For decoding start token id
bos_id = None
# Current source word index
i = 0
# Whether the current source token ends the sentence; if so, decode up to max_out_len
is_last = False
# Tokenized id
user_input_tokenized = []
# Store the translation
result = []
bpe_str, tokenized_src = self.tokenizer.tokenize(text)
while i < len(tokenized_src):
user_input_tokenized.append(tokenized_src[i])
if bpe_str[i] in ['。', '?', '!']:
is_last = True
result, cache, bos_id = predict(
user_input_tokenized,
decoder_max_length,
is_last,
cache,
bos_id,
result,
self.tokenizer,
self.transformer,
max_out_len=self.max_out_len)
i += 1
return " ".join(result)
\ No newline at end of file
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle
from paddlenlp.data import Vocab
from subword_nmt import subword_nmt
class STACLTokenizer:
"""
Tokenize with Jieba + BPE, then convert tokens to ids.
"""
def __init__(self,
bpe_codes_fpath,
src_vocab_fpath,
trg_vocab_fpath,
special_token=["<s>", "<e>", "<unk>"]):
bpe_parser = subword_nmt.create_apply_bpe_parser()
bpe_args = bpe_parser.parse_args(args=['-c', bpe_codes_fpath])
self.bpe = subword_nmt.BPE(bpe_args.codes, bpe_args.merges,
bpe_args.separator, None,
bpe_args.glossaries)
self.src_vocab = Vocab.load_vocabulary(
src_vocab_fpath,
bos_token=special_token[0],
eos_token=special_token[1],
unk_token=special_token[2])
self.trg_vocab = Vocab.load_vocabulary(
trg_vocab_fpath,
bos_token=special_token[0],
eos_token=special_token[1],
unk_token=special_token[2])
self.src_vocab_size = len(self.src_vocab)
self.trg_vocab_size = len(self.trg_vocab)
def tokenize(self, text):
bpe_str = self.bpe.process_line(text)
ids = self.src_vocab.to_indices(bpe_str.split())
return bpe_str.split(), ids
def post_process_seq(seq,
bos_idx=0,
eos_idx=1,
output_bos=False,
output_eos=False):
"""
Post-process the decoded sequence.
"""
eos_pos = len(seq) - 1
for i, idx in enumerate(seq):
if idx == eos_idx:
eos_pos = i
break
seq = [
idx for idx in seq[:eos_pos + 1]
if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)
]
return seq
def predict(tokenized_src,
decoder_max_length,
is_last,
cache,
bos_id,
result,
tokenizer,
transformer,
n_best=1,
max_out_len=256,
eos_idx=1,
waitk=-1,
):
# Set evaluate mode
transformer.eval()
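# Wait-all policy: decoding is deferred until the sentence-final punctuation arrives (is_last=True).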
if not is_last:
return result, cache, bos_id
with paddle.no_grad():
paddle.disable_static()
input_src = tokenized_src
if is_last:
decoder_max_length = max_out_len
input_src += [eos_idx]
src_word = paddle.to_tensor(input_src).unsqueeze(axis=0)
finished_seq, finished_scores, cache = transformer.greedy_search(
src_word,
max_len=decoder_max_length,
waitk=waitk,
caches=cache,
bos_id=bos_id)
finished_seq = finished_seq.numpy()
for beam_idx, beam in enumerate(finished_seq[0]):
if beam_idx >= n_best:
break
id_list = post_process_seq(beam)
if len(id_list) == 0:
continue
bos_id = id_list[-1]
word_list = tokenizer.trg_vocab.to_tokens(id_list)
for word in word_list:
result.append(word)
res = ' '.join(word_list).replace('@@ ', '')
paddle.enable_static()
return result, cache, bos_id
\ No newline at end of file