Commit edbe41ba authored by: G grasswolfs

test=pre-develop, test=documents_fix

Parent c093331e
@@ -4,7 +4,7 @@
![support os](https://img.shields.io/badge/os-linux%2C%20win%2C%20mac-yellow.svg)
## Introduction
- PaddleHub aims to give developers rich, high-quality, ready-to-use pre-trained models, so that AI models can be used quickly **[with no deep learning background and no data or training process required]**.
- Covers the four mainstream categories of CV, NLP, Audio, and Video, and supports **one-line prediction**, **one-line serving deployment**, and **fast transfer learning**.
- All models are open source and downloadable, and **can run offline**.
@@ -15,10 +15,10 @@
- **2020.09.27**: added 6 text generation models and 1 image segmentation model, bringing the total number of pre-trained models to **[154]**.
- **2020.08.13**: released v1.8.1, added the portrait segmentation model Humanseg and support for EMNLP2019-Sentence-BERT as a text matching network, bringing the total number of pre-trained models to **[147]**.
- **2020.07.29**: released v1.8.0, added AI couplets and AI poetry, jieba word segmentation, LDA for text data, and semantic similarity computation; added object detection, short-video classification, and ultra-lightweight Chinese/English OCR; added industrial-grade models such as pedestrian detection, vehicle detection, and animal recognition; added VisualDL training visualization; the total number of pre-trained models reaches **[135]**.
- [More](./docs/release.md)
## [Features](./docs/figures.md)
- **[Rich pre-trained models]**: 180+ pre-trained models covering the four mainstream categories of CV, NLP, Audio, and Video, all open source, downloadable, and runnable offline.
- **[One-line model prediction]**: invoke a model with a single command line or a minimal Python API and quickly see its results.
- **[One-line model serving]**: a single command deploys a deep learning model as an API service.
@@ -60,27 +60,27 @@
- [Online demos [Official]](https://github.com/PaddlePaddle/PaddleHub/tree/release/v1.8/demo)
- [Community demo projects [ThirdParty]](./docs/quick_experience/more_demos.md)
- Rich pre-trained models: 182 in total
  - [Featured models](./docs/figure.md)
  - Computer vision: 126
    - [Image classification: 64](./modules/image/classification/README.md)
    - [Object detection: 13](./modules/image/object_detection/README.md)
    - [Face detection: 7](./modules/image/face_detection/README.md)
    - [Keypoint detection: 3](./modules/image/keypoint_detection/README.md)
    - [Image segmentation: 7](./modules/image/semantic_segmentation/README.md)
    - [Text recognition: 8](./modules/image/text_recognition/README.md)
    - [Image generation: 17](./modules/image/Image_gan/README.md)
    - [Image editing: 7](./modules/image/Image_editing/README.md)
  - Natural language processing: 48
    - [Lexical analysis: 2](./modules/text/lexical_analysis/README.md)
    - [Syntactic analysis: 1](./modules/text/syntactic_analysis/README.md)
    - [Sentiment analysis: 7](./modules/text/sentiment_analysis/README.md)
    - [Text censorship: 3](./modules/text/text_review/README.md)
    - [Text generation: 9](./modules/text/text_generation/README.md)
    - [Semantic models: 26](./modules/text/language_model/README.md)
  - Audio: 3
    - [Speech synthesis: 3](./modules/audio/README.md)
  - Video: 5
    - [Video classification: 5](./modules/video/README.md)
- Deployment
  - [One-line serving deployment](./docs/tutorial/serving.md)
  - C++ Inference deployment (join the group chat for help)
...
## Features in Detail
<a name="丰富的预训练模型"></a>
### 1. Rich pre-trained models
- 1.1 Image

| | **Featured model examples** |
| ---------- | :----------------------------------------------------------- |
| Image classification | [Dish recognition](https://www.paddlepaddle.org.cn/hubdetail?name=resnet50_vd_dishes&en_category=ImageClassification), [Animal recognition](https://www.paddlepaddle.org.cn/hubdetail?name=resnet50_vd_animals&en_category=ImageClassification), [-->More](../modules/image/classification/README.md) |
| Object detection | [General detection](https://www.paddlepaddle.org.cn/hubdetail?name=yolov3_darknet53_coco2017&en_category=ObjectDetection), [Pedestrian detection](https://www.paddlepaddle.org.cn/hubdetail?name=yolov3_darknet53_pedestrian&en_category=ObjectDetection), [Vehicle detection](https://www.paddlepaddle.org.cn/hubdetail?name=yolov3_darknet53_vehicles&en_category=ObjectDetection), [-->More](../modules/image/object_detection/README.md) |
| Face detection | [Face detection](https://www.paddlepaddle.org.cn/hubdetail?name=pyramidbox_lite_server&en_category=FaceDetection), [Mask detection](https://www.paddlepaddle.org.cn/hubdetail?name=pyramidbox_lite_server_mask&en_category=FaceDetection), [-->More](../modules/image/face_detection/README.md) |
| Image segmentation | [Portrait segmentation](https://www.paddlepaddle.org.cn/hubdetail?name=deeplabv3p_xception65_humanseg&en_category=ImageSegmentation), [Human parsing](https://www.paddlepaddle.org.cn/hubdetail?name=ace2p&en_category=ImageSegmentation), [Pneumonia CT analysis](https://www.paddlepaddle.org.cn/hubdetail?name=Pneumonia_CT_LKM_PP&en_category=ImageSegmentation), [-->More](../modules/image/semantic_segmentation/README.md) |
| Keypoint detection | [Human body keypoints](https://www.paddlepaddle.org.cn/hubdetail?name=human_pose_estimation_resnet50_mpii&en_category=KeyPointDetection), [Facial keypoints](https://www.paddlepaddle.org.cn/hubdetail?name=face_landmark_localization&en_category=KeyPointDetection), [Hand keypoints](https://www.paddlepaddle.org.cn/hubdetail?name=hand_pose_localization&en_category=KeyPointDetection), [-->More](./modules/image/keypoint_detection/README.md) |
| Text recognition | [Ultra-lightweight Chinese/English OCR](https://www.paddlepaddle.org.cn/hubdetail?name=chinese_ocr_db_crnn_mobile&en_category=TextRecognition), [-->More](../modules/image/text_recognition/README.md) |
| Image generation | [Style transfer](https://www.paddlepaddle.org.cn/hubdetail?name=stylepro_artistic&en_category=GANs), [Street-view cartoonization](), [-->More](../modules/image/Image_gan/README.md) |
| Image editing | [Super resolution](https://www.paddlepaddle.org.cn/hubdetail?name=realsr&en_category=ImageEditing), [Black-and-white colorization](https://www.paddlepaddle.org.cn/hubdetail?name=deoldify&en_category=ImageEditing), [-->More](../modules/image/Image_editing/README.md) |
- 1.2 Text

| | **Featured model examples** |
| ---------- | :----------------------------------------------------------- |
| Word and sentence analysis | [Lexical analysis](https://www.paddlepaddle.org.cn/hubdetail?name=lac&en_category=LexicalAnalysis), [Syntactic analysis](https://www.paddlepaddle.org.cn/hubdetail?name=ddparser&en_category=SyntacticAnalysis), [-->More](../modules/text/lexical_analysis/README.md) |
| Sentiment analysis | [Sentiment classification](https://www.paddlepaddle.org.cn/hubdetail?name=lac&en_category=LexicalAnalysis), [Emotion analysis](https://www.paddlepaddle.org.cn/hubdetail?name=emotion_detection_textcnn&en_category=SentimentAnalysis), [-->More](../modules/text/sentiment_analysis/README.md) |
| Text censorship | [Pornography detection](https://www.paddlepaddle.org.cn/hubdetail?name=porn_detection_gru&en_category=TextCensorship), [-->More](../modules/text/text_review/README.md) |
| Text generation | [Couplet generation](), [Love-talk generation](), [Acrostic poem generation](), [Cheesy pickup lines](), [-->More](../modules/text/text_generation/README.md) |
| Semantic models | [ERNIE](https://www.paddlepaddle.org.cn/hubdetail?name=ERNIE&en_category=SemanticModel), [Text similarity](https://www.paddlepaddle.org.cn/hubdetail?name=simnet_bow&en_category=SemanticModel), [-->More](../modules/text/language_model/README.md) |
- 1.3 Audio

| | **Featured model examples** |
| ---------- | :----------------------------------------------------------- |
| Speech synthesis | [Speech synthesis](), [-->More](../modules/audio/README.md) |
- 1.4 Video

| | **Featured model examples** |
| ---------- | :----------------------------------------------------------- |
| Video classification | [Video classification](), [-->More](../modules/video/README.md) |
<a name="一键模型预测"></a>
### 2. One-line model prediction
* For example, with the lightweight Chinese OCR model chinese_ocr_db_crnn_mobile, the text in an image can be recognized with a single command.
```shell
$ pip install paddlehub
$ wget https://paddlehub.bj.bcebos.com/model/image/ocr/test_ocr.jpg
$ hub run chinese_ocr_db_crnn_mobile --input_path test_ocr.jpg --visualization=True
```
* The prediction result images are saved in the ocr_result folder under the current working directory, as shown below.
<p align="center">
  <img src="./imgs/ocr_res.jpg" width='70%' align="middle" />
</p>

* Use the lexical analysis model LAC for word segmentation:
```shell
$ hub run lac --input_text "现在,慕尼黑再保险公司不仅是此类行动的倡议者,更是将其大量气候数据整合进保险产品中,并与公众共享大量天气信息,参与到新能源领域的保障中。"
[{
'word': ['现在', ',', '慕尼黑再保险公司', '不仅', '是', '此类', '行动', '的', '倡议者', ',', '更是', '将', '其', '大量', '气候', '数据', '整合', '进', '保险', '产品', '中', ',', '并', '与', '公众', '共享', '大量', '天气', '信息', ',', '参与', '到', '新能源', '领域', '的', '保障', '中', '。'],
'tag': ['TIME', 'w', 'ORG', 'c', 'v', 'r', 'n', 'u', 'n', 'w', 'd', 'p', 'r', 'a', 'n', 'n', 'v', 'v', 'n', 'n', 'f', 'w', 'c', 'p', 'n', 'v', 'a', 'n', 'n', 'w', 'v', 'v', 'n', 'n', 'u', 'vn', 'f', 'w']
}]
```
Besides one-line command prediction, PaddleHub also supports calling models through a Python API; see each model's documentation for details.
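As a sketch of the Python API route, the same LAC model can be loaded with `hub.Module` and called on a list of texts. The `lexical_analysis` method name and the `word`/`tag` result fields follow the lac module's documentation and may differ slightly between PaddleHub versions.
```python
import paddlehub as hub

# Load the LAC lexical analysis module (downloaded automatically on first use).
lac = hub.Module(name="lac")

# lexical_analysis() and the 'word'/'tag' result fields follow the lac module
# docs; treat them as assumptions if your PaddleHub version differs.
results = lac.lexical_analysis(texts=["今天是个好日子"])
for item in results:
    print(item["word"], item["tag"])
```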
<a name="一键模型转服务"></a>
### 3. One-line model serving
PaddleHub makes it easy to turn a model into a service: a single command deploys the model behind an HTTP API. For example, the following command starts a serving instance of the chinese_ocr_db_crnn_mobile OCR model:
```shell
$ hub serving start -m chinese_ocr_db_crnn_mobile
```
For more on model serving, see [PaddleHub one-line serving deployment](./tutorial/serving.md). A minimal client sketch is shown below.
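Once the service is up it can be queried over HTTP. The port (8866) and the `/predict/chinese_ocr_db_crnn_mobile` route below are assumptions based on PaddleHub Serving defaults; the exact route and payload format depend on the PaddleHub version, so check the serving tutorial before relying on them.
```python
import base64
import json

import requests


def to_base64(image_path: str) -> str:
    """Read an image file and encode it as base64 for the request payload."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf8")


# Assumed default port and route for PaddleHub Serving; adjust to your version.
url = "http://127.0.0.1:8866/predict/chinese_ocr_db_crnn_mobile"
data = {"images": [to_base64("test_ocr.jpg")]}
headers = {"Content-type": "application/json"}

response = requests.post(url=url, headers=headers, data=json.dumps(data))
print(response.json())
```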
<a name="十行代码迁移学习"></a>
### 4. Transfer learning in ten lines of code
With the Fine-tune API, transfer learning of deep learning models for computer vision scenarios takes only a small amount of code; a schematic sketch follows the examples below.
* The [demo directory](../demo) provides rich sample code for the Fine-tune API, including transfer learning examples for [image classification](../demo/image_classification), [image colorization](../demo/colorization), [style transfer](../demo/style_transfer), and other scenarios.
<p align="center">
  <img src="./imgs/paddlehub_finetune.gif" align="middle" />
</p>
<p align='center'>
Industrial-grade text classification in ten lines of code
</p>
* For a quick online experience, open the [PaddleHub tutorial collection](https://aistudio.baidu.com/aistudio/projectdetail/231146) and use the free GPU compute provided by AI Studio.
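As a rough idea of what the Fine-tune API looks like, the sketch below is adapted from a typical PaddleHub 2.x image-classification fine-tuning demo. The `Flowers` dataset class, the `resnet50_vd_imagenet_ssld` module, the transform names, and the `Trainer` signature are assumptions that track the demo code and may differ in other PaddleHub versions; consult the demo directory for the exact version.
```python
import paddle
import paddlehub as hub
import paddlehub.vision.transforms as T
from paddlehub.datasets import Flowers          # assumed dataset class from the demos
from paddlehub.finetune.trainer import Trainer  # assumed Trainer location in PaddleHub 2.x

# Preprocessing pipeline for the flower-classification demo dataset.
transforms = T.Compose(
    [T.Resize((256, 256)),
     T.CenterCrop(224),
     T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])],
    to_rgb=True)
flowers = Flowers(transforms)

# Load a pre-trained backbone and fine-tune it on the new label set.
model = hub.Module(name="resnet50_vd_imagenet_ssld",
                   label_list=["roses", "tulips", "daisy", "sunflowers", "dandelion"])
optimizer = paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters())
trainer = Trainer(model, optimizer, checkpoint_dir="img_classification_ckpt")
trainer.train(flowers, epochs=10, batch_size=32, eval_dataset=flowers, save_interval=1)
```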
<a name="许可证书"></a>
## License
This project is released under the <a href="./LICENSE">Apache 2.0 license</a>.
<a name="致谢"></a>
## Acknowledgements
Contributions to PaddleHub are very welcome, and your feedback is greatly appreciated.
* Many thanks to [Austendeng](https://github.com/Austendeng) for the PR that fixes SequenceLabelReader
* Many thanks to [cclauss](https://github.com/cclauss) for the PR that improves the travis-ci checks
* Many thanks to [奇想天外](http://www.cheerthink.com/) for contributing the mask detection demo
* Many thanks to [mhlwsk](https://github.com/mhlwsk) for the PR that fixes the sequence labeling prediction demo
* Many thanks to [zbp-xxxp](https://github.com/zbp-xxxp) for contributing the poem-from-image module
* Many thanks to [zbp-xxxp](https://github.com/zbp-xxxp) and [七年期限](https://github.com/1084667371) for jointly contributing the Mid-Autumn special edition of the poem-from-image module
* Many thanks to [livingbody](https://github.com/livingbody) for contributing the WeChat mini program for style transfer and Mid-Autumn poem-from-image, built on PaddleHub
## **For a better experience, see the official web documentation -> [[Speech Synthesis]](https://www.paddlepaddle.org.cn/hubdetail)**
### Speech synthesis
Text-to-speech (TTS) converts text into speech and is widely used in voice-interactive devices.
- Recommended models

| Model | Description |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| [Speech synthesis: transformer_tts_ljspeech](https://www.paddlepaddle.org.cn/hubdetail?name=transformer_tts_ljspeech&en_category=TextToSpeech) | TransformerTTS fuses Transformer and Tacotron2 with satisfying results. English TTS model, prediction only. |
| [Speech synthesis: fastspeech_ljspeech](https://www.paddlepaddle.org.cn/hubdetail?name=fastspeech_ljspeech&en_category=TextToSpeech) | FastSpeech predicts phoneme durations from the attention diagonal extracted from an encoder-decoder teacher model. English TTS model, prediction only. |
| [Speech synthesis: deepvoice3_ljspeech](https://www.paddlepaddle.org.cn/hubdetail?name=deepvoice3_ljspeech&en_category=TextToSpeech) | Deep Voice 3 is an end-to-end TTS model released by Baidu Research in 2017 (paper accepted at ICLR 2018); a seq2seq model based on convolutional networks and attention. English TTS model, prediction only. |
## **For a better experience, see the official web documentation -> [[Image Editing]](https://www.paddlepaddle.org.cn/hubdetail)**
### Image editing
Image editing takes an input image, edits and adjusts its pixels, and outputs a new target image. Typical applications include super resolution, black-and-white photo colorization, and old photo restoration. A short usage sketch follows the table.
- Featured models

| Model | Description |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| [Super resolution](https://www.paddlepaddle.org.cn/hubdetail?name=realsr&en_category=ImageEditing) | A super-resolution model for images and videos that upscales the input by 4x. |
| [Black-and-white image colorization](https://www.paddlepaddle.org.cn/hubdetail?name=deoldify&en_category=ImageEditing) | DeOldify is a colorization model for images and videos that restores color to black-and-white photos and footage. |
| [Old photo restoration](https://www.paddlepaddle.org.cn/hubdetail?name=photo_restoration&en_category=ImageEditing) | A model for restoring old photos, composed of two parts: colorization and super resolution. |
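A minimal Python sketch for the colorization module; `predict()` and its behavior follow the deoldify module documentation and are assumptions for other versions.
```python
import paddlehub as hub

# Colorize a black-and-white photo with deoldify; predict() follows the module
# documentation and writes the colorized result to an output directory.
model = hub.Module(name="deoldify")
model.predict("old_photo.jpg")
```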
@@ -40,7 +40,6 @@ class UserGuidedColorization(nn.Layer):
        load_checkpoint (str): Pretrained checkpoint path.
    """
    def __init__(self, use_tanh: bool = True, load_checkpoint: str = None):
        super(UserGuidedColorization, self).__init__()
        self.input_nc = 4
@@ -119,8 +118,8 @@ class UserGuidedColorization(nn.Layer):
        )
        # Conv8
        model8up = (Conv2DTranspose(512, 256, kernel_size=4, stride=2, padding=1), )
        model3short8 = (Conv2D(256, 256, 3, 1, 1), )
        model8 = (
            nn.ReLU(),
            Conv2D(256, 256, 3, 1, 1),
@@ -131,20 +130,26 @@ class UserGuidedColorization(nn.Layer):
        )
        # Conv9
        model9up = (Conv2DTranspose(256, 128, kernel_size=4, stride=2, padding=1), )
        model2short9 = (Conv2D(
            128,
            128,
            3,
            1,
            1,
        ), )
        model9 = (nn.ReLU(), Conv2D(128, 128, 3, 1, 1), nn.ReLU(), nn.BatchNorm(128))
        # Conv10
        model10up = (Conv2DTranspose(128, 128, kernel_size=4, stride=2, padding=1), )
        model1short10 = (Conv2D(64, 128, 3, 1, 1), )
        model10 = (nn.ReLU(), Conv2D(128, 128, 3, 1, 1), nn.LeakyReLU(negative_slope=0.2))
        model_class = (Conv2D(256, 529, 1), )
        if use_tanh:
            model_out = (Conv2D(128, 2, 1, 1, 0, 1), nn.Tanh())
        else:
            model_out = (Conv2D(128, 2, 1, 1, 0, 1), )
        self.model1 = nn.Sequential(*model1)
        self.model2 = nn.Sequential(*model2)
...
## **For a better experience, see the official web documentation -> [[Image Generation]](https://www.paddlepaddle.org.cn/hubdetail)**
### Image generation
Image generation produces a target image from an input vector, which may be random noise or a user-specified condition vector. Typical applications include style transfer and image cartoonization. A short usage sketch follows the table.
- Featured models

| Model | Description |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| [Artistic style transfer](https://www.paddlepaddle.org.cn/hubdetail?name=stylepro_artistic&en_category=GANs) | Converts a given image into an arbitrary artistic style while faithfully preserving the semantic details of the content image and the style of the style image. |
| [Image cartoonization - Makoto Shinkai](https://www.paddlepaddle.org.cn/hubdetail?name=animegan_v2_shinkai_53&en_category=GANs) | AnimeGAN V2 style transfer model that converts the input image into the Makoto Shinkai anime style. |
| [Image cartoonization - Hayao Miyazaki](https://www.paddlepaddle.org.cn/hubdetail?name=animegan_v2_hayao_64&en_category=GANs) | AnimeGAN V2 style transfer model that converts the input image into the Hayao Miyazaki anime style. |
| [Image cartoonization - Satoshi Kon, Paprika](https://www.paddlepaddle.org.cn/hubdetail?name=animegan_v2_paprika_97&en_category=GANs) | AnimeGAN V2 style transfer model that converts the input image into the anime style of Satoshi Kon's Paprika. |
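For instance, the artistic style transfer module can be driven from Python roughly as follows; the `style_transfer()` argument layout follows the stylepro_artistic documentation and should be treated as an assumption for other module versions.
```python
import cv2
import paddlehub as hub

# Transfer the style of style.jpg onto content.jpg; the images/styles dict
# layout follows the stylepro_artistic docs and may differ across versions.
stylepro_artistic = hub.Module(name="stylepro_artistic")
result = stylepro_artistic.style_transfer(
    images=[{
        "content": cv2.imread("content.jpg"),
        "styles": [cv2.imread("style.jpg")],
    }],
    visualization=True)  # also saves the stylized image to an output directory
print(result)
```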
@@ -14,7 +14,6 @@ from paddlehub.module.cv_module import StyleTransferModule
class GramMatrix(nn.Layer):
    """Calculate gram matrix"""
    def forward(self, y):
        (b, ch, h, w) = y.shape
        features = y.reshape((b, ch, w * h))
@@ -25,7 +24,6 @@ class GramMatrix(nn.Layer):
class ConvLayer(nn.Layer):
    """Basic conv layer with reflection padding layer"""
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int, stride: int):
        super(ConvLayer, self).__init__()
        pad = int(np.floor(kernel_size / 2))
@@ -53,7 +51,6 @@ class UpsampleConvLayer(nn.Layer):
    Return:
        img(paddle.Tensor): UpsampleConvLayer output.
    """
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int, stride: int, upsample=None):
        super(UpsampleConvLayer, self).__init__()
        self.upsample = upsample
@@ -88,7 +85,6 @@ class Bottleneck(nn.Layer):
    Return:
        img(paddle.Tensor): Bottleneck output.
    """
    def __init__(self,
                 inplanes: int,
                 planes: int,
@@ -102,8 +98,8 @@ class Bottleneck(nn.Layer):
        self.residual_layer = nn.Conv2D(inplanes, planes * self.expansion, kernel_size=1, stride=stride)
        conv_block = (norm_layer(inplanes), nn.ReLU(), nn.Conv2D(inplanes, planes, kernel_size=1, stride=1),
                      norm_layer(planes), nn.ReLU(), ConvLayer(planes, planes, kernel_size=3, stride=stride),
                      norm_layer(planes), nn.ReLU(), nn.Conv2D(planes, planes * self.expansion, kernel_size=1,
                                                               stride=1))
        self.conv_block = nn.Sequential(*conv_block)
    def forward(self, x: paddle.Tensor):
@@ -129,12 +125,14 @@ class UpBottleneck(nn.Layer):
    Return:
        img(paddle.Tensor): UpBottleneck output.
    """
    def __init__(self, inplanes: int, planes: int, stride: int = 2, norm_layer: nn.Layer = nn.BatchNorm2D):
        super(UpBottleneck, self).__init__()
        self.expansion = 4
        self.residual_layer = UpsampleConvLayer(inplanes,
                                                planes * self.expansion,
                                                kernel_size=1,
                                                stride=1,
                                                upsample=stride)
        conv_block = []
        conv_block += [norm_layer(inplanes), nn.ReLU(), nn.Conv2D(inplanes, planes, kernel_size=1, stride=1)]
        conv_block += [
@@ -165,7 +163,6 @@ class Inspiration(nn.Layer):
    Return:
        img(paddle.Tensor): UpBottleneck output.
    """
    def __init__(self, C: int, B: int = 1):
        super(Inspiration, self).__init__()
@@ -182,8 +179,8 @@ class Inspiration(nn.Layer):
        self.P = paddle.bmm(self.weight.expand_as(self.G), self.G)
        x = paddle.bmm(
            self.P.transpose((0, 2, 1)).expand((X.shape[0], self.C, self.C)), X.reshape(
                (X.shape[0], X.shape[1], -1))).reshape(X.shape)
        return x
    def __repr__(self):
@@ -193,7 +190,6 @@ class Inspiration(nn.Layer):
class Vgg16(nn.Layer):
    """ First four layers from Vgg16."""
    def __init__(self):
        super(Vgg16, self).__init__()
        self.conv1_1 = nn.Conv2D(3, 64, kernel_size=3, stride=1, padding=1)
@@ -268,8 +264,12 @@ class MSGNet(nn.Layer):
    Return:
        img(paddle.Tensor): MSGNet output.
    """
    def __init__(self,
                 input_nc=3,
                 output_nc=3,
                 ngf=128,
                 n_blocks=6,
                 norm_layer=nn.InstanceNorm2D,
                 load_checkpoint=None):
        super(MSGNet, self).__init__()
        self.gram = GramMatrix()
...
# coding=utf-8
from paddle.fluid.initializer import Constant
from paddle.fluid.param_attr import ParamAttr
import paddle.fluid as fluid
def decoder_net():
x2paddle_22 = fluid.layers.create_parameter(dtype='float32',
shape=[4],
name='x2paddle_22',
attr='x2paddle_22',
default_initializer=Constant(0.0))
x2paddle_36 = fluid.layers.create_parameter(dtype='float32',
shape=[4],
name='x2paddle_36',
attr='x2paddle_36',
default_initializer=Constant(0.0))
x2paddle_44 = fluid.layers.create_parameter(dtype='float32',
shape=[4],
name='x2paddle_44',
attr='x2paddle_44',
default_initializer=Constant(0.0))
x2paddle_input_1 = fluid.layers.data(dtype='float32',
shape=[1, 512, 64, 64],
name='x2paddle_input_1',
append_batch_size=False)
x2paddle_19 = fluid.layers.pad2d(x2paddle_input_1,
pad_value=0.0,
mode='reflect',
paddings=[1, 1, 1, 1],
name='x2paddle_19')
x2paddle_20 = fluid.layers.conv2d(x2paddle_19,
num_filters=256,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_1',
name='x2paddle_20',
bias_attr='x2paddle_2')
x2paddle_21 = fluid.layers.relu(x2paddle_20, name='x2paddle_21')
x2paddle_23 = fluid.layers.resize_nearest(x2paddle_21, name='x2paddle_23', out_shape=[128, 128])
x2paddle_24 = fluid.layers.pad2d(x2paddle_23,
pad_value=0.0,
mode='reflect',
paddings=[1, 1, 1, 1],
name='x2paddle_24')
x2paddle_25 = fluid.layers.conv2d(x2paddle_24,
num_filters=256,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_3',
name='x2paddle_25',
bias_attr='x2paddle_4')
x2paddle_26 = fluid.layers.relu(x2paddle_25, name='x2paddle_26')
x2paddle_27 = fluid.layers.pad2d(x2paddle_26,
pad_value=0.0,
mode='reflect',
paddings=[1, 1, 1, 1],
name='x2paddle_27')
x2paddle_28 = fluid.layers.conv2d(x2paddle_27,
num_filters=256,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_5',
name='x2paddle_28',
bias_attr='x2paddle_6')
x2paddle_29 = fluid.layers.relu(x2paddle_28, name='x2paddle_29')
x2paddle_30 = fluid.layers.pad2d(x2paddle_29,
pad_value=0.0,
mode='reflect',
paddings=[1, 1, 1, 1],
name='x2paddle_30')
x2paddle_31 = fluid.layers.conv2d(x2paddle_30,
num_filters=256,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_7',
name='x2paddle_31',
bias_attr='x2paddle_8')
x2paddle_32 = fluid.layers.relu(x2paddle_31, name='x2paddle_32')
x2paddle_33 = fluid.layers.pad2d(x2paddle_32,
pad_value=0.0,
mode='reflect',
paddings=[1, 1, 1, 1],
name='x2paddle_33')
x2paddle_34 = fluid.layers.conv2d(x2paddle_33,
num_filters=128,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_9',
name='x2paddle_34',
bias_attr='x2paddle_10')
x2paddle_35 = fluid.layers.relu(x2paddle_34, name='x2paddle_35')
x2paddle_37 = fluid.layers.resize_nearest(x2paddle_35, name='x2paddle_37', out_shape=[256, 256])
x2paddle_38 = fluid.layers.pad2d(x2paddle_37,
pad_value=0.0,
mode='reflect',
paddings=[1, 1, 1, 1],
name='x2paddle_38')
x2paddle_39 = fluid.layers.conv2d(x2paddle_38,
num_filters=128,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_11',
name='x2paddle_39',
bias_attr='x2paddle_12')
x2paddle_40 = fluid.layers.relu(x2paddle_39, name='x2paddle_40')
x2paddle_41 = fluid.layers.pad2d(x2paddle_40,
pad_value=0.0,
mode='reflect',
paddings=[1, 1, 1, 1],
name='x2paddle_41')
x2paddle_42 = fluid.layers.conv2d(x2paddle_41,
num_filters=64,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_13',
name='x2paddle_42',
bias_attr='x2paddle_14')
x2paddle_43 = fluid.layers.relu(x2paddle_42, name='x2paddle_43')
x2paddle_45 = fluid.layers.resize_nearest(x2paddle_43, name='x2paddle_45', out_shape=[512, 512])
x2paddle_46 = fluid.layers.pad2d(x2paddle_45,
pad_value=0.0,
mode='reflect',
paddings=[1, 1, 1, 1],
name='x2paddle_46')
x2paddle_47 = fluid.layers.conv2d(x2paddle_46,
num_filters=64,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_15',
name='x2paddle_47',
bias_attr='x2paddle_16')
x2paddle_48 = fluid.layers.relu(x2paddle_47, name='x2paddle_48')
x2paddle_49 = fluid.layers.pad2d(x2paddle_48,
pad_value=0.0,
mode='reflect',
paddings=[1, 1, 1, 1],
name='x2paddle_49')
x2paddle_50 = fluid.layers.conv2d(x2paddle_49,
num_filters=3,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_17',
name='x2paddle_50',
bias_attr='x2paddle_18')
return x2paddle_input_1, x2paddle_50
# coding=utf-8
from paddle.fluid.initializer import Constant
from paddle.fluid.param_attr import ParamAttr
import paddle.fluid as fluid
def encoder_net():
x2paddle_0 = fluid.layers.data(dtype='float32', shape=[1, 3, 512, 512], name='x2paddle_0', append_batch_size=False)
x2paddle_21 = fluid.layers.conv2d(x2paddle_0,
num_filters=3,
filter_size=[1, 1],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_1',
name='x2paddle_21',
bias_attr='x2paddle_2')
x2paddle_22 = fluid.layers.pad2d(x2paddle_21,
pad_value=0.0,
mode='reflect',
paddings=[1, 1, 1, 1],
name='x2paddle_22')
x2paddle_23 = fluid.layers.conv2d(x2paddle_22,
num_filters=64,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_3',
name='x2paddle_23',
bias_attr='x2paddle_4')
x2paddle_24 = fluid.layers.relu(x2paddle_23, name='x2paddle_24')
x2paddle_25 = fluid.layers.pad2d(x2paddle_24,
pad_value=0.0,
mode='reflect',
paddings=[1, 1, 1, 1],
name='x2paddle_25')
x2paddle_26 = fluid.layers.conv2d(x2paddle_25,
num_filters=64,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_5',
name='x2paddle_26',
bias_attr='x2paddle_6')
x2paddle_27 = fluid.layers.relu(x2paddle_26, name='x2paddle_27')
x2paddle_28 = fluid.layers.pool2d(x2paddle_27,
pool_size=[2, 2],
pool_type='max',
pool_stride=[2, 2],
pool_padding=[0, 0],
ceil_mode=False,
name='x2paddle_28',
exclusive=False)
x2paddle_29 = fluid.layers.pad2d(x2paddle_28,
pad_value=0.0,
mode='reflect',
paddings=[1, 1, 1, 1],
name='x2paddle_29')
x2paddle_30 = fluid.layers.conv2d(x2paddle_29,
num_filters=128,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_7',
name='x2paddle_30',
bias_attr='x2paddle_8')
x2paddle_31 = fluid.layers.relu(x2paddle_30, name='x2paddle_31')
x2paddle_32 = fluid.layers.pad2d(x2paddle_31,
pad_value=0.0,
mode='reflect',
paddings=[1, 1, 1, 1],
name='x2paddle_32')
x2paddle_33 = fluid.layers.conv2d(x2paddle_32,
num_filters=128,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_9',
name='x2paddle_33',
bias_attr='x2paddle_10')
x2paddle_34 = fluid.layers.relu(x2paddle_33, name='x2paddle_34')
x2paddle_35 = fluid.layers.pool2d(x2paddle_34,
pool_size=[2, 2],
pool_type='max',
pool_stride=[2, 2],
pool_padding=[0, 0],
ceil_mode=False,
name='x2paddle_35',
exclusive=False)
x2paddle_36 = fluid.layers.pad2d(x2paddle_35,
pad_value=0.0,
mode='reflect',
paddings=[1, 1, 1, 1],
name='x2paddle_36')
x2paddle_37 = fluid.layers.conv2d(x2paddle_36,
num_filters=256,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_11',
name='x2paddle_37',
bias_attr='x2paddle_12')
x2paddle_38 = fluid.layers.relu(x2paddle_37, name='x2paddle_38')
x2paddle_39 = fluid.layers.pad2d(x2paddle_38,
pad_value=0.0,
mode='reflect',
paddings=[1, 1, 1, 1],
name='x2paddle_39')
x2paddle_40 = fluid.layers.conv2d(x2paddle_39,
num_filters=256,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_13',
name='x2paddle_40',
bias_attr='x2paddle_14')
x2paddle_41 = fluid.layers.relu(x2paddle_40, name='x2paddle_41')
x2paddle_42 = fluid.layers.pad2d(x2paddle_41,
pad_value=0.0,
mode='reflect',
paddings=[1, 1, 1, 1],
name='x2paddle_42')
x2paddle_43 = fluid.layers.conv2d(x2paddle_42,
num_filters=256,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_15',
name='x2paddle_43',
bias_attr='x2paddle_16')
x2paddle_44 = fluid.layers.relu(x2paddle_43, name='x2paddle_44')
x2paddle_45 = fluid.layers.pad2d(x2paddle_44,
pad_value=0.0,
mode='reflect',
paddings=[1, 1, 1, 1],
name='x2paddle_45')
x2paddle_46 = fluid.layers.conv2d(x2paddle_45,
num_filters=256,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_17',
name='x2paddle_46',
bias_attr='x2paddle_18')
x2paddle_47 = fluid.layers.relu(x2paddle_46, name='x2paddle_47')
x2paddle_48 = fluid.layers.pool2d(x2paddle_47,
pool_size=[2, 2],
pool_type='max',
pool_stride=[2, 2],
pool_padding=[0, 0],
ceil_mode=False,
name='x2paddle_48',
exclusive=False)
x2paddle_49 = fluid.layers.pad2d(x2paddle_48,
pad_value=0.0,
mode='reflect',
paddings=[1, 1, 1, 1],
name='x2paddle_49')
x2paddle_50 = fluid.layers.conv2d(x2paddle_49,
num_filters=512,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_19',
name='x2paddle_50',
bias_attr='x2paddle_20')
x2paddle_51 = fluid.layers.relu(x2paddle_50, name='x2paddle_51')
return x2paddle_0, x2paddle_51
@@ -140,8 +140,7 @@ class StyleProjection(hub.Module):
            encode_program, encode_feeded_var_names, encode_target_vars = fluid.io.load_inference_model(
                dirname=self.pretrained_encoder_net, executor=exe)
            fluid.io.save_inference_model(dirname=dirname,
                                          main_program=encode_program,
                                          executor=exe,
                                          feeded_var_names=encode_feeded_var_names,
@@ -159,8 +158,7 @@ class StyleProjection(hub.Module):
            decode_program, decode_feeded_var_names, decode_target_vars = fluid.io.load_inference_model(
                dirname=self.pretrained_decoder_net, executor=exe)
            fluid.io.save_inference_model(dirname=dirname,
                                          main_program=decode_program,
                                          executor=exe,
                                          feeded_var_names=decode_feeded_var_names,
@@ -186,8 +184,7 @@ class StyleProjection(hub.Module):
        """
        Run as a command.
        """
        self.parser = argparse.ArgumentParser(description="Run the {} module.".format(self.name),
                                              prog='hub run {}'.format(self.name),
                                              usage='%(prog)s',
                                              add_help=True)
@@ -202,20 +199,29 @@ class StyleProjection(hub.Module):
            paths = [{'content': args.content, 'styles': args.styles.split(',')}]
        else:
            paths = [{'content': args.content, 'styles': args.styles.split(','), 'weights': list(args.weights)}]
        results = self.style_transfer(paths=paths,
                                      alpha=args.alpha,
                                      use_gpu=args.use_gpu,
                                      output_dir=args.output_dir,
                                      visualization=True)
        return results
    def add_module_config_arg(self):
        """
        Add the command config options.
        """
        self.arg_config_group.add_argument('--use_gpu',
                                           type=ast.literal_eval,
                                           default=False,
                                           help="whether use GPU or not")
        self.arg_config_group.add_argument('--output_dir',
                                           type=str,
                                           default='transfer_result',
                                           help="The directory to save output images.")
        self.arg_config_group.add_argument('--visualization',
                                           type=ast.literal_eval,
                                           default=True,
                                           help="whether to save output as images.")
    def add_module_input_arg(self):
        """
@@ -223,7 +229,11 @@ class StyleProjection(hub.Module):
        """
        self.arg_input_group.add_argument('--content', type=str, help="path to content.")
        self.arg_input_group.add_argument('--styles', type=str, help="path to styles.")
        self.arg_input_group.add_argument('--weights',
                                          type=ast.literal_eval,
                                          default=None,
                                          help="interpolation weights of styles.")
        self.arg_config_group.add_argument('--alpha',
                                           type=ast.literal_eval,
                                           default=1,
                                           help="The parameter to control the tranform degree.")
## **For a better experience, see the official web documentation -> [[Image Classification]](https://www.paddlepaddle.org.cn/hubdetail)**
### Image classification
Image classification distinguishes images of different categories by their semantic content. It is a fundamental problem in computer vision and the basis of higher-level vision tasks such as object detection, image segmentation, object tracking, behavior analysis, and face recognition. It is widely used, for example, in face recognition and intelligent video analysis for security, traffic scene recognition, content-based image retrieval and automatic photo album organization on the internet, and medical image recognition. A short usage sketch follows the model tables.
**Note:** experienced developers can pick any model as needed; **newcomers should prefer ResNet50 on the server side and MobileNetV3 on mobile.**
- Featured models

| | **Model** | **Highlights** |
| ---------- | :----------------------------------------------------------- | ---------------------------------------------------------- |
| Image classification | [Dish recognition](https://www.paddlepaddle.org.cn/hubdetail?name=resnet50_vd_dishes&en_category=ImageClassification) | Trained on a private dataset, recognizes 8,416 dishes; a good base for further fine-tuning on food data |
| Image classification | [Animal recognition](https://www.paddlepaddle.org.cn/hubdetail?name=resnet50_vd_animals&en_category=ImageClassification) | Trained on a private dataset, recognizes 7,978 animals; a good base for further fine-tuning on animal data |
| Image classification | [Wildlife product recognition](https://www.paddlepaddle.org.cn/hubdetail?name=resnet50_vd_wildanimals&en_category=ImageClassification) | Recognizes ten labels: 'ivory product', 'ivory', 'elephant', 'tiger skin', 'tiger', 'tiger tooth/claw/bone', 'pangolin scale', 'pangolin', 'pangolin claw', and 'other'. |
- More models

| **Model** | **Description** |
| - | - |
| [AlexNet](https://www.paddlepaddle.org.cn/hubdetail?name=alexnet_imagenet&en_category=ImageClassification) | The first CNN to successfully apply ReLU, Dropout, and LRN, and to use GPUs for acceleration |
| [VGG19](https://www.paddlepaddle.org.cn/hubdetail?name=vgg19_imagenet&en_category=ImageClassification) | Builds on AlexNet with small 3x3 convolution kernels and a deeper network, giving strong generalization |
| [GoogLeNet](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleCV/image_classification) | Increases network depth and width without increasing the computational load, yielding better performance |
| [ResNet50](https://www.paddlepaddle.org.cn/hubdetail?name=resnet_v2_50_imagenet&en_category=ImageClassification) | Residual Network; its residual blocks solve the accuracy degradation that comes with deeper networks |
| [Inceptionv4](https://www.paddlepaddle.org.cn/hubdetail?name=inception_v4_imagenet&en_category=ImageClassification) | Combines the Inception module with residual connections, greatly speeding up training and improving performance |
| [MobileNetV2](https://www.paddlepaddle.org.cn/hubdetail?name=mobilenet_v2_imagenet&en_category=ImageClassification) | A refinement of MobileNet; skip connections directly on the thinner bottleneck layers and no ReLU non-linearity on the bottleneck layer give better results |
| [se_resnext50](https://www.paddlepaddle.org.cn/hubdetail?name=se_resnext50_32x4d_imagenet&en_category=ImageClassification) | Adds SE (Squeeze-and-Excitation) modules on top of ResNeXt, improving accuracy; won first place in the ILSVRC 2017 classification task |
| [ShuffleNetV2](https://www.paddlepaddle.org.cn/hubdetail?name=shufflenet_v2_imagenet&en_category=ImageClassification) | ECCV 2018; a lightweight CNN that balances speed and accuracy well. At the same complexity it is more accurate than ShuffleNet and MobileNetV2, making it well suited to mobile and autonomous driving use |
| [efficientNetb7](https://www.paddlepaddle.org.cn/hubdetail?name=efficientnetb7_imagenet&en_category=ImageClassification) | Scales model resolution, width, and depth jointly, reaching SOTA accuracy with very few parameters. |
| [xception71](https://www.paddlepaddle.org.cn/hubdetail?name=xception71_imagenet&en_category=ImageClassification) | An improvement on Inception-v3 that replaces standard convolutions with depthwise separable convolutions, reducing parameters while improving accuracy. |
| [dpn107](https://www.paddlepaddle.org.cn/hubdetail?name=dpn107_imagenet&en_category=ImageClassification) | Combines the strengths of DenseNet and ResNeXt. |
| [DarkNet53](https://www.paddlepaddle.org.cn/hubdetail?name=darknet53_imagenet&en_category=ImageClassification) | The backbone used by the YOLOv3 detection framework; performs well on both classification and detection. |
| [DenseNet161](https://www.paddlepaddle.org.cn/hubdetail?name=densenet161_imagenet&en_category=ImageClassification) | Introduces densely connected blocks that improve information flow. |
| [ResNeXt152_vd](https://www.paddlepaddle.org.cn/hubdetail?name=resnext152_64x4d_imagenet&en_category=ImageClassification) | Introduces cardinality as an additional measure of model complexity, effectively improving accuracy. |
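A minimal classification sketch with one of the featured models; the `classification()` call follows the module documentation and is an assumption for other versions.
```python
import cv2
import paddlehub as hub

# Identify a dish with resnet50_vd_dishes; classification() and its arguments
# follow the module documentation and may differ across PaddleHub versions.
classifier = hub.Module(name="resnet50_vd_dishes")
result = classifier.classification(images=[cv2.imread("dish.jpg")])
print(result)  # a list of {label: score} predictions, one entry per input image
```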
## **For a better experience, see the official web documentation -> [[Face Detection]](https://www.paddlepaddle.org.cn/hubdetail)**
### Face detection
Face detection is an important branch of object detection; driven by the security market, face recognition, and face safety needs of recent years, it has become one of the most important detection tasks. A short usage sketch follows the table.
- Recommended models

| Model | Description |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| [Face detection](https://www.paddlepaddle.org.cn/hubdetail?name=pyramidbox_lite_server&en_category=FaceDetection) | Developed by Baidu; the **winning model** on the WIDER Face dataset in March 2018. |
| [Ultra-lightweight face detection](https://www.paddlepaddle.org.cn/hubdetail?name=ultra_light_fast_generic_face_detector_1mb_640&en_category=FaceDetection) | A real-time, ultra-lightweight, general-purpose face detection model designed for edge or low-compute devices (e.g. ARM inference), enabling real-time face detection in general scenes on such hardware. |
| [Masked face detection and recognition](https://www.paddlepaddle.org.cn/hubdetail?name=pyramidbox_lite_server_mask&en_category=FaceDetection) | The industry's **first open-source masked face detection and recognition model**; it attracted wide attention. |
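For example, the mask detection model can be called roughly as below; the `face_detection()` signature follows the pyramidbox_lite_server_mask documentation and is an assumption for other versions.
```python
import cv2
import paddlehub as hub

# Detect faces and whether each one wears a mask; face_detection() follows the
# module documentation and may differ across PaddleHub versions.
mask_detector = hub.Module(name="pyramidbox_lite_server_mask")
result = mask_detector.face_detection(images=[cv2.imread("crowd.jpg")], visualization=True)
print(result)  # bounding boxes plus a MASK / NO MASK label per detected face
```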
## **For a better experience, see the official web documentation -> [[Keypoint Detection]](https://www.paddlepaddle.org.cn/hubdetail)**
### Keypoint detection
Human pose estimation detects keypoints of the human body, such as joints and facial features, and describes the skeleton through them. It is essential for describing human pose and predicting human behavior, and underlies many computer vision tasks such as action classification, abnormal behavior detection, and autonomous driving. A short usage sketch follows the table.
- Featured models

| Model | Description |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| [Single-person pose estimation](https://www.paddlepaddle.org.cn/hubdetail?name=human_pose_estimation_resnet50_mpii&en_category=KeyPointDetection) | Useful for action recognition, person tracking, gait recognition, and related fields. Applications include intelligent video surveillance, patient monitoring, human-computer interaction, virtual reality, human animation, smart home, security, and athlete training assistance. |
| [Multi-person pose estimation](https://www.paddlepaddle.org.cn/hubdetail?name=openpose_body_estimation&en_category=KeyPointDetection) | Useful for action recognition, person tracking, gait recognition, and related fields, with the same application areas as above. |
| [Facial landmark detection](https://www.paddlepaddle.org.cn/hubdetail?name=face_landmark_localization&en_category=KeyPointDetection) | Useful for face recognition, expression analysis, 3D face reconstruction, 3D animation, and other face-related problems; supports multiple faces in the same image |
| [Hand keypoint detection](https://www.paddlepaddle.org.cn/hubdetail?name=hand_pose_localization&en_category=KeyPointDetection) | Useful for gesture recognition and, combined with body keypoints, for abnormal behavior detection and other scenarios |
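A minimal single-person pose sketch; the `keypoint_detection()` method name is taken from the module documentation and should be treated as an assumption for other versions.
```python
import cv2
import paddlehub as hub

# Estimate body keypoints for a single person; keypoint_detection() is the
# documented entry point assumed here and may differ across versions.
pose_estimator = hub.Module(name="human_pose_estimation_resnet50_mpii")
result = pose_estimator.keypoint_detection(images=[cv2.imread("person.jpg")], visualization=True)
print(result)  # coordinates for each detected joint
```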
## **For a better experience, see the official web documentation -> [[Object Detection]](https://www.paddlepaddle.org.cn/hubdetail)**
### Object detection
Given an image or a video frame, object detection asks the computer to locate every object in it and assign each a category. To a computer, what it "sees" are the numbers an image is encoded into; it is hard to grasp high-level semantic concepts such as a person or an object appearing in the frame, and harder still to locate where in the image that object is. A short usage sketch follows the table.
- Featured models

| Model | Description |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| [YOLOv3](https://www.paddlepaddle.org.cn/hubdetail?name=yolov3_darknet53_coco2017&en_category=ObjectDetection) | Accuracy improved by **5.9 absolute percentage points** over the original author's implementation, with heavily optimized performance. |
| [Pedestrian detection](https://www.paddlepaddle.org.cn/hubdetail?name=yolov3_darknet53_pedestrian&en_category=ObjectDetection) | A Baidu in-house model trained on a large private dataset; applicable to intelligent video surveillance, human behavior analysis, passenger-flow counting, intelligent transportation, and more |
| [Vehicle detection](https://www.paddlepaddle.org.cn/hubdetail?name=yolov3_darknet53_vehicles&en_category=ObjectDetection) | A Baidu in-house model that recognizes vehicle types including car, truck, bus, motorbike, and tricycle |
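A minimal detection sketch; the `object_detection()` call follows the yolov3_darknet53_coco2017 documentation and is an assumption for other versions.
```python
import cv2
import paddlehub as hub

# General-purpose detection with YOLOv3; object_detection() follows the module
# documentation and may differ across PaddleHub versions.
detector = hub.Module(name="yolov3_darknet53_coco2017")
result = detector.object_detection(images=[cv2.imread("street.jpg")], visualization=True)
print(result)  # per image: a list of {label, confidence, left, top, right, bottom}
```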
## **For a better experience, see the official web documentation -> [[Image Segmentation]](https://www.paddlepaddle.org.cn/hubdetail)**
### Image segmentation
Semantic image segmentation, as the name suggests, groups/segments image pixels according to their semantic meaning. Image semantics refers to understanding the image content, e.g. describing which objects are doing what and where; segmentation means labeling every pixel in the image with the category it belongs to. In recent years it has been used in autonomous driving to segment street scenes and avoid pedestrians and vehicles, and in medical image analysis to assist diagnosis. A short usage sketch follows the table.
- Featured models

| Model | Description |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| [Portrait segmentation](https://www.paddlepaddle.org.cn/hubdetail?name=deeplabv3p_xception65_humanseg&en_category=ImageSegmentation) | Trained on Baidu's **own dataset**; excellent portrait segmentation quality. |
| [Human parsing](https://www.paddlepaddle.org.cn/hubdetail?name=ace2p&en_category=ImageSegmentation) | **Triple champion** of the CVPR 2019 LIP challenge. A must-have for human parsing tasks. |
| [Pneumonia CT analysis](https://www.paddlepaddle.org.cn/hubdetail?name=Pneumonia_CT_LKM_PP&en_category=ImageSegmentation) | Helped 连心医疗 open-source the **industry's first** pneumonia CT image analysis model |
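A minimal portrait segmentation sketch; the `segmentation()` call follows the module documentation and is an assumption for other versions.
```python
import cv2
import paddlehub as hub

# Separate a portrait from its background; segmentation() follows the
# deeplabv3p_xception65_humanseg documentation and may differ across versions.
human_seg = hub.Module(name="deeplabv3p_xception65_humanseg")
result = human_seg.segmentation(images=[cv2.imread("portrait.jpg")], visualization=True)
# result holds the per-pixel mask; visualization=True also saves a matted image
```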
# coding=utf-8
from paddle.fluid.initializer import Constant
from paddle.fluid.param_attr import ParamAttr
import paddle.fluid as fluid
def decoder_net():
x2paddle_22 = fluid.layers.create_parameter(
dtype='float32', shape=[4], name='x2paddle_22', attr='x2paddle_22', default_initializer=Constant(0.0))
x2paddle_36 = fluid.layers.create_parameter(
dtype='float32', shape=[4], name='x2paddle_36', attr='x2paddle_36', default_initializer=Constant(0.0))
x2paddle_44 = fluid.layers.create_parameter(
dtype='float32', shape=[4], name='x2paddle_44', attr='x2paddle_44', default_initializer=Constant(0.0))
x2paddle_input_1 = fluid.layers.data(
dtype='float32', shape=[1, 512, 64, 64], name='x2paddle_input_1', append_batch_size=False)
x2paddle_19 = fluid.layers.pad2d(
x2paddle_input_1, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_19')
x2paddle_20 = fluid.layers.conv2d(
x2paddle_19,
num_filters=256,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_1',
name='x2paddle_20',
bias_attr='x2paddle_2')
x2paddle_21 = fluid.layers.relu(x2paddle_20, name='x2paddle_21')
x2paddle_23 = fluid.layers.resize_nearest(x2paddle_21, name='x2paddle_23', out_shape=[128, 128])
x2paddle_24 = fluid.layers.pad2d(
x2paddle_23, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_24')
x2paddle_25 = fluid.layers.conv2d(
x2paddle_24,
num_filters=256,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_3',
name='x2paddle_25',
bias_attr='x2paddle_4')
x2paddle_26 = fluid.layers.relu(x2paddle_25, name='x2paddle_26')
x2paddle_27 = fluid.layers.pad2d(
x2paddle_26, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_27')
x2paddle_28 = fluid.layers.conv2d(
x2paddle_27,
num_filters=256,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_5',
name='x2paddle_28',
bias_attr='x2paddle_6')
x2paddle_29 = fluid.layers.relu(x2paddle_28, name='x2paddle_29')
x2paddle_30 = fluid.layers.pad2d(
x2paddle_29, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_30')
x2paddle_31 = fluid.layers.conv2d(
x2paddle_30,
num_filters=256,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_7',
name='x2paddle_31',
bias_attr='x2paddle_8')
x2paddle_32 = fluid.layers.relu(x2paddle_31, name='x2paddle_32')
x2paddle_33 = fluid.layers.pad2d(
x2paddle_32, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_33')
x2paddle_34 = fluid.layers.conv2d(
x2paddle_33,
num_filters=128,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_9',
name='x2paddle_34',
bias_attr='x2paddle_10')
x2paddle_35 = fluid.layers.relu(x2paddle_34, name='x2paddle_35')
x2paddle_37 = fluid.layers.resize_nearest(x2paddle_35, name='x2paddle_37', out_shape=[256, 256])
x2paddle_38 = fluid.layers.pad2d(
x2paddle_37, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_38')
x2paddle_39 = fluid.layers.conv2d(
x2paddle_38,
num_filters=128,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_11',
name='x2paddle_39',
bias_attr='x2paddle_12')
x2paddle_40 = fluid.layers.relu(x2paddle_39, name='x2paddle_40')
x2paddle_41 = fluid.layers.pad2d(
x2paddle_40, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_41')
x2paddle_42 = fluid.layers.conv2d(
x2paddle_41,
num_filters=64,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_13',
name='x2paddle_42',
bias_attr='x2paddle_14')
x2paddle_43 = fluid.layers.relu(x2paddle_42, name='x2paddle_43')
x2paddle_45 = fluid.layers.resize_nearest(x2paddle_43, name='x2paddle_45', out_shape=[512, 512])
x2paddle_46 = fluid.layers.pad2d(
x2paddle_45, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_46')
x2paddle_47 = fluid.layers.conv2d(
x2paddle_46,
num_filters=64,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_15',
name='x2paddle_47',
bias_attr='x2paddle_16')
x2paddle_48 = fluid.layers.relu(x2paddle_47, name='x2paddle_48')
x2paddle_49 = fluid.layers.pad2d(
x2paddle_48, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_49')
x2paddle_50 = fluid.layers.conv2d(
x2paddle_49,
num_filters=3,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_17',
name='x2paddle_50',
bias_attr='x2paddle_18')
return x2paddle_input_1, x2paddle_50
# coding=utf-8
from paddle.fluid.initializer import Constant
from paddle.fluid.param_attr import ParamAttr
import paddle.fluid as fluid
def encoder_net():
x2paddle_0 = fluid.layers.data(dtype='float32', shape=[1, 3, 512, 512], name='x2paddle_0', append_batch_size=False)
x2paddle_21 = fluid.layers.conv2d(
x2paddle_0,
num_filters=3,
filter_size=[1, 1],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_1',
name='x2paddle_21',
bias_attr='x2paddle_2')
x2paddle_22 = fluid.layers.pad2d(
x2paddle_21, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_22')
x2paddle_23 = fluid.layers.conv2d(
x2paddle_22,
num_filters=64,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_3',
name='x2paddle_23',
bias_attr='x2paddle_4')
x2paddle_24 = fluid.layers.relu(x2paddle_23, name='x2paddle_24')
x2paddle_25 = fluid.layers.pad2d(
x2paddle_24, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_25')
x2paddle_26 = fluid.layers.conv2d(
x2paddle_25,
num_filters=64,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_5',
name='x2paddle_26',
bias_attr='x2paddle_6')
x2paddle_27 = fluid.layers.relu(x2paddle_26, name='x2paddle_27')
x2paddle_28 = fluid.layers.pool2d(
x2paddle_27,
pool_size=[2, 2],
pool_type='max',
pool_stride=[2, 2],
pool_padding=[0, 0],
ceil_mode=False,
name='x2paddle_28',
exclusive=False)
x2paddle_29 = fluid.layers.pad2d(
x2paddle_28, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_29')
x2paddle_30 = fluid.layers.conv2d(
x2paddle_29,
num_filters=128,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_7',
name='x2paddle_30',
bias_attr='x2paddle_8')
x2paddle_31 = fluid.layers.relu(x2paddle_30, name='x2paddle_31')
x2paddle_32 = fluid.layers.pad2d(
x2paddle_31, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_32')
x2paddle_33 = fluid.layers.conv2d(
x2paddle_32,
num_filters=128,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_9',
name='x2paddle_33',
bias_attr='x2paddle_10')
x2paddle_34 = fluid.layers.relu(x2paddle_33, name='x2paddle_34')
x2paddle_35 = fluid.layers.pool2d(
x2paddle_34,
pool_size=[2, 2],
pool_type='max',
pool_stride=[2, 2],
pool_padding=[0, 0],
ceil_mode=False,
name='x2paddle_35',
exclusive=False)
x2paddle_36 = fluid.layers.pad2d(
x2paddle_35, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_36')
x2paddle_37 = fluid.layers.conv2d(
x2paddle_36,
num_filters=256,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_11',
name='x2paddle_37',
bias_attr='x2paddle_12')
x2paddle_38 = fluid.layers.relu(x2paddle_37, name='x2paddle_38')
x2paddle_39 = fluid.layers.pad2d(
x2paddle_38, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_39')
x2paddle_40 = fluid.layers.conv2d(
x2paddle_39,
num_filters=256,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_13',
name='x2paddle_40',
bias_attr='x2paddle_14')
x2paddle_41 = fluid.layers.relu(x2paddle_40, name='x2paddle_41')
x2paddle_42 = fluid.layers.pad2d(
x2paddle_41, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_42')
x2paddle_43 = fluid.layers.conv2d(
x2paddle_42,
num_filters=256,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_15',
name='x2paddle_43',
bias_attr='x2paddle_16')
x2paddle_44 = fluid.layers.relu(x2paddle_43, name='x2paddle_44')
x2paddle_45 = fluid.layers.pad2d(
x2paddle_44, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_45')
x2paddle_46 = fluid.layers.conv2d(
x2paddle_45,
num_filters=256,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_17',
name='x2paddle_46',
bias_attr='x2paddle_18')
x2paddle_47 = fluid.layers.relu(x2paddle_46, name='x2paddle_47')
x2paddle_48 = fluid.layers.pool2d(
x2paddle_47,
pool_size=[2, 2],
pool_type='max',
pool_stride=[2, 2],
pool_padding=[0, 0],
ceil_mode=False,
name='x2paddle_48',
exclusive=False)
x2paddle_49 = fluid.layers.pad2d(
x2paddle_48, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_49')
x2paddle_50 = fluid.layers.conv2d(
x2paddle_49,
num_filters=512,
filter_size=[3, 3],
stride=[1, 1],
padding=[0, 0],
dilation=[1, 1],
groups=1,
param_attr='x2paddle_19',
name='x2paddle_50',
bias_attr='x2paddle_20')
x2paddle_51 = fluid.layers.relu(x2paddle_50, name='x2paddle_51')
return x2paddle_0, x2paddle_51
## **For a better experience, see the official web documentation -> [[Text Recognition]](https://www.paddlepaddle.org.cn/hubdetail)**
### Text recognition
Text recognition (OCR) is one of the key tasks in computer vision: it extracts text information from images and has significant practical value to industry. A short Python sketch follows the table.
- Recommended models

| Model | Description |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| [Ultra-lightweight Chinese/English OCR](https://www.paddlepaddle.org.cn/hubdetail?name=chinese_ocr_db_crnn_mobile&en_category=TextRecognition) | The smallest open-source model in the industry: an 8.1 MB ultra-lightweight Chinese/English recognition model. Supports Chinese and English, and recognizes tilted, vertical, and other text orientations. **Strongly recommended** |
| [High-accuracy Chinese/English OCR](https://www.paddlepaddle.org.cn/hubdetail?name=chinese_ocr_db_crnn_mobile&en_category=TextRecognition) | The best-performing open-source model in the industry: a 155 MB high-accuracy Chinese/English recognition model. Supports Chinese and English, and recognizes tilted, vertical, and other text orientations. **Strongly recommended** |
| [German ultra-lightweight OCR](https://www.paddlepaddle.org.cn/hubdetail?name=german_ocr_db_crnn_mobile&en_category=TextRecognition) | German OCR, ultra-lightweight |
| [French ultra-lightweight OCR](https://www.paddlepaddle.org.cn/hubdetail?name=french_ocr_db_crnn_mobile&en_category=TextRecognition) | French OCR, ultra-lightweight |
| [Japanese ultra-lightweight OCR](https://www.paddlepaddle.org.cn/hubdetail?name=japan_ocr_db_crnn_mobile&en_category=TextRecognition) | Japanese OCR, ultra-lightweight |
| [Korean ultra-lightweight OCR](https://www.paddlepaddle.org.cn/hubdetail?name=korean_ocr_db_crnn_mobile&en_category=TextRecognition) | Korean OCR, ultra-lightweight |
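Beyond the `hub run` command shown earlier, the OCR model can be driven from Python; `recognize_text()` follows the chinese_ocr_db_crnn_mobile documentation and is an assumption for other versions.
```python
import cv2
import paddlehub as hub

# Recognize Chinese/English text in an image; recognize_text() follows the
# module documentation and may differ across PaddleHub versions.
ocr = hub.Module(name="chinese_ocr_db_crnn_mobile")
result = ocr.recognize_text(images=[cv2.imread("test_ocr.jpg")], visualization=True)
print(result)  # recognized text plus box coordinates and confidence per line
```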
## **For a better experience, see the official web documentation -> [[Language Models]](https://www.paddlepaddle.org.cn/hubdetail)**
### Language models
- Recommended models (a short similarity sketch follows the table)

| Model | Description |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| [Word embedding model](https://www.paddlepaddle.org.cn/hubdetail?name=word2vec_skipgram&en_category=SemanticModel) | Chinese word embeddings pre-trained on a massive Baidu Search dataset; supports Fine-tune. The Word2vec pre-training vocabulary holds 1,700,249 words with an embedding dimension of 128. |
| [Text similarity](https://www.paddlepaddle.org.cn/hubdetail?name=simnet_bow&en_category=SemanticModel) | Computes a similarity score for two input texts. |
| [ERNIE](https://www.paddlepaddle.org.cn/hubdetail?name=ERNIE&en_category=SemanticModel) | A Baidu in-house model trained on Chinese corpora including encyclopedia, news, and forum dialogue data; usable for text classification, sequence labeling, reading comprehension, and other tasks. |
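A sketch of the text-similarity model; the paired-texts argument layout of `similarity()` follows one documented form of the simnet_bow API and may need to be adapted to your PaddleHub version.
```python
import paddlehub as hub

# Score the similarity of two sentences with simnet_bow; the paired-text
# argument layout below is an assumption and may differ across versions.
simnet_bow = hub.Module(name="simnet_bow")
results = simnet_bow.similarity(texts=[["这道题太难了", "这道题不简单"]])
print(results)  # one similarity score per text pair
```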
...@@ -74,23 +74,23 @@ class BertModel(object): ...@@ -74,23 +74,23 @@ class BertModel(object):
def _build_model(self, src_ids, position_ids, sentence_ids, input_mask): def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
# padding id in vocabulary must be set to 0 # padding id in vocabulary must be set to 0
emb_out = fluid.layers.embedding( emb_out = fluid.layers.embedding(input=src_ids,
input=src_ids,
size=[self._voc_size, self._emb_size], size=[self._voc_size, self._emb_size],
dtype=self._dtype, dtype=self._dtype,
param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), param_attr=fluid.ParamAttr(name=self._word_emb_name,
initializer=self._param_initializer),
is_sparse=False) is_sparse=False)
position_emb_out = fluid.layers.embedding( position_emb_out = fluid.layers.embedding(input=position_ids,
input=position_ids,
size=[self._max_position_seq_len, self._emb_size], size=[self._max_position_seq_len, self._emb_size],
dtype=self._dtype, dtype=self._dtype,
param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) param_attr=fluid.ParamAttr(name=self._pos_emb_name,
initializer=self._param_initializer))
sent_emb_out = fluid.layers.embedding( sent_emb_out = fluid.layers.embedding(sentence_ids,
sentence_ids,
size=[self._sent_types, self._emb_size], size=[self._sent_types, self._emb_size],
dtype=self._dtype, dtype=self._dtype,
param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) param_attr=fluid.ParamAttr(name=self._sent_emb_name,
initializer=self._param_initializer))
emb_out = emb_out + position_emb_out emb_out = emb_out + position_emb_out
emb_out = emb_out + sent_emb_out emb_out = emb_out + sent_emb_out
...@@ -105,8 +105,7 @@ class BertModel(object): ...@@ -105,8 +105,7 @@ class BertModel(object):
n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
n_head_self_attn_mask.stop_gradient = True n_head_self_attn_mask.stop_gradient = True
self._enc_out = encoder( self._enc_out = encoder(enc_input=emb_out,
enc_input=emb_out,
attn_bias=n_head_self_attn_mask, attn_bias=n_head_self_attn_mask,
n_layer=self._n_layer, n_layer=self._n_layer,
n_head=self._n_head, n_head=self._n_head,
...@@ -130,11 +129,11 @@ class BertModel(object): ...@@ -130,11 +129,11 @@ class BertModel(object):
"""Get the first feature of each sequence for classification""" """Get the first feature of each sequence for classification"""
next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
next_sent_feat = fluid.layers.fc( next_sent_feat = fluid.layers.fc(input=next_sent_feat,
input=next_sent_feat,
size=self._emb_size, size=self._emb_size,
act="tanh", act="tanh",
param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
initializer=self._param_initializer),
bias_attr="pooled_fc.b_0") bias_attr="pooled_fc.b_0")
return next_sent_feat return next_sent_feat
@@ -150,43 +149,45 @@ class BertModel(object):
        mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)

        # transform: fc
        mask_trans_feat = fluid.layers.fc(input=mask_feat,
                                          size=self._emb_size,
                                          act=self._hidden_act,
                                          param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
                                                                     initializer=self._param_initializer),
                                          bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
        # transform: layer norm
        mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')

        mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
                                                initializer=fluid.initializer.Constant(value=0.0))
        if self._weight_sharing:
            fc_out = fluid.layers.matmul(x=mask_trans_feat,
                                         y=fluid.default_main_program().global_block().var(self._word_emb_name),
                                         transpose_y=True)
            fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
                                                    dtype=self._dtype,
                                                    attr=mask_lm_out_bias_attr,
                                                    is_bias=True)
        else:
            fc_out = fluid.layers.fc(input=mask_trans_feat,
                                     size=self._voc_size,
                                     param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
                                                                initializer=self._param_initializer),
                                     bias_attr=mask_lm_out_bias_attr)

        mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
        mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)

        next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
                                           size=2,
                                           param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
                                                                      initializer=self._param_initializer),
                                           bias_attr="next_sent_fc.b_0")

        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
                                                                                    label=labels,
                                                                                    return_softmax=True)
        next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
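# A rough NumPy sketch of the weight-sharing branch above: when self._weight_sharing
# is on, the masked-LM logits reuse the word embedding matrix as the output
# projection instead of a separate voc_size FC weight. Shapes are assumptions:
# mask_trans_feat [n_mask, emb_size], word_emb [voc_size, emb_size], out_bias [voc_size].
import numpy as np

def mask_lm_logits_sketch(mask_trans_feat, word_emb, out_bias):
    # matmul with the transposed embedding table (transpose_y=True above)
    logits = mask_trans_feat @ word_emb.T    # [n_mask, voc_size]
    return logits + out_bias                 # broadcast the per-vocabulary bias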
......
@@ -50,20 +50,17 @@ def multi_head_attention(queries,
        """
        Add linear projection to queries, keys, and values.
        """
        q = layers.fc(input=queries,
                      size=d_key * n_head,
                      num_flatten_dims=2,
                      param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
                      bias_attr=name + '_query_fc.b_0')
        k = layers.fc(input=keys,
                      size=d_key * n_head,
                      num_flatten_dims=2,
                      param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
                      bias_attr=name + '_key_fc.b_0')
        v = layers.fc(input=values,
                      size=d_value * n_head,
                      num_flatten_dims=2,
                      param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
@@ -110,8 +107,10 @@ def multi_head_attention(queries,
            product += attn_bias
        weights = layers.softmax(product)
        if dropout_rate:
            weights = layers.dropout(weights,
                                     dropout_prob=dropout_rate,
                                     dropout_implementation="upscale_in_train",
                                     is_test=False)
        out = layers.matmul(weights, v)
        return out
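# An illustrative NumPy sketch of this hunk: bias the attention scores, apply
# softmax over the keys, optionally drop out the weights, then mix the value
# vectors. "upscale_in_train" dropout scales kept activations by 1/(1-p) during
# training and is a no-op at inference; scaling q by d_key**-0.5 happens before
# the lines shown in this hunk.
import numpy as np

def attention_sketch(product, attn_bias, v, dropout_rate=0.0, training=True):
    scores = product + attn_bias
    scores -= scores.max(axis=-1, keepdims=True)            # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)           # softmax over keys
    if dropout_rate and training:
        keep = np.random.rand(*weights.shape) >= dropout_rate
        weights = weights * keep / (1.0 - dropout_rate)       # upscale_in_train
    return weights @ v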
@@ -133,8 +132,7 @@ def multi_head_attention(queries,
    out = __combine_heads(ctx_multiheads)

    # Project back to the model size.
    proj_out = layers.fc(input=out,
                         size=d_model,
                         num_flatten_dims=2,
                         param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
@@ -148,18 +146,18 @@ def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, p
    This module consists of two linear transformations with a ReLU activation
    in between, which is applied to each position separately and identically.
    """
    hidden = layers.fc(input=x,
                       size=d_inner_hid,
                       num_flatten_dims=2,
                       act=hidden_act,
                       param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
                       bias_attr=name + '_fc_0.b_0')
    if dropout_rate:
        hidden = layers.dropout(hidden,
                                dropout_prob=dropout_rate,
                                dropout_implementation="upscale_in_train",
                                is_test=False)
    out = layers.fc(input=hidden,
                    size=d_hid,
                    num_flatten_dims=2,
                    param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
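# A small NumPy sketch of the position-wise feed-forward block above:
# act(x @ W0 + b0), optional dropout, then a second linear map, applied to each
# position independently. w0/b0/w1/b1 are placeholders for the '_fc_0'/'_fc_1'
# parameters; ReLU stands in for hidden_act, which is configurable in the code.
import numpy as np

def ffn_sketch(x, w0, b0, w1, b1, dropout_rate=0.0, training=True):
    hidden = np.maximum(x @ w0 + b0, 0.0)                    # ReLU-style hidden_act
    if dropout_rate and training:
        keep = np.random.rand(*hidden.shape) >= dropout_rate
        hidden = hidden * keep / (1.0 - dropout_rate)         # upscale_in_train
    return hidden @ w1 + b1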
@@ -181,17 +179,20 @@ def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name='')
            out_dtype = out.dtype
            if out_dtype == fluid.core.VarDesc.VarType.FP16:
                out = layers.cast(x=out, dtype="float32")
            out = layers.layer_norm(out,
                                    begin_norm_axis=len(out.shape) - 1,
                                    param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
                                                               initializer=fluid.initializer.Constant(1.)),
                                    bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
                                                              initializer=fluid.initializer.Constant(0.)))
            if out_dtype == fluid.core.VarDesc.VarType.FP16:
                out = layers.cast(x=out, dtype="float16")
        elif cmd == "d":  # add dropout
            if dropout_rate:
                out = layers.dropout(out,
                                     dropout_prob=dropout_rate,
                                     dropout_implementation="upscale_in_train",
                                     is_test=False)
    return out
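# A compact NumPy sketch of the "n" command above: layer normalization over the
# last axis with a learned scale and bias (initialized to 1 and 0, matching the
# Constant initializers). eps is an assumed small stabilizing constant.
import numpy as np

def layer_norm_sketch(x, scale, bias, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * scale + bias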
@@ -220,8 +221,10 @@ def encoder_layer(enc_input,
    with the post_process_layer to add residual connection, layer normalization
    and dropout.
    """
    attn_output = multi_head_attention(pre_process_layer(enc_input,
                                                         preprocess_cmd,
                                                         prepostprocess_dropout,
                                                         name=name + '_pre_att'),
                                       None,
                                       None,
                                       attn_bias,
@@ -232,10 +235,15 @@ def encoder_layer(enc_input,
                                       attention_dropout,
                                       param_initializer=param_initializer,
                                       name=name + '_multi_head_att')
    attn_output = post_process_layer(enc_input,
                                     attn_output,
                                     postprocess_cmd,
                                     prepostprocess_dropout,
                                     name=name + '_post_att')
    ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
                                                             preprocess_cmd,
                                                             prepostprocess_dropout,
                                                             name=name + '_pre_ffn'),
                                           d_inner_hid,
                                           d_model,
                                           relu_dropout,
@@ -266,8 +274,7 @@ def encoder(enc_input,
    encoder_layer.
    """
    for i in range(n_layer):
        enc_output = encoder_layer(enc_input,
                                   attn_bias,
                                   n_head,
                                   d_key,
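# A hedged sketch of the loop above: the encoder presumably chains n_layer
# identical encoder layers, feeding each layer's output in as the next layer's
# input (the reassignment itself falls outside the lines shown in this hunk).
def encoder_sketch(enc_input, layer_fns):
    out = enc_input
    for layer_fn in layer_fns:      # one callable per encoder layer
        out = layer_fn(out)
    return out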
......
@@ -58,8 +58,7 @@ class Bert(TransformerModule):
            pooled_output (tensor): sentence-level output for classification task.
            sequence_output (tensor): token-level output for sequence task.
        """
        bert = BertModel(src_ids=input_ids,
                         position_ids=position_ids,
                         sentence_ids=segment_ids,
                         input_mask=input_mask,
......
...@@ -74,23 +74,23 @@ class BertModel(object): ...@@ -74,23 +74,23 @@ class BertModel(object):
def _build_model(self, src_ids, position_ids, sentence_ids, input_mask): def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
# padding id in vocabulary must be set to 0 # padding id in vocabulary must be set to 0
emb_out = fluid.layers.embedding( emb_out = fluid.layers.embedding(input=src_ids,
input=src_ids,
size=[self._voc_size, self._emb_size], size=[self._voc_size, self._emb_size],
dtype=self._dtype, dtype=self._dtype,
param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), param_attr=fluid.ParamAttr(name=self._word_emb_name,
initializer=self._param_initializer),
is_sparse=False) is_sparse=False)
position_emb_out = fluid.layers.embedding( position_emb_out = fluid.layers.embedding(input=position_ids,
input=position_ids,
size=[self._max_position_seq_len, self._emb_size], size=[self._max_position_seq_len, self._emb_size],
dtype=self._dtype, dtype=self._dtype,
param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) param_attr=fluid.ParamAttr(name=self._pos_emb_name,
initializer=self._param_initializer))
sent_emb_out = fluid.layers.embedding( sent_emb_out = fluid.layers.embedding(sentence_ids,
sentence_ids,
size=[self._sent_types, self._emb_size], size=[self._sent_types, self._emb_size],
dtype=self._dtype, dtype=self._dtype,
param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) param_attr=fluid.ParamAttr(name=self._sent_emb_name,
initializer=self._param_initializer))
emb_out = emb_out + position_emb_out emb_out = emb_out + position_emb_out
emb_out = emb_out + sent_emb_out emb_out = emb_out + sent_emb_out
...@@ -105,8 +105,7 @@ class BertModel(object): ...@@ -105,8 +105,7 @@ class BertModel(object):
n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
n_head_self_attn_mask.stop_gradient = True n_head_self_attn_mask.stop_gradient = True
self._enc_out = encoder( self._enc_out = encoder(enc_input=emb_out,
enc_input=emb_out,
attn_bias=n_head_self_attn_mask, attn_bias=n_head_self_attn_mask,
n_layer=self._n_layer, n_layer=self._n_layer,
n_head=self._n_head, n_head=self._n_head,
...@@ -130,11 +129,11 @@ class BertModel(object): ...@@ -130,11 +129,11 @@ class BertModel(object):
"""Get the first feature of each sequence for classification""" """Get the first feature of each sequence for classification"""
next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
next_sent_feat = fluid.layers.fc( next_sent_feat = fluid.layers.fc(input=next_sent_feat,
input=next_sent_feat,
size=self._emb_size, size=self._emb_size,
act="tanh", act="tanh",
param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
initializer=self._param_initializer),
bias_attr="pooled_fc.b_0") bias_attr="pooled_fc.b_0")
return next_sent_feat return next_sent_feat
...@@ -150,43 +149,45 @@ class BertModel(object): ...@@ -150,43 +149,45 @@ class BertModel(object):
mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
# transform: fc # transform: fc
mask_trans_feat = fluid.layers.fc( mask_trans_feat = fluid.layers.fc(input=mask_feat,
input=mask_feat,
size=self._emb_size, size=self._emb_size,
act=self._hidden_act, act=self._hidden_act,
param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer), param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
initializer=self._param_initializer),
bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
# transform: layer norm # transform: layer norm
mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans') mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
mask_lm_out_bias_attr = fluid.ParamAttr( mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0)) initializer=fluid.initializer.Constant(value=0.0))
if self._weight_sharing: if self._weight_sharing:
fc_out = fluid.layers.matmul( fc_out = fluid.layers.matmul(x=mask_trans_feat,
x=mask_trans_feat,
y=fluid.default_main_program().global_block().var(self._word_emb_name), y=fluid.default_main_program().global_block().var(self._word_emb_name),
transpose_y=True) transpose_y=True)
fc_out += fluid.layers.create_parameter( fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
shape=[self._voc_size], dtype=self._dtype, attr=mask_lm_out_bias_attr, is_bias=True) dtype=self._dtype,
attr=mask_lm_out_bias_attr,
is_bias=True)
else: else:
fc_out = fluid.layers.fc( fc_out = fluid.layers.fc(input=mask_trans_feat,
input=mask_trans_feat,
size=self._voc_size, size=self._voc_size,
param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
initializer=self._param_initializer),
bias_attr=mask_lm_out_bias_attr) bias_attr=mask_lm_out_bias_attr)
mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
next_sent_fc_out = fluid.layers.fc( next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
input=next_sent_feat,
size=2, size=2,
param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", initializer=self._param_initializer), param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
initializer=self._param_initializer),
bias_attr="next_sent_fc.b_0") bias_attr="next_sent_fc.b_0")
next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy( next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
logits=next_sent_fc_out, label=labels, return_softmax=True) label=labels,
return_softmax=True)
next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels) next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
......
...@@ -50,20 +50,17 @@ def multi_head_attention(queries, ...@@ -50,20 +50,17 @@ def multi_head_attention(queries,
""" """
Add linear projection to queries, keys, and values. Add linear projection to queries, keys, and values.
""" """
q = layers.fc( q = layers.fc(input=queries,
input=queries,
size=d_key * n_head, size=d_key * n_head,
num_flatten_dims=2, num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0') bias_attr=name + '_query_fc.b_0')
k = layers.fc( k = layers.fc(input=keys,
input=keys,
size=d_key * n_head, size=d_key * n_head,
num_flatten_dims=2, num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0') bias_attr=name + '_key_fc.b_0')
v = layers.fc( v = layers.fc(input=values,
input=values,
size=d_value * n_head, size=d_value * n_head,
num_flatten_dims=2, num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
...@@ -110,8 +107,10 @@ def multi_head_attention(queries, ...@@ -110,8 +107,10 @@ def multi_head_attention(queries,
product += attn_bias product += attn_bias
weights = layers.softmax(product) weights = layers.softmax(product)
if dropout_rate: if dropout_rate:
weights = layers.dropout( weights = layers.dropout(weights,
weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v) out = layers.matmul(weights, v)
return out return out
...@@ -133,8 +132,7 @@ def multi_head_attention(queries, ...@@ -133,8 +132,7 @@ def multi_head_attention(queries,
out = __combine_heads(ctx_multiheads) out = __combine_heads(ctx_multiheads)
# Project back to the model size. # Project back to the model size.
proj_out = layers.fc( proj_out = layers.fc(input=out,
input=out,
size=d_model, size=d_model,
num_flatten_dims=2, num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
...@@ -148,18 +146,18 @@ def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, p ...@@ -148,18 +146,18 @@ def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, p
This module consists of two linear transformations with a ReLU activation This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically. in between, which is applied to each position separately and identically.
""" """
hidden = layers.fc( hidden = layers.fc(input=x,
input=x,
size=d_inner_hid, size=d_inner_hid,
num_flatten_dims=2, num_flatten_dims=2,
act=hidden_act, act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0') bias_attr=name + '_fc_0.b_0')
if dropout_rate: if dropout_rate:
hidden = layers.dropout( hidden = layers.dropout(hidden,
hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) dropout_prob=dropout_rate,
out = layers.fc( dropout_implementation="upscale_in_train",
input=hidden, is_test=False)
out = layers.fc(input=hidden,
size=d_hid, size=d_hid,
num_flatten_dims=2, num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
...@@ -181,17 +179,20 @@ def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name='') ...@@ -181,17 +179,20 @@ def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name='')
out_dtype = out.dtype out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16: if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32") out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm( out = layers.layer_norm(out,
out,
begin_norm_axis=len(out.shape) - 1, begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.))) initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16: if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16") out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout elif cmd == "d": # add dropout
if dropout_rate: if dropout_rate:
out = layers.dropout( out = layers.dropout(out,
out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out return out
...@@ -220,8 +221,10 @@ def encoder_layer(enc_input, ...@@ -220,8 +221,10 @@ def encoder_layer(enc_input,
with the post_process_layer to add residual connection, layer normalization with the post_process_layer to add residual connection, layer normalization
and droput. and droput.
""" """
attn_output = multi_head_attention( attn_output = multi_head_attention(pre_process_layer(enc_input,
pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'), preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None, None,
None, None,
attn_bias, attn_bias,
...@@ -232,10 +235,15 @@ def encoder_layer(enc_input, ...@@ -232,10 +235,15 @@ def encoder_layer(enc_input,
attention_dropout, attention_dropout,
param_initializer=param_initializer, param_initializer=param_initializer,
name=name + '_multi_head_att') name=name + '_multi_head_att')
attn_output = post_process_layer( attn_output = post_process_layer(enc_input,
enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att') attn_output,
ffd_output = positionwise_feed_forward( postprocess_cmd,
pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'), prepostprocess_dropout,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid, d_inner_hid,
d_model, d_model,
relu_dropout, relu_dropout,
...@@ -266,8 +274,7 @@ def encoder(enc_input, ...@@ -266,8 +274,7 @@ def encoder(enc_input,
encoder_layer. encoder_layer.
""" """
for i in range(n_layer): for i in range(n_layer):
enc_output = encoder_layer( enc_output = encoder_layer(enc_input,
enc_input,
attn_bias, attn_bias,
n_head, n_head,
d_key, d_key,
......
...@@ -58,8 +58,7 @@ class Bert(TransformerModule): ...@@ -58,8 +58,7 @@ class Bert(TransformerModule):
pooled_output (tensor): sentence-level output for classification task. pooled_output (tensor): sentence-level output for classification task.
sequence_output (tensor): token-level output for sequence task. sequence_output (tensor): token-level output for sequence task.
""" """
bert = BertModel( bert = BertModel(src_ids=input_ids,
src_ids=input_ids,
position_ids=position_ids, position_ids=position_ids,
sentence_ids=segment_ids, sentence_ids=segment_ids,
input_mask=input_mask, input_mask=input_mask,
......
...@@ -74,23 +74,23 @@ class BertModel(object): ...@@ -74,23 +74,23 @@ class BertModel(object):
def _build_model(self, src_ids, position_ids, sentence_ids, input_mask): def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
# padding id in vocabulary must be set to 0 # padding id in vocabulary must be set to 0
emb_out = fluid.layers.embedding( emb_out = fluid.layers.embedding(input=src_ids,
input=src_ids,
size=[self._voc_size, self._emb_size], size=[self._voc_size, self._emb_size],
dtype=self._dtype, dtype=self._dtype,
param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), param_attr=fluid.ParamAttr(name=self._word_emb_name,
initializer=self._param_initializer),
is_sparse=False) is_sparse=False)
position_emb_out = fluid.layers.embedding( position_emb_out = fluid.layers.embedding(input=position_ids,
input=position_ids,
size=[self._max_position_seq_len, self._emb_size], size=[self._max_position_seq_len, self._emb_size],
dtype=self._dtype, dtype=self._dtype,
param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) param_attr=fluid.ParamAttr(name=self._pos_emb_name,
initializer=self._param_initializer))
sent_emb_out = fluid.layers.embedding( sent_emb_out = fluid.layers.embedding(sentence_ids,
sentence_ids,
size=[self._sent_types, self._emb_size], size=[self._sent_types, self._emb_size],
dtype=self._dtype, dtype=self._dtype,
param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) param_attr=fluid.ParamAttr(name=self._sent_emb_name,
initializer=self._param_initializer))
emb_out = emb_out + position_emb_out emb_out = emb_out + position_emb_out
emb_out = emb_out + sent_emb_out emb_out = emb_out + sent_emb_out
...@@ -105,8 +105,7 @@ class BertModel(object): ...@@ -105,8 +105,7 @@ class BertModel(object):
n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
n_head_self_attn_mask.stop_gradient = True n_head_self_attn_mask.stop_gradient = True
self._enc_out = encoder( self._enc_out = encoder(enc_input=emb_out,
enc_input=emb_out,
attn_bias=n_head_self_attn_mask, attn_bias=n_head_self_attn_mask,
n_layer=self._n_layer, n_layer=self._n_layer,
n_head=self._n_head, n_head=self._n_head,
...@@ -130,11 +129,11 @@ class BertModel(object): ...@@ -130,11 +129,11 @@ class BertModel(object):
"""Get the first feature of each sequence for classification""" """Get the first feature of each sequence for classification"""
next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
next_sent_feat = fluid.layers.fc( next_sent_feat = fluid.layers.fc(input=next_sent_feat,
input=next_sent_feat,
size=self._emb_size, size=self._emb_size,
act="tanh", act="tanh",
param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
initializer=self._param_initializer),
bias_attr="pooled_fc.b_0") bias_attr="pooled_fc.b_0")
return next_sent_feat return next_sent_feat
...@@ -150,43 +149,45 @@ class BertModel(object): ...@@ -150,43 +149,45 @@ class BertModel(object):
mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
# transform: fc # transform: fc
mask_trans_feat = fluid.layers.fc( mask_trans_feat = fluid.layers.fc(input=mask_feat,
input=mask_feat,
size=self._emb_size, size=self._emb_size,
act=self._hidden_act, act=self._hidden_act,
param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer), param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
initializer=self._param_initializer),
bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
# transform: layer norm # transform: layer norm
mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans') mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
mask_lm_out_bias_attr = fluid.ParamAttr( mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0)) initializer=fluid.initializer.Constant(value=0.0))
if self._weight_sharing: if self._weight_sharing:
fc_out = fluid.layers.matmul( fc_out = fluid.layers.matmul(x=mask_trans_feat,
x=mask_trans_feat,
y=fluid.default_main_program().global_block().var(self._word_emb_name), y=fluid.default_main_program().global_block().var(self._word_emb_name),
transpose_y=True) transpose_y=True)
fc_out += fluid.layers.create_parameter( fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
shape=[self._voc_size], dtype=self._dtype, attr=mask_lm_out_bias_attr, is_bias=True) dtype=self._dtype,
attr=mask_lm_out_bias_attr,
is_bias=True)
else: else:
fc_out = fluid.layers.fc( fc_out = fluid.layers.fc(input=mask_trans_feat,
input=mask_trans_feat,
size=self._voc_size, size=self._voc_size,
param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
initializer=self._param_initializer),
bias_attr=mask_lm_out_bias_attr) bias_attr=mask_lm_out_bias_attr)
mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
next_sent_fc_out = fluid.layers.fc( next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
input=next_sent_feat,
size=2, size=2,
param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", initializer=self._param_initializer), param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
initializer=self._param_initializer),
bias_attr="next_sent_fc.b_0") bias_attr="next_sent_fc.b_0")
next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy( next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
logits=next_sent_fc_out, label=labels, return_softmax=True) label=labels,
return_softmax=True)
next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels) next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
......
...@@ -50,20 +50,17 @@ def multi_head_attention(queries, ...@@ -50,20 +50,17 @@ def multi_head_attention(queries,
""" """
Add linear projection to queries, keys, and values. Add linear projection to queries, keys, and values.
""" """
q = layers.fc( q = layers.fc(input=queries,
input=queries,
size=d_key * n_head, size=d_key * n_head,
num_flatten_dims=2, num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0') bias_attr=name + '_query_fc.b_0')
k = layers.fc( k = layers.fc(input=keys,
input=keys,
size=d_key * n_head, size=d_key * n_head,
num_flatten_dims=2, num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0') bias_attr=name + '_key_fc.b_0')
v = layers.fc( v = layers.fc(input=values,
input=values,
size=d_value * n_head, size=d_value * n_head,
num_flatten_dims=2, num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
...@@ -110,8 +107,10 @@ def multi_head_attention(queries, ...@@ -110,8 +107,10 @@ def multi_head_attention(queries,
product += attn_bias product += attn_bias
weights = layers.softmax(product) weights = layers.softmax(product)
if dropout_rate: if dropout_rate:
weights = layers.dropout( weights = layers.dropout(weights,
weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v) out = layers.matmul(weights, v)
return out return out
...@@ -133,8 +132,7 @@ def multi_head_attention(queries, ...@@ -133,8 +132,7 @@ def multi_head_attention(queries,
out = __combine_heads(ctx_multiheads) out = __combine_heads(ctx_multiheads)
# Project back to the model size. # Project back to the model size.
proj_out = layers.fc( proj_out = layers.fc(input=out,
input=out,
size=d_model, size=d_model,
num_flatten_dims=2, num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
...@@ -148,18 +146,18 @@ def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, p ...@@ -148,18 +146,18 @@ def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, p
This module consists of two linear transformations with a ReLU activation This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically. in between, which is applied to each position separately and identically.
""" """
hidden = layers.fc( hidden = layers.fc(input=x,
input=x,
size=d_inner_hid, size=d_inner_hid,
num_flatten_dims=2, num_flatten_dims=2,
act=hidden_act, act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0') bias_attr=name + '_fc_0.b_0')
if dropout_rate: if dropout_rate:
hidden = layers.dropout( hidden = layers.dropout(hidden,
hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) dropout_prob=dropout_rate,
out = layers.fc( dropout_implementation="upscale_in_train",
input=hidden, is_test=False)
out = layers.fc(input=hidden,
size=d_hid, size=d_hid,
num_flatten_dims=2, num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
...@@ -181,17 +179,20 @@ def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name='') ...@@ -181,17 +179,20 @@ def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name='')
out_dtype = out.dtype out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16: if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32") out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm( out = layers.layer_norm(out,
out,
begin_norm_axis=len(out.shape) - 1, begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.))) initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16: if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16") out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout elif cmd == "d": # add dropout
if dropout_rate: if dropout_rate:
out = layers.dropout( out = layers.dropout(out,
out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out return out
...@@ -220,8 +221,10 @@ def encoder_layer(enc_input, ...@@ -220,8 +221,10 @@ def encoder_layer(enc_input,
with the post_process_layer to add residual connection, layer normalization with the post_process_layer to add residual connection, layer normalization
and droput. and droput.
""" """
attn_output = multi_head_attention( attn_output = multi_head_attention(pre_process_layer(enc_input,
pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'), preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None, None,
None, None,
attn_bias, attn_bias,
...@@ -232,10 +235,15 @@ def encoder_layer(enc_input, ...@@ -232,10 +235,15 @@ def encoder_layer(enc_input,
attention_dropout, attention_dropout,
param_initializer=param_initializer, param_initializer=param_initializer,
name=name + '_multi_head_att') name=name + '_multi_head_att')
attn_output = post_process_layer( attn_output = post_process_layer(enc_input,
enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att') attn_output,
ffd_output = positionwise_feed_forward( postprocess_cmd,
pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'), prepostprocess_dropout,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid, d_inner_hid,
d_model, d_model,
relu_dropout, relu_dropout,
...@@ -266,8 +274,7 @@ def encoder(enc_input, ...@@ -266,8 +274,7 @@ def encoder(enc_input,
encoder_layer. encoder_layer.
""" """
for i in range(n_layer): for i in range(n_layer):
enc_output = encoder_layer( enc_output = encoder_layer(enc_input,
enc_input,
attn_bias, attn_bias,
n_head, n_head,
d_key, d_key,
......
...@@ -58,8 +58,7 @@ class BertChinese(TransformerModule): ...@@ -58,8 +58,7 @@ class BertChinese(TransformerModule):
pooled_output (tensor): sentence-level output for classification task. pooled_output (tensor): sentence-level output for classification task.
sequence_output (tensor): token-level output for sequence task. sequence_output (tensor): token-level output for sequence task.
""" """
bert = BertModel( bert = BertModel(src_ids=input_ids,
src_ids=input_ids,
position_ids=position_ids, position_ids=position_ids,
sentence_ids=segment_ids, sentence_ids=segment_ids,
input_mask=input_mask, input_mask=input_mask,
......
...@@ -74,23 +74,23 @@ class BertModel(object): ...@@ -74,23 +74,23 @@ class BertModel(object):
def _build_model(self, src_ids, position_ids, sentence_ids, input_mask): def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
# padding id in vocabulary must be set to 0 # padding id in vocabulary must be set to 0
emb_out = fluid.layers.embedding( emb_out = fluid.layers.embedding(input=src_ids,
input=src_ids,
size=[self._voc_size, self._emb_size], size=[self._voc_size, self._emb_size],
dtype=self._dtype, dtype=self._dtype,
param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), param_attr=fluid.ParamAttr(name=self._word_emb_name,
initializer=self._param_initializer),
is_sparse=False) is_sparse=False)
position_emb_out = fluid.layers.embedding( position_emb_out = fluid.layers.embedding(input=position_ids,
input=position_ids,
size=[self._max_position_seq_len, self._emb_size], size=[self._max_position_seq_len, self._emb_size],
dtype=self._dtype, dtype=self._dtype,
param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) param_attr=fluid.ParamAttr(name=self._pos_emb_name,
initializer=self._param_initializer))
sent_emb_out = fluid.layers.embedding( sent_emb_out = fluid.layers.embedding(sentence_ids,
sentence_ids,
size=[self._sent_types, self._emb_size], size=[self._sent_types, self._emb_size],
dtype=self._dtype, dtype=self._dtype,
param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) param_attr=fluid.ParamAttr(name=self._sent_emb_name,
initializer=self._param_initializer))
emb_out = emb_out + position_emb_out emb_out = emb_out + position_emb_out
emb_out = emb_out + sent_emb_out emb_out = emb_out + sent_emb_out
...@@ -105,8 +105,7 @@ class BertModel(object): ...@@ -105,8 +105,7 @@ class BertModel(object):
n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
n_head_self_attn_mask.stop_gradient = True n_head_self_attn_mask.stop_gradient = True
self._enc_out = encoder( self._enc_out = encoder(enc_input=emb_out,
enc_input=emb_out,
attn_bias=n_head_self_attn_mask, attn_bias=n_head_self_attn_mask,
n_layer=self._n_layer, n_layer=self._n_layer,
n_head=self._n_head, n_head=self._n_head,
...@@ -130,11 +129,11 @@ class BertModel(object): ...@@ -130,11 +129,11 @@ class BertModel(object):
"""Get the first feature of each sequence for classification""" """Get the first feature of each sequence for classification"""
next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
next_sent_feat = fluid.layers.fc( next_sent_feat = fluid.layers.fc(input=next_sent_feat,
input=next_sent_feat,
size=self._emb_size, size=self._emb_size,
act="tanh", act="tanh",
param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
initializer=self._param_initializer),
bias_attr="pooled_fc.b_0") bias_attr="pooled_fc.b_0")
return next_sent_feat return next_sent_feat
...@@ -150,43 +149,45 @@ class BertModel(object): ...@@ -150,43 +149,45 @@ class BertModel(object):
mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
# transform: fc # transform: fc
mask_trans_feat = fluid.layers.fc( mask_trans_feat = fluid.layers.fc(input=mask_feat,
input=mask_feat,
size=self._emb_size, size=self._emb_size,
act=self._hidden_act, act=self._hidden_act,
param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer), param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
initializer=self._param_initializer),
bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
# transform: layer norm # transform: layer norm
mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans') mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
mask_lm_out_bias_attr = fluid.ParamAttr( mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0)) initializer=fluid.initializer.Constant(value=0.0))
if self._weight_sharing: if self._weight_sharing:
fc_out = fluid.layers.matmul( fc_out = fluid.layers.matmul(x=mask_trans_feat,
x=mask_trans_feat,
y=fluid.default_main_program().global_block().var(self._word_emb_name), y=fluid.default_main_program().global_block().var(self._word_emb_name),
transpose_y=True) transpose_y=True)
fc_out += fluid.layers.create_parameter( fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
shape=[self._voc_size], dtype=self._dtype, attr=mask_lm_out_bias_attr, is_bias=True) dtype=self._dtype,
attr=mask_lm_out_bias_attr,
is_bias=True)
else: else:
fc_out = fluid.layers.fc( fc_out = fluid.layers.fc(input=mask_trans_feat,
input=mask_trans_feat,
size=self._voc_size, size=self._voc_size,
param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
initializer=self._param_initializer),
bias_attr=mask_lm_out_bias_attr) bias_attr=mask_lm_out_bias_attr)
mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
next_sent_fc_out = fluid.layers.fc( next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
input=next_sent_feat,
size=2, size=2,
param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", initializer=self._param_initializer), param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
initializer=self._param_initializer),
bias_attr="next_sent_fc.b_0") bias_attr="next_sent_fc.b_0")
next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy( next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
logits=next_sent_fc_out, label=labels, return_softmax=True) label=labels,
return_softmax=True)
next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels) next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
......
...@@ -50,20 +50,17 @@ def multi_head_attention(queries, ...@@ -50,20 +50,17 @@ def multi_head_attention(queries,
""" """
Add linear projection to queries, keys, and values. Add linear projection to queries, keys, and values.
""" """
q = layers.fc( q = layers.fc(input=queries,
input=queries,
size=d_key * n_head, size=d_key * n_head,
num_flatten_dims=2, num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0') bias_attr=name + '_query_fc.b_0')
k = layers.fc( k = layers.fc(input=keys,
input=keys,
size=d_key * n_head, size=d_key * n_head,
num_flatten_dims=2, num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0') bias_attr=name + '_key_fc.b_0')
v = layers.fc( v = layers.fc(input=values,
input=values,
size=d_value * n_head, size=d_value * n_head,
num_flatten_dims=2, num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
...@@ -110,8 +107,10 @@ def multi_head_attention(queries, ...@@ -110,8 +107,10 @@ def multi_head_attention(queries,
product += attn_bias product += attn_bias
weights = layers.softmax(product) weights = layers.softmax(product)
if dropout_rate: if dropout_rate:
weights = layers.dropout( weights = layers.dropout(weights,
weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v) out = layers.matmul(weights, v)
return out return out
...@@ -133,8 +132,7 @@ def multi_head_attention(queries, ...@@ -133,8 +132,7 @@ def multi_head_attention(queries,
out = __combine_heads(ctx_multiheads) out = __combine_heads(ctx_multiheads)
# Project back to the model size. # Project back to the model size.
proj_out = layers.fc( proj_out = layers.fc(input=out,
input=out,
size=d_model, size=d_model,
num_flatten_dims=2, num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
...@@ -148,18 +146,18 @@ def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, p ...@@ -148,18 +146,18 @@ def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, p
This module consists of two linear transformations with a ReLU activation This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically. in between, which is applied to each position separately and identically.
""" """
hidden = layers.fc( hidden = layers.fc(input=x,
input=x,
size=d_inner_hid, size=d_inner_hid,
num_flatten_dims=2, num_flatten_dims=2,
act=hidden_act, act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0') bias_attr=name + '_fc_0.b_0')
if dropout_rate: if dropout_rate:
hidden = layers.dropout( hidden = layers.dropout(hidden,
hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) dropout_prob=dropout_rate,
out = layers.fc( dropout_implementation="upscale_in_train",
input=hidden, is_test=False)
out = layers.fc(input=hidden,
size=d_hid, size=d_hid,
num_flatten_dims=2, num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
...@@ -181,17 +179,20 @@ def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name='') ...@@ -181,17 +179,20 @@ def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name='')
out_dtype = out.dtype out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16: if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32") out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm( out = layers.layer_norm(out,
out,
begin_norm_axis=len(out.shape) - 1, begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.))) initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16: if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16") out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout elif cmd == "d": # add dropout
if dropout_rate: if dropout_rate:
out = layers.dropout( out = layers.dropout(out,
out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out return out
...@@ -220,8 +221,10 @@ def encoder_layer(enc_input, ...@@ -220,8 +221,10 @@ def encoder_layer(enc_input,
with the post_process_layer to add residual connection, layer normalization with the post_process_layer to add residual connection, layer normalization
and droput. and droput.
""" """
attn_output = multi_head_attention( attn_output = multi_head_attention(pre_process_layer(enc_input,
pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'), preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None, None,
None, None,
attn_bias, attn_bias,
...@@ -232,10 +235,15 @@ def encoder_layer(enc_input, ...@@ -232,10 +235,15 @@ def encoder_layer(enc_input,
attention_dropout, attention_dropout,
param_initializer=param_initializer, param_initializer=param_initializer,
name=name + '_multi_head_att') name=name + '_multi_head_att')
attn_output = post_process_layer( attn_output = post_process_layer(enc_input,
enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att') attn_output,
ffd_output = positionwise_feed_forward( postprocess_cmd,
pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'), prepostprocess_dropout,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid, d_inner_hid,
d_model, d_model,
relu_dropout, relu_dropout,
...@@ -266,8 +274,7 @@ def encoder(enc_input, ...@@ -266,8 +274,7 @@ def encoder(enc_input,
encoder_layer. encoder_layer.
""" """
for i in range(n_layer): for i in range(n_layer):
enc_output = encoder_layer( enc_output = encoder_layer(enc_input,
enc_input,
attn_bias, attn_bias,
n_head, n_head,
d_key, d_key,
......
...@@ -58,8 +58,7 @@ class Bert(TransformerModule): ...@@ -58,8 +58,7 @@ class Bert(TransformerModule):
pooled_output (tensor): sentence-level output for classification task. pooled_output (tensor): sentence-level output for classification task.
sequence_output (tensor): token-level output for sequence task. sequence_output (tensor): token-level output for sequence task.
""" """
bert = BertModel( bert = BertModel(src_ids=input_ids,
src_ids=input_ids,
            position_ids=position_ids,
            sentence_ids=segment_ids,
            input_mask=input_mask,
......
@@ -74,23 +74,23 @@ class BertModel(object):
     def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
         # padding id in vocabulary must be set to 0
-        emb_out = fluid.layers.embedding(
-            input=src_ids,
+        emb_out = fluid.layers.embedding(input=src_ids,
             size=[self._voc_size, self._emb_size],
             dtype=self._dtype,
-            param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer),
+            param_attr=fluid.ParamAttr(name=self._word_emb_name,
+                                       initializer=self._param_initializer),
             is_sparse=False)
-        position_emb_out = fluid.layers.embedding(
-            input=position_ids,
+        position_emb_out = fluid.layers.embedding(input=position_ids,
             size=[self._max_position_seq_len, self._emb_size],
             dtype=self._dtype,
-            param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer))
+            param_attr=fluid.ParamAttr(name=self._pos_emb_name,
+                                       initializer=self._param_initializer))
-        sent_emb_out = fluid.layers.embedding(
-            sentence_ids,
+        sent_emb_out = fluid.layers.embedding(sentence_ids,
             size=[self._sent_types, self._emb_size],
             dtype=self._dtype,
-            param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer))
+            param_attr=fluid.ParamAttr(name=self._sent_emb_name,
+                                       initializer=self._param_initializer))
         emb_out = emb_out + position_emb_out
         emb_out = emb_out + sent_emb_out
@@ -105,8 +105,7 @@ class BertModel(object):
         n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
         n_head_self_attn_mask.stop_gradient = True
-        self._enc_out = encoder(
-            enc_input=emb_out,
+        self._enc_out = encoder(enc_input=emb_out,
             attn_bias=n_head_self_attn_mask,
             n_layer=self._n_layer,
             n_head=self._n_head,
@@ -130,11 +129,11 @@ class BertModel(object):
         """Get the first feature of each sequence for classification"""
         next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
-        next_sent_feat = fluid.layers.fc(
-            input=next_sent_feat,
+        next_sent_feat = fluid.layers.fc(input=next_sent_feat,
             size=self._emb_size,
             act="tanh",
-            param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer),
+            param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
+                                       initializer=self._param_initializer),
             bias_attr="pooled_fc.b_0")
         return next_sent_feat
@@ -150,43 +149,45 @@ class BertModel(object):
         mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
         # transform: fc
-        mask_trans_feat = fluid.layers.fc(
-            input=mask_feat,
+        mask_trans_feat = fluid.layers.fc(input=mask_feat,
             size=self._emb_size,
             act=self._hidden_act,
-            param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer),
+            param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
+                                       initializer=self._param_initializer),
             bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
         # transform: layer norm
         mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
-        mask_lm_out_bias_attr = fluid.ParamAttr(
-            name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0))
+        mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
+                                                initializer=fluid.initializer.Constant(value=0.0))
         if self._weight_sharing:
-            fc_out = fluid.layers.matmul(
-                x=mask_trans_feat,
+            fc_out = fluid.layers.matmul(x=mask_trans_feat,
                 y=fluid.default_main_program().global_block().var(self._word_emb_name),
                 transpose_y=True)
-            fc_out += fluid.layers.create_parameter(
-                shape=[self._voc_size], dtype=self._dtype, attr=mask_lm_out_bias_attr, is_bias=True)
+            fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
+                                                    dtype=self._dtype,
+                                                    attr=mask_lm_out_bias_attr,
+                                                    is_bias=True)
         else:
-            fc_out = fluid.layers.fc(
-                input=mask_trans_feat,
+            fc_out = fluid.layers.fc(input=mask_trans_feat,
                 size=self._voc_size,
-                param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer),
+                param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
+                                           initializer=self._param_initializer),
                 bias_attr=mask_lm_out_bias_attr)
         mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
         mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
-        next_sent_fc_out = fluid.layers.fc(
-            input=next_sent_feat,
+        next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
             size=2,
-            param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", initializer=self._param_initializer),
+            param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
+                                       initializer=self._param_initializer),
             bias_attr="next_sent_fc.b_0")
-        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(
-            logits=next_sent_fc_out, label=labels, return_softmax=True)
+        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
+                                                                                    label=labels,
+                                                                                    return_softmax=True)
         next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
......
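The `n_head_self_attn_mask` built in the hunk above is the additive attention bias that the encoder below consumes. A rough numpy sketch of the idea (illustrative only; the visibility-matrix construction and the -10000 constant are assumptions for this sketch, not values read from this commit):

import numpy as np

mask = np.array([[1, 1, 1, 0]], dtype="float32")           # [batch, seq]; 0 marks a padded token
pair_mask = mask[:, :, None] * mask[:, None, :]             # [batch, seq, seq] visibility matrix
attn_bias = (pair_mask - 1.0) * 10000.0                     # 0 where visible, -10000 where masked
n_head_bias = np.repeat(attn_bias[:, None, :, :], 12, 1)    # one copy per head, like the stack() call above
print(n_head_bias.shape)                                    # (1, 12, 4, 4)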
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
computing the softmax activation to mask certain selected positions so that
they will not be considered in the attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of input tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
# permute the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of input tensor x
so that they become one dimension, which is the reverse of __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
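# Note (illustrative gloss, shapes assumed): in scaled_dot_product_attention above,
# q and k arrive as [batch, n_head, seq_len, d_key] and v as [batch, n_head, seq_len, d_value];
# the scaled product yields scores of shape [batch, n_head, seq_len, seq_len], so the helper
# computes softmax(q * k^T / sqrt(d_key) + attn_bias) * v. Large negative entries in attn_bias
# drive the corresponding softmax weights to ~0, which is how masked positions are ignored.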
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(hidden,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.fc(input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
"""
Add residual connection, layer normalization and dropout to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(out,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consists of a multi-head (self) attention followed by a
position-wise feed-forward network, and both components are wrapped by
post_process_layer to add the residual connection, layer normalization
and dropout.
"""
attn_output = multi_head_attention(pre_process_layer(enc_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(enc_input,
attn_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
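For reference, a minimal sketch of how this `encoder` stack might be driven with BERT-base-sized hyper-parameters; the concrete sizes, dropout rates and the "gelu" activation here are illustrative assumptions, not values read from this commit:

import paddle.fluid as fluid

# `encoder` is the function defined in the file above.
d_model, n_head, seq_len = 768, 12, 128
emb = fluid.data(name="emb", shape=[-1, seq_len, d_model], dtype="float32")
bias = fluid.data(name="bias", shape=[-1, n_head, seq_len, seq_len], dtype="float32")
enc_out = encoder(enc_input=emb,
                  attn_bias=bias,
                  n_layer=12,
                  n_head=n_head,
                  d_key=d_model // n_head,      # 64 per head
                  d_value=d_model // n_head,
                  d_model=d_model,
                  d_inner_hid=4 * d_model,      # 3072-wide feed-forward
                  prepostprocess_dropout=0.1,
                  attention_dropout=0.1,
                  relu_dropout=0.1,
                  hidden_act="gelu",
                  name="encoder")
# enc_out: [batch, seq_len, d_model] token-level representations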
@@ -58,8 +58,7 @@ class Bert(TransformerModule):
             pooled_output (tensor): sentence-level output for classification task.
             sequence_output (tensor): token-level output for sequence task.
         """
-        bert = BertModel(
-            src_ids=input_ids,
+        bert = BertModel(src_ids=input_ids,
             position_ids=position_ids,
             sentence_ids=segment_ids,
             input_mask=input_mask,
......
@@ -74,23 +74,23 @@ class BertModel(object):
     def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
         # padding id in vocabulary must be set to 0
-        emb_out = fluid.layers.embedding(
-            input=src_ids,
+        emb_out = fluid.layers.embedding(input=src_ids,
             size=[self._voc_size, self._emb_size],
             dtype=self._dtype,
-            param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer),
+            param_attr=fluid.ParamAttr(name=self._word_emb_name,
+                                       initializer=self._param_initializer),
             is_sparse=False)
-        position_emb_out = fluid.layers.embedding(
-            input=position_ids,
+        position_emb_out = fluid.layers.embedding(input=position_ids,
             size=[self._max_position_seq_len, self._emb_size],
             dtype=self._dtype,
-            param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer))
+            param_attr=fluid.ParamAttr(name=self._pos_emb_name,
+                                       initializer=self._param_initializer))
-        sent_emb_out = fluid.layers.embedding(
-            sentence_ids,
+        sent_emb_out = fluid.layers.embedding(sentence_ids,
             size=[self._sent_types, self._emb_size],
             dtype=self._dtype,
-            param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer))
+            param_attr=fluid.ParamAttr(name=self._sent_emb_name,
+                                       initializer=self._param_initializer))
         emb_out = emb_out + position_emb_out
         emb_out = emb_out + sent_emb_out
@@ -105,8 +105,7 @@ class BertModel(object):
         n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
         n_head_self_attn_mask.stop_gradient = True
-        self._enc_out = encoder(
-            enc_input=emb_out,
+        self._enc_out = encoder(enc_input=emb_out,
             attn_bias=n_head_self_attn_mask,
             n_layer=self._n_layer,
             n_head=self._n_head,
@@ -130,11 +129,11 @@ class BertModel(object):
         """Get the first feature of each sequence for classification"""
         next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
-        next_sent_feat = fluid.layers.fc(
-            input=next_sent_feat,
+        next_sent_feat = fluid.layers.fc(input=next_sent_feat,
             size=self._emb_size,
             act="tanh",
-            param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer),
+            param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
+                                       initializer=self._param_initializer),
             bias_attr="pooled_fc.b_0")
         return next_sent_feat
@@ -150,43 +149,45 @@ class BertModel(object):
         mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
         # transform: fc
-        mask_trans_feat = fluid.layers.fc(
-            input=mask_feat,
+        mask_trans_feat = fluid.layers.fc(input=mask_feat,
             size=self._emb_size,
             act=self._hidden_act,
-            param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer),
+            param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
+                                       initializer=self._param_initializer),
             bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
         # transform: layer norm
         mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
-        mask_lm_out_bias_attr = fluid.ParamAttr(
-            name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0))
+        mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
+                                                initializer=fluid.initializer.Constant(value=0.0))
         if self._weight_sharing:
-            fc_out = fluid.layers.matmul(
-                x=mask_trans_feat,
+            fc_out = fluid.layers.matmul(x=mask_trans_feat,
                 y=fluid.default_main_program().global_block().var(self._word_emb_name),
                 transpose_y=True)
-            fc_out += fluid.layers.create_parameter(
-                shape=[self._voc_size], dtype=self._dtype, attr=mask_lm_out_bias_attr, is_bias=True)
+            fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
+                                                    dtype=self._dtype,
+                                                    attr=mask_lm_out_bias_attr,
+                                                    is_bias=True)
         else:
-            fc_out = fluid.layers.fc(
-                input=mask_trans_feat,
+            fc_out = fluid.layers.fc(input=mask_trans_feat,
                 size=self._voc_size,
-                param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer),
+                param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
+                                           initializer=self._param_initializer),
                 bias_attr=mask_lm_out_bias_attr)
         mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
         mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
-        next_sent_fc_out = fluid.layers.fc(
-            input=next_sent_feat,
+        next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
             size=2,
-            param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", initializer=self._param_initializer),
+            param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
+                                       initializer=self._param_initializer),
             bias_attr="next_sent_fc.b_0")
-        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(
-            logits=next_sent_fc_out, label=labels, return_softmax=True)
+        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
+                                                                                    label=labels,
+                                                                                    return_softmax=True)
        next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
......
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
computing the softmax activation to mask certain selected positions so that
they will not be considered in the attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of input tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
# permute the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of input tensor x
so that they become one dimension, which is the reverse of __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(hidden,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.fc(input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
"""
Add residual connection, layer normalization and dropout to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(out,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consists of a multi-head (self) attention followed by a
position-wise feed-forward network, and both components are wrapped by
post_process_layer to add the residual connection, layer normalization
and dropout.
"""
attn_output = multi_head_attention(pre_process_layer(enc_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(enc_input,
attn_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
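As a quick sanity check of the reshape/transpose pair in `__split_heads` / `__combine_heads` above, here is an equivalent numpy round trip (a standalone illustration; the array sizes are arbitrary):

import numpy as np

bs, seq_len, n_head, d_head = 2, 4, 3, 5
x = np.random.rand(bs, seq_len, n_head * d_head).astype("float32")

split = x.reshape(bs, seq_len, n_head, d_head).transpose(0, 2, 1, 3)        # [bs, n_head, seq, d_head]
merged = split.transpose(0, 2, 1, 3).reshape(bs, seq_len, n_head * d_head)  # back to [bs, seq, n_head*d_head]

assert np.allclose(x, merged)  # combining heads exactly undoes the split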
@@ -58,8 +58,7 @@ class Bert(TransformerModule):
             pooled_output (tensor): sentence-level output for classification task.
             sequence_output (tensor): token-level output for sequence task.
         """
-        bert = BertModel(
-            src_ids=input_ids,
+        bert = BertModel(src_ids=input_ids,
             position_ids=position_ids,
             sentence_ids=segment_ids,
             input_mask=input_mask,
......
@@ -74,23 +74,23 @@ class BertModel(object):
     def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
         # padding id in vocabulary must be set to 0
-        emb_out = fluid.layers.embedding(
-            input=src_ids,
+        emb_out = fluid.layers.embedding(input=src_ids,
             size=[self._voc_size, self._emb_size],
             dtype=self._dtype,
-            param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer),
+            param_attr=fluid.ParamAttr(name=self._word_emb_name,
+                                       initializer=self._param_initializer),
             is_sparse=False)
-        position_emb_out = fluid.layers.embedding(
-            input=position_ids,
+        position_emb_out = fluid.layers.embedding(input=position_ids,
             size=[self._max_position_seq_len, self._emb_size],
             dtype=self._dtype,
-            param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer))
+            param_attr=fluid.ParamAttr(name=self._pos_emb_name,
+                                       initializer=self._param_initializer))
-        sent_emb_out = fluid.layers.embedding(
-            sentence_ids,
+        sent_emb_out = fluid.layers.embedding(sentence_ids,
             size=[self._sent_types, self._emb_size],
             dtype=self._dtype,
-            param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer))
+            param_attr=fluid.ParamAttr(name=self._sent_emb_name,
+                                       initializer=self._param_initializer))
         emb_out = emb_out + position_emb_out
         emb_out = emb_out + sent_emb_out
@@ -105,8 +105,7 @@ class BertModel(object):
         n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
         n_head_self_attn_mask.stop_gradient = True
-        self._enc_out = encoder(
-            enc_input=emb_out,
+        self._enc_out = encoder(enc_input=emb_out,
             attn_bias=n_head_self_attn_mask,
             n_layer=self._n_layer,
             n_head=self._n_head,
@@ -130,11 +129,11 @@ class BertModel(object):
         """Get the first feature of each sequence for classification"""
         next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
-        next_sent_feat = fluid.layers.fc(
-            input=next_sent_feat,
+        next_sent_feat = fluid.layers.fc(input=next_sent_feat,
             size=self._emb_size,
             act="tanh",
-            param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer),
+            param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
+                                       initializer=self._param_initializer),
             bias_attr="pooled_fc.b_0")
         return next_sent_feat
@@ -150,43 +149,45 @@ class BertModel(object):
         mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
         # transform: fc
-        mask_trans_feat = fluid.layers.fc(
-            input=mask_feat,
+        mask_trans_feat = fluid.layers.fc(input=mask_feat,
             size=self._emb_size,
             act=self._hidden_act,
-            param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer),
+            param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
+                                       initializer=self._param_initializer),
             bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
         # transform: layer norm
         mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
-        mask_lm_out_bias_attr = fluid.ParamAttr(
-            name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0))
+        mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
+                                                initializer=fluid.initializer.Constant(value=0.0))
         if self._weight_sharing:
-            fc_out = fluid.layers.matmul(
-                x=mask_trans_feat,
+            fc_out = fluid.layers.matmul(x=mask_trans_feat,
                 y=fluid.default_main_program().global_block().var(self._word_emb_name),
                 transpose_y=True)
-            fc_out += fluid.layers.create_parameter(
-                shape=[self._voc_size], dtype=self._dtype, attr=mask_lm_out_bias_attr, is_bias=True)
+            fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
+                                                    dtype=self._dtype,
+                                                    attr=mask_lm_out_bias_attr,
+                                                    is_bias=True)
         else:
-            fc_out = fluid.layers.fc(
-                input=mask_trans_feat,
+            fc_out = fluid.layers.fc(input=mask_trans_feat,
                 size=self._voc_size,
-                param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer),
+                param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
+                                           initializer=self._param_initializer),
                 bias_attr=mask_lm_out_bias_attr)
         mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
         mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
-        next_sent_fc_out = fluid.layers.fc(
-            input=next_sent_feat,
+        next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
             size=2,
-            param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", initializer=self._param_initializer),
+            param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
+                                       initializer=self._param_initializer),
             bias_attr="next_sent_fc.b_0")
-        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(
-            logits=next_sent_fc_out, label=labels, return_softmax=True)
+        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
+                                                                                    label=labels,
+                                                                                    return_softmax=True)
        next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
......
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
computing the softmax activation to mask certain selected positions so that
they will not be considered in the attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of input tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
# permute the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of input tensor x
so that they become one dimension, which is the reverse of __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(hidden,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.fc(input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
"""
Add residual connection, layer normalization and dropout to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(out,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consists of a multi-head (self) attention followed by a
position-wise feed-forward network, and both components are wrapped by
post_process_layer to add the residual connection, layer normalization
and dropout.
"""
attn_output = multi_head_attention(pre_process_layer(enc_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(enc_input,
attn_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
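The `preprocess_cmd` / `postprocess_cmd` strings consumed by pre_post_process_layer above are small per-character programs; the mapping below is only an illustration of how the letters are interpreted, not part of the library:

# What each letter in a process_cmd string asks pre_post_process_layer to do:
CMD = {
    "a": "add the residual (out = out + prev_out)",
    "n": "apply layer normalization over the last dimension",
    "d": "apply dropout (upscale_in_train)",
}
for letter in "da":            # e.g. the default postprocess_cmd: dropout, then residual add
    print(letter, "->", CMD[letter])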
@@ -58,8 +58,7 @@ class Bert(TransformerModule):
             pooled_output (tensor): sentence-level output for classification task.
             sequence_output (tensor): token-level output for sequence task.
         """
-        bert = BertModel(
-            src_ids=input_ids,
+        bert = BertModel(src_ids=input_ids,
             position_ids=position_ids,
             sentence_ids=segment_ids,
             input_mask=input_mask,
......
@@ -74,23 +74,23 @@ class BertModel(object):
     def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
         # padding id in vocabulary must be set to 0
-        emb_out = fluid.layers.embedding(
-            input=src_ids,
+        emb_out = fluid.layers.embedding(input=src_ids,
             size=[self._voc_size, self._emb_size],
             dtype=self._dtype,
-            param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer),
+            param_attr=fluid.ParamAttr(name=self._word_emb_name,
+                                       initializer=self._param_initializer),
             is_sparse=False)
-        position_emb_out = fluid.layers.embedding(
-            input=position_ids,
+        position_emb_out = fluid.layers.embedding(input=position_ids,
             size=[self._max_position_seq_len, self._emb_size],
             dtype=self._dtype,
-            param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer))
+            param_attr=fluid.ParamAttr(name=self._pos_emb_name,
+                                       initializer=self._param_initializer))
-        sent_emb_out = fluid.layers.embedding(
-            sentence_ids,
+        sent_emb_out = fluid.layers.embedding(sentence_ids,
             size=[self._sent_types, self._emb_size],
             dtype=self._dtype,
-            param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer))
+            param_attr=fluid.ParamAttr(name=self._sent_emb_name,
+                                       initializer=self._param_initializer))
         emb_out = emb_out + position_emb_out
         emb_out = emb_out + sent_emb_out
@@ -105,8 +105,7 @@ class BertModel(object):
         n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
         n_head_self_attn_mask.stop_gradient = True
-        self._enc_out = encoder(
-            enc_input=emb_out,
+        self._enc_out = encoder(enc_input=emb_out,
             attn_bias=n_head_self_attn_mask,
             n_layer=self._n_layer,
             n_head=self._n_head,
@@ -130,11 +129,11 @@ class BertModel(object):
         """Get the first feature of each sequence for classification"""
         next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
-        next_sent_feat = fluid.layers.fc(
-            input=next_sent_feat,
+        next_sent_feat = fluid.layers.fc(input=next_sent_feat,
             size=self._emb_size,
             act="tanh",
-            param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer),
+            param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
+                                       initializer=self._param_initializer),
             bias_attr="pooled_fc.b_0")
         return next_sent_feat
@@ -150,43 +149,45 @@ class BertModel(object):
         mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
         # transform: fc
-        mask_trans_feat = fluid.layers.fc(
-            input=mask_feat,
+        mask_trans_feat = fluid.layers.fc(input=mask_feat,
             size=self._emb_size,
             act=self._hidden_act,
-            param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer),
+            param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
+                                       initializer=self._param_initializer),
             bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
         # transform: layer norm
         mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
-        mask_lm_out_bias_attr = fluid.ParamAttr(
-            name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0))
+        mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
+                                                initializer=fluid.initializer.Constant(value=0.0))
         if self._weight_sharing:
-            fc_out = fluid.layers.matmul(
-                x=mask_trans_feat,
+            fc_out = fluid.layers.matmul(x=mask_trans_feat,
                 y=fluid.default_main_program().global_block().var(self._word_emb_name),
                 transpose_y=True)
-            fc_out += fluid.layers.create_parameter(
-                shape=[self._voc_size], dtype=self._dtype, attr=mask_lm_out_bias_attr, is_bias=True)
+            fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
+                                                    dtype=self._dtype,
+                                                    attr=mask_lm_out_bias_attr,
+                                                    is_bias=True)
         else:
-            fc_out = fluid.layers.fc(
-                input=mask_trans_feat,
+            fc_out = fluid.layers.fc(input=mask_trans_feat,
                 size=self._voc_size,
-                param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer),
+                param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
+                                           initializer=self._param_initializer),
                 bias_attr=mask_lm_out_bias_attr)
         mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
         mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
-        next_sent_fc_out = fluid.layers.fc(
-            input=next_sent_feat,
+        next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
             size=2,
-            param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", initializer=self._param_initializer),
+            param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
+                                       initializer=self._param_initializer),
             bias_attr="next_sent_fc.b_0")
-        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(
-            logits=next_sent_fc_out, label=labels, return_softmax=True)
+        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
                                                                                     label=labels,
+                                                                                    return_softmax=True)
        next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
......
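The `_weight_sharing` branch in the hunk above reuses the word-embedding table as the masked-LM output projection. In numpy terms the logits computation looks roughly like this (the sizes are made up for illustration):

import numpy as np

voc_size, emb_size, n_masked = 6, 4, 3
word_emb = np.random.rand(voc_size, emb_size).astype("float32")         # the embedding lookup table
mask_trans_feat = np.random.rand(n_masked, emb_size).astype("float32")  # features at the masked positions

fc_out = mask_trans_feat @ word_emb.T           # matmul(..., transpose_y=True): [n_masked, voc_size] logits
fc_out += np.zeros(voc_size, dtype="float32")   # the learnable mask_lm_out_fc.b_0 bias, zero-initialized
print(fc_out.shape)                             # (3, 6)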
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
computing the softmax activation to mask certain selected positions so that
they will not be considered in the attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of input tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
# permute the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of input tensor x
so that they become one dimension, which is the reverse of __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(hidden,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.fc(input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
"""
Add residual connection, layer normalization and dropout to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(out,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consists of a multi-head (self) attention sub-layer followed by
a position-wise feed-forward network, with both components accompanied
by post_process_layer to add residual connection, layer normalization
and dropout.
"""
attn_output = multi_head_attention(pre_process_layer(enc_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(enc_input,
attn_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
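The scaled_dot_product_attention helper above computes softmax(q·k^T / sqrt(d_key))·v with an optional additive bias. Below is a minimal NumPy sketch of that computation for reference; the shapes, names, and sample values are illustrative only and are not part of the module.

# NumPy sketch (illustrative only) of the math in scaled_dot_product_attention:
# scale q, take q·k^T, add the optional additive bias, softmax over the last
# axis, then weight v.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sdpa_np(q, k, v, attn_bias=None, d_key=64):
    # q, k, v: [batch, n_head, seq_len, d_key], matching the __split_heads layout
    product = np.matmul(q * d_key ** -0.5, np.swapaxes(k, -1, -2))
    if attn_bias is not None:
        product = product + attn_bias  # a large negative bias masks positions
    weights = softmax(product, axis=-1)
    return np.matmul(weights, v)

q = k = v = np.random.rand(2, 4, 8, 64).astype("float32")
print(sdpa_np(q, k, v, d_key=64).shape)  # (2, 4, 8, 64)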
@@ -58,8 +58,7 @@ class BertWwm(TransformerModule):
            pooled_output (tensor): sentence-level output for classification task.
            sequence_output (tensor): token-level output for sequence task.
        """
        bert = BertModel(src_ids=input_ids,
                         position_ids=position_ids,
                         sentence_ids=segment_ids,
                         input_mask=input_mask,
......
@@ -74,23 +74,23 @@ class BertModel(object):
    def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
        # padding id in vocabulary must be set to 0
        emb_out = fluid.layers.embedding(input=src_ids,
                                         size=[self._voc_size, self._emb_size],
                                         dtype=self._dtype,
                                         param_attr=fluid.ParamAttr(name=self._word_emb_name,
                                                                    initializer=self._param_initializer),
                                         is_sparse=False)
        position_emb_out = fluid.layers.embedding(input=position_ids,
                                                  size=[self._max_position_seq_len, self._emb_size],
                                                  dtype=self._dtype,
                                                  param_attr=fluid.ParamAttr(name=self._pos_emb_name,
                                                                             initializer=self._param_initializer))
        sent_emb_out = fluid.layers.embedding(sentence_ids,
                                              size=[self._sent_types, self._emb_size],
                                              dtype=self._dtype,
                                              param_attr=fluid.ParamAttr(name=self._sent_emb_name,
                                                                         initializer=self._param_initializer))
        emb_out = emb_out + position_emb_out
        emb_out = emb_out + sent_emb_out
@@ -105,8 +105,7 @@ class BertModel(object):
        n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
        n_head_self_attn_mask.stop_gradient = True
        self._enc_out = encoder(enc_input=emb_out,
                                attn_bias=n_head_self_attn_mask,
                                n_layer=self._n_layer,
                                n_head=self._n_head,
@@ -130,11 +129,11 @@ class BertModel(object):
        """Get the first feature of each sequence for classification"""
        next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
        next_sent_feat = fluid.layers.fc(input=next_sent_feat,
                                         size=self._emb_size,
                                         act="tanh",
                                         param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
                                                                    initializer=self._param_initializer),
                                         bias_attr="pooled_fc.b_0")
        return next_sent_feat
@@ -150,43 +149,45 @@ class BertModel(object):
        mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
        # transform: fc
        mask_trans_feat = fluid.layers.fc(input=mask_feat,
                                          size=self._emb_size,
                                          act=self._hidden_act,
                                          param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
                                                                     initializer=self._param_initializer),
                                          bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
        # transform: layer norm
        mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
        mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
                                                initializer=fluid.initializer.Constant(value=0.0))
        if self._weight_sharing:
            fc_out = fluid.layers.matmul(x=mask_trans_feat,
                                         y=fluid.default_main_program().global_block().var(self._word_emb_name),
                                         transpose_y=True)
            fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
                                                    dtype=self._dtype,
                                                    attr=mask_lm_out_bias_attr,
                                                    is_bias=True)
        else:
            fc_out = fluid.layers.fc(input=mask_trans_feat,
                                     size=self._voc_size,
                                     param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
                                                                initializer=self._param_initializer),
                                     bias_attr=mask_lm_out_bias_attr)
        mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
        mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
        next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
                                           size=2,
                                           param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
                                                                      initializer=self._param_initializer),
                                           bias_attr="next_sent_fc.b_0")
        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
                                                                                    label=labels,
                                                                                    return_softmax=True)
        next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
......
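In the _weight_sharing branch above, the masked-token features are projected onto the transposed word-embedding table instead of a separately learned output matrix, plus a per-vocabulary-entry bias. A hedged NumPy sketch of that projection follows; the sizes are made up for illustration.

# Illustrative NumPy sketch of the weight-shared masked-LM output projection:
# vocabulary logits come from the shared embedding table, transposed.
import numpy as np

voc_size, emb_size, n_masked = 1000, 128, 5             # hypothetical sizes
word_emb = np.random.randn(voc_size, emb_size)          # shared embedding table
mask_trans_feat = np.random.randn(n_masked, emb_size)   # transformed masked-token features
mask_lm_out_bias = np.zeros(voc_size)                   # output bias over the vocabulary

fc_out = mask_trans_feat @ word_emb.T + mask_lm_out_bias
print(fc_out.shape)  # (5, 1000): one row of logits per masked position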
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logits before
computing the softmax activation, to mask certain selected positions so
that they will not be considered in the attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError("Inputs: queries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of input tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
# permute the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of input tensor x
so that it becomes one dimension, which is the reverse of __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(hidden,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.fc(input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
"""
Add residual connection, layer normalization and dropout to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(out,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consists of a multi-head (self) attention sub-layer followed by
a position-wise feed-forward network, with both components accompanied
by post_process_layer to add residual connection, layer normalization
and dropout.
"""
attn_output = multi_head_attention(pre_process_layer(enc_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(enc_input,
attn_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
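pre_process_layer and post_process_layer above interpret a small command string: "a" adds the residual, "n" applies layer normalization, "d" applies dropout, so the defaults preprocess_cmd="n" and postprocess_cmd="da" give pre-LayerNorm sub-layers. Below is a plain-Python sketch of that dispatch, with NumPy stand-ins for the fluid ops; names and shapes are illustrative only.

# Illustrative stand-in for pre_post_process_layer's "a"/"n"/"d" command string.
import numpy as np

def layer_norm_np(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def process(prev_out, out, cmds, dropout_rate=0.0):
    for cmd in cmds:
        if cmd == "a" and prev_out is not None:   # residual connection
            out = out + prev_out
        elif cmd == "n":                          # layer normalization
            out = layer_norm_np(out)
        elif cmd == "d" and dropout_rate:         # inverted ("upscale_in_train") dropout
            mask = (np.random.rand(*out.shape) >= dropout_rate)
            out = out * mask / (1.0 - dropout_rate)
    return out

x = np.random.randn(2, 8, 16)
sublayer_out = np.random.randn(2, 8, 16)
y = process(x, sublayer_out, "da", dropout_rate=0.1)  # postprocess_cmd="da": dropout, then residual
z = process(None, y, "n")                             # preprocess_cmd="n": layer norm before the next sub-layer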
@@ -58,8 +58,7 @@ class BertWwm(TransformerModule):
            pooled_output (tensor): sentence-level output for classification task.
            sequence_output (tensor): token-level output for sequence task.
        """
        bert = BertModel(src_ids=input_ids,
                         position_ids=position_ids,
                         sentence_ids=segment_ids,
                         input_mask=input_mask,
......
@@ -74,23 +74,23 @@ class ElectraModel(object):
    def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
        # padding id in vocabulary must be set to 0
        emb_out = fluid.layers.embedding(input=src_ids,
                                         size=[self._voc_size, self._emb_size],
                                         dtype=self._dtype,
                                         param_attr=fluid.ParamAttr(name=self._word_emb_name,
                                                                    initializer=self._param_initializer),
                                         is_sparse=False)
        position_emb_out = fluid.layers.embedding(input=position_ids,
                                                  size=[self._max_position_seq_len, self._emb_size],
                                                  dtype=self._dtype,
                                                  param_attr=fluid.ParamAttr(name=self._pos_emb_name,
                                                                             initializer=self._param_initializer))
        sent_emb_out = fluid.layers.embedding(sentence_ids,
                                              size=[self._sent_types, self._emb_size],
                                              dtype=self._dtype,
                                              param_attr=fluid.ParamAttr(name=self._sent_emb_name,
                                                                         initializer=self._param_initializer))
        emb_out = emb_out + position_emb_out
        emb_out = emb_out + sent_emb_out
@@ -105,8 +105,7 @@ class ElectraModel(object):
        n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
        n_head_self_attn_mask.stop_gradient = True
        self._enc_out = encoder(enc_input=emb_out,
                                attn_bias=n_head_self_attn_mask,
                                n_layer=self._n_layer,
                                n_head=self._n_head,
@@ -143,43 +142,45 @@ class ElectraModel(object):
        mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
        # transform: fc
        mask_trans_feat = fluid.layers.fc(input=mask_feat,
                                          size=self._emb_size,
                                          act=self._hidden_act,
                                          param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
                                                                     initializer=self._param_initializer),
                                          bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
        # transform: layer norm
        mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
        mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
                                                initializer=fluid.initializer.Constant(value=0.0))
        if self._weight_sharing:
            fc_out = fluid.layers.matmul(x=mask_trans_feat,
                                         y=fluid.default_main_program().global_block().var(self._word_emb_name),
                                         transpose_y=True)
            fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
                                                    dtype=self._dtype,
                                                    attr=mask_lm_out_bias_attr,
                                                    is_bias=True)
        else:
            fc_out = fluid.layers.fc(input=mask_trans_feat,
                                     size=self._voc_size,
                                     param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
                                                                initializer=self._param_initializer),
                                     bias_attr=mask_lm_out_bias_attr)
        mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
        mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
        next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
                                           size=2,
                                           param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
                                                                      initializer=self._param_initializer),
                                           bias_attr="next_sent_fc.b_0")
        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
                                                                                    label=labels,
                                                                                    return_softmax=True)
        next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
......
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logits before
computing the softmax activation, to mask certain selected positions so
that they will not be considered in the attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError("Inputs: queries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of input tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
# permute the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of input tensor x
so that it becomes one dimension, which is the reverse of __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(hidden,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.fc(input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
"""
Add residual connection, layer normalization and dropout to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(out,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consists of a multi-head (self) attention sub-layer followed by
a position-wise feed-forward network, with both components accompanied
by post_process_layer to add residual connection, layer normalization
and dropout.
"""
attn_output = multi_head_attention(pre_process_layer(enc_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(enc_input,
attn_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
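__split_heads and __combine_heads above are exact inverses: a reshape that exposes the head dimension followed by a transpose, and the same transpose and reshape applied in reverse. A NumPy sketch of the round trip, with made-up sizes, follows.

# Illustrative NumPy round trip for __split_heads / __combine_heads.
import numpy as np

bs, seq_len, n_head, d_head = 2, 8, 4, 16
x = np.random.randn(bs, seq_len, n_head * d_head)

# __split_heads: [bs, seq, n_head * d_head] -> [bs, n_head, seq, d_head]
split = x.reshape(bs, seq_len, n_head, d_head).transpose(0, 2, 1, 3)

# __combine_heads: transpose back, then merge the head dimensions again
combined = split.transpose(0, 2, 1, 3).reshape(bs, seq_len, n_head * d_head)

assert np.array_equal(x, combined)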
@@ -58,8 +58,7 @@ class Electra(TransformerModule):
            pooled_output (tensor): sentence-level output for classification task.
            sequence_output (tensor): token-level output for sequence task.
        """
        electra = ElectraModel(src_ids=input_ids,
                               position_ids=position_ids,
                               sentence_ids=segment_ids,
                               input_mask=input_mask,
......
@@ -75,23 +75,23 @@ class ElectraModel(object):
    def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
        # padding id in vocabulary must be set to 0
        emb_out = fluid.layers.embedding(input=src_ids,
                                         size=[self._voc_size, self._emb_size],
                                         dtype=self._dtype,
                                         param_attr=fluid.ParamAttr(name=self._word_emb_name,
                                                                    initializer=self._param_initializer),
                                         is_sparse=False)
        position_emb_out = fluid.layers.embedding(input=position_ids,
                                                  size=[self._max_position_seq_len, self._emb_size],
                                                  dtype=self._dtype,
                                                  param_attr=fluid.ParamAttr(name=self._pos_emb_name,
                                                                             initializer=self._param_initializer))
        sent_emb_out = fluid.layers.embedding(sentence_ids,
                                              size=[self._sent_types, self._emb_size],
                                              dtype=self._dtype,
                                              param_attr=fluid.ParamAttr(name=self._sent_emb_name,
                                                                         initializer=self._param_initializer))
        emb_out = emb_out + position_emb_out
        emb_out = emb_out + sent_emb_out
@@ -99,11 +99,11 @@ class ElectraModel(object):
        emb_out = pre_process_layer(emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
        if self._emb_size != self._hidden_size:
            emb_out = fluid.layers.fc(input=emb_out,
                                      size=self._hidden_size,
                                      act=None,
                                      param_attr=fluid.ParamAttr(name="embeddings_project.w_0",
                                                                 initializer=self._param_initializer),
                                      num_flatten_dims=2,
                                      bias_attr="embeddings_project.b_0")
@@ -115,8 +115,7 @@ class ElectraModel(object):
        n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
        n_head_self_attn_mask.stop_gradient = True
        self._enc_out = encoder(enc_input=emb_out,
                                attn_bias=n_head_self_attn_mask,
                                n_layer=self._n_layer,
                                n_head=self._n_head,
@@ -153,43 +152,45 @@ class ElectraModel(object):
        mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
        # transform: fc
        mask_trans_feat = fluid.layers.fc(input=mask_feat,
                                          size=self._hidden_size,
                                          act=self._hidden_act,
                                          param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
                                                                     initializer=self._param_initializer),
                                          bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
        # transform: layer norm
        mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
        mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
                                                initializer=fluid.initializer.Constant(value=0.0))
        if self._weight_sharing:
            fc_out = fluid.layers.matmul(x=mask_trans_feat,
                                         y=fluid.default_main_program().global_block().var(self._word_emb_name),
                                         transpose_y=True)
            fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
                                                    dtype=self._dtype,
                                                    attr=mask_lm_out_bias_attr,
                                                    is_bias=True)
        else:
            fc_out = fluid.layers.fc(input=mask_trans_feat,
                                     size=self._voc_size,
                                     param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
                                                                initializer=self._param_initializer),
                                     bias_attr=mask_lm_out_bias_attr)
        mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
        mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
        next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
                                           size=2,
                                           param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
                                                                      initializer=self._param_initializer),
                                           bias_attr="next_sent_fc.b_0")
        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
                                                                                    label=labels,
                                                                                    return_softmax=True)
        next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
......
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logits before
computing the softmax activation, to mask certain selected positions so
that they will not be considered in the attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError("Inputs: queries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of input tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
# permute the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of input tensor x
so that it becomes one dimension, which is the reverse of __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(hidden,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.fc(input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
"""
Add residual connection, layer normalization and dropout to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(out,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consists of a multi-head (self) attention sub-layer followed by
a position-wise feed-forward network, with both components accompanied
by post_process_layer to add residual connection, layer normalization
and dropout.
"""
attn_output = multi_head_attention(pre_process_layer(enc_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(enc_input,
attn_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
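The cache branch of multi_head_attention above (not exercised by the BERT-style encoders here, but kept for incremental decoding) concatenates the current step's key/value projections onto the cached ones along the sequence axis before splitting heads. A NumPy sketch of that bookkeeping, with hypothetical sizes, follows.

# Illustrative NumPy sketch of the key/value cache concatenation over time steps.
import numpy as np

d_model = 16
cache = {"k": np.zeros((1, 0, d_model)), "v": np.zeros((1, 0, d_model))}

for step in range(3):
    k_step = np.random.randn(1, 1, d_model)  # key projection for the current token
    v_step = np.random.randn(1, 1, d_model)  # value projection for the current token
    cache["k"] = np.concatenate([cache["k"], k_step], axis=1)
    cache["v"] = np.concatenate([cache["v"], v_step], axis=1)

print(cache["k"].shape)  # (1, 3, 16): keys for every step decoded so far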
...@@ -58,8 +58,7 @@ class Electra(TransformerModule): ...@@ -58,8 +58,7 @@ class Electra(TransformerModule):
pooled_output (tensor): sentence-level output for classification task. pooled_output (tensor): sentence-level output for classification task.
sequence_output (tensor): token-level output for sequence task. sequence_output (tensor): token-level output for sequence task.
""" """
electra = ElectraModel( electra = ElectraModel(src_ids=input_ids,
src_ids=input_ids,
position_ids=position_ids, position_ids=position_ids,
sentence_ids=segment_ids, sentence_ids=segment_ids,
input_mask=input_mask, input_mask=input_mask,
......
...@@ -74,23 +74,23 @@ class BertModel(object): ...@@ -74,23 +74,23 @@ class BertModel(object):
def _build_model(self, src_ids, position_ids, sentence_ids, input_mask): def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
# padding id in vocabulary must be set to 0 # padding id in vocabulary must be set to 0
emb_out = fluid.layers.embedding( emb_out = fluid.layers.embedding(input=src_ids,
input=src_ids,
size=[self._voc_size, self._emb_size], size=[self._voc_size, self._emb_size],
dtype=self._dtype, dtype=self._dtype,
param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), param_attr=fluid.ParamAttr(name=self._word_emb_name,
initializer=self._param_initializer),
is_sparse=False) is_sparse=False)
position_emb_out = fluid.layers.embedding( position_emb_out = fluid.layers.embedding(input=position_ids,
input=position_ids,
size=[self._max_position_seq_len, self._emb_size], size=[self._max_position_seq_len, self._emb_size],
dtype=self._dtype, dtype=self._dtype,
param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) param_attr=fluid.ParamAttr(name=self._pos_emb_name,
initializer=self._param_initializer))
sent_emb_out = fluid.layers.embedding( sent_emb_out = fluid.layers.embedding(sentence_ids,
sentence_ids,
size=[self._sent_types, self._emb_size], size=[self._sent_types, self._emb_size],
dtype=self._dtype, dtype=self._dtype,
param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) param_attr=fluid.ParamAttr(name=self._sent_emb_name,
initializer=self._param_initializer))
emb_out = emb_out + position_emb_out emb_out = emb_out + position_emb_out
emb_out = emb_out + sent_emb_out emb_out = emb_out + sent_emb_out
...@@ -105,8 +105,7 @@ class BertModel(object): ...@@ -105,8 +105,7 @@ class BertModel(object):
n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
n_head_self_attn_mask.stop_gradient = True n_head_self_attn_mask.stop_gradient = True
self._enc_out = encoder( self._enc_out = encoder(enc_input=emb_out,
enc_input=emb_out,
attn_bias=n_head_self_attn_mask, attn_bias=n_head_self_attn_mask,
n_layer=self._n_layer, n_layer=self._n_layer,
n_head=self._n_head, n_head=self._n_head,
...@@ -130,11 +129,11 @@ class BertModel(object): ...@@ -130,11 +129,11 @@ class BertModel(object):
"""Get the first feature of each sequence for classification""" """Get the first feature of each sequence for classification"""
next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
next_sent_feat = fluid.layers.fc( next_sent_feat = fluid.layers.fc(input=next_sent_feat,
input=next_sent_feat,
size=self._emb_size, size=self._emb_size,
act="tanh", act="tanh",
param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
initializer=self._param_initializer),
bias_attr="pooled_fc.b_0") bias_attr="pooled_fc.b_0")
return next_sent_feat return next_sent_feat
@@ -150,43 +149,45 @@ class BertModel(object):
mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
# transform: fc
mask_trans_feat = fluid.layers.fc(input=mask_feat,
                                  size=self._emb_size,
                                  act=self._hidden_act,
                                  param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
                                                             initializer=self._param_initializer),
                                  bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
# transform: layer norm
mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
                                        initializer=fluid.initializer.Constant(value=0.0))
if self._weight_sharing:
    fc_out = fluid.layers.matmul(x=mask_trans_feat,
                                 y=fluid.default_main_program().global_block().var(self._word_emb_name),
                                 transpose_y=True)
    fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
                                            dtype=self._dtype,
                                            attr=mask_lm_out_bias_attr,
                                            is_bias=True)
else:
    fc_out = fluid.layers.fc(input=mask_trans_feat,
                             size=self._voc_size,
                             param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
                                                        initializer=self._param_initializer),
                             bias_attr=mask_lm_out_bias_attr)
mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
                                   size=2,
                                   param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
                                                              initializer=self._param_initializer),
                                   bias_attr="next_sent_fc.b_0")
next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
                                                                             label=labels,
                                                                             return_softmax=True)
next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
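When `_weight_sharing` is enabled, the masked-LM head does not allocate its own `[hidden, vocab]` projection; it reuses the word-embedding table and computes the vocabulary logits as `mask_trans_feat · W_emb^T` plus a learned bias. A small NumPy sketch of that tying trick (shapes and values are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 30522, 768

word_emb = rng.standard_normal((vocab_size, hidden)).astype("float32")  # shared embedding table
mask_feat = rng.standard_normal((5, hidden)).astype("float32")          # 5 gathered masked positions
out_bias = np.zeros(vocab_size, dtype="float32")                        # mask_lm_out_fc.b_0

# Tied projection: [5, hidden] x [hidden, vocab] -> [5, vocab] logits over the vocabulary.
logits = mask_feat @ word_emb.T + out_bias
print(logits.shape)  # (5, 30522)
```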
......
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
computing the softmax activation to mask certain selected positions so that
they will not be considered in the attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of input tensor x so that it becomes two
dimensions and then transpose. Specifically, an input tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] is transformed into an output
tensor with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
# permute the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of input tensor x
so that they become one dimension, which is the reverse of __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
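`attn_bias` is an additive mask laid out like the attention logits (`[batch, n_head, seq, seq]`): positions that should be ignored carry a large negative value, so their weights vanish after the softmax. A sketch of how such a bias is typically built from a padding mask before calling `multi_head_attention` (the tensor names, sizes and the -10000 constant below are assumptions, not read from this file):

```python
import paddle.fluid as fluid

seq_len, d_model, n_head = 128, 768, 12

# 1.0 for real tokens, 0.0 for padding; assumed layout [batch, seq_len, 1].
input_mask = fluid.data(name="input_mask", shape=[-1, seq_len, 1], dtype="float32")
enc_input = fluid.data(name="enc_input", shape=[-1, seq_len, d_model], dtype="float32")

# Pairwise mask [batch, seq, seq]; pairs touching padding become -10000 after scaling.
pair_mask = fluid.layers.matmul(x=input_mask, y=input_mask, transpose_y=True)
attn_bias = fluid.layers.scale(x=pair_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
attn_bias = fluid.layers.stack(x=[attn_bias] * n_head, axis=1)  # one copy per head
attn_bias.stop_gradient = True

ctx = multi_head_attention(enc_input, None, None, attn_bias,
                           d_key=d_model // n_head, d_value=d_model // n_head,
                           d_model=d_model, n_head=n_head)
```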
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(hidden,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.fc(input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
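In equation form, the feed-forward block above computes, independently at each position,

$$\mathrm{FFN}(x) = \mathrm{act}(x W_1 + b_1)\,W_2 + b_2, \qquad W_1 \in \mathbb{R}^{d_{model} \times d_{inner}},\quad W_2 \in \mathbb{R}^{d_{inner} \times d_{model}},$$

with dropout optionally applied to the inner activation; `hidden_act` is "relu" in the original Transformer and "gelu" in BERT-style configurations.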
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
"""
Add residual connection, layer normalization and dropout to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(out,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
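`pre_post_process_layer` interprets `process_cmd` one character at a time: "a" adds the residual, "n" applies layer normalization, "d" applies dropout. With the defaults used in this file (`preprocess_cmd="n"`, `postprocess_cmd="da"`), each sub-layer is pre-norm: normalize the input, run attention or the FFN, then dropout the result and add it back to the un-normalized input. A short usage sketch reusing the helpers defined above (the tensor names and dropout rate are illustrative):

```python
import paddle.fluid as fluid

d_model = 768
x = fluid.data(name="x", shape=[-1, 128, d_model], dtype="float32")

# Pre-process "n": layer-normalize the sub-layer input.
normed = pre_process_layer(x, "n", 0.1, name="layer_0_pre_ffn")

# Any sub-layer fits here; the position-wise FFN defined above is used as an example.
ffn_out = positionwise_feed_forward(normed, 4 * d_model, d_model, 0.1, "gelu", name="layer_0_ffn")

# Post-process "da": dropout the sub-layer output, then add the residual x.
y = post_process_layer(x, ffn_out, "da", 0.1, name="layer_0_post_ffn")
```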
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consists of a multi-head (self-)attention sub-layer followed by
a position-wise feed-forward sub-layer, and both components are wrapped by
post_process_layer to add the residual connection, layer normalization
and dropout.
"""
attn_output = multi_head_attention(pre_process_layer(enc_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(enc_input,
attn_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
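Taken together, a BERT-base-sized stack would invoke `encoder` roughly as below. The hyperparameter values (12 layers, 12 heads, hidden size 768, inner size 3072, dropout 0.1, gelu) are the usual base-model settings, stated here as an assumption rather than read from this commit:

```python
import paddle.fluid as fluid

seq_len, d_model, n_head, n_layer = 128, 768, 12, 12

emb_out = fluid.data(name="emb_out", shape=[-1, seq_len, d_model], dtype="float32")
attn_bias = fluid.data(name="attn_bias", shape=[-1, n_head, seq_len, seq_len], dtype="float32")

enc_out = encoder(enc_input=emb_out,
                  attn_bias=attn_bias,
                  n_layer=n_layer,
                  n_head=n_head,
                  d_key=d_model // n_head,
                  d_value=d_model // n_head,
                  d_model=d_model,
                  d_inner_hid=4 * d_model,
                  prepostprocess_dropout=0.1,
                  attention_dropout=0.1,
                  relu_dropout=0.1,
                  hidden_act="gelu",
                  name="encoder")
# enc_out: [batch, seq_len, d_model]
```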
@@ -58,8 +58,7 @@ class BertWwm(TransformerModule):
    pooled_output (tensor): sentence-level output for classification task.
    sequence_output (tensor): token-level output for sequence task.
"""
bert = BertModel(src_ids=input_ids,
                 position_ids=position_ids,
                 sentence_ids=segment_ids,
                 input_mask=input_mask,
......
@@ -74,23 +74,23 @@ class BertModel(object):
def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
    # padding id in vocabulary must be set to 0
    emb_out = fluid.layers.embedding(input=src_ids,
                                     size=[self._voc_size, self._emb_size],
                                     dtype=self._dtype,
                                     param_attr=fluid.ParamAttr(name=self._word_emb_name,
                                                                initializer=self._param_initializer),
                                     is_sparse=False)
    position_emb_out = fluid.layers.embedding(input=position_ids,
                                              size=[self._max_position_seq_len, self._emb_size],
                                              dtype=self._dtype,
                                              param_attr=fluid.ParamAttr(name=self._pos_emb_name,
                                                                         initializer=self._param_initializer))
    sent_emb_out = fluid.layers.embedding(sentence_ids,
                                          size=[self._sent_types, self._emb_size],
                                          dtype=self._dtype,
                                          param_attr=fluid.ParamAttr(name=self._sent_emb_name,
                                                                     initializer=self._param_initializer))
    emb_out = emb_out + position_emb_out
    emb_out = emb_out + sent_emb_out
@@ -105,8 +105,7 @@ class BertModel(object):
n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
n_head_self_attn_mask.stop_gradient = True
self._enc_out = encoder(enc_input=emb_out,
                        attn_bias=n_head_self_attn_mask,
                        n_layer=self._n_layer,
                        n_head=self._n_head,
@@ -130,11 +129,11 @@ class BertModel(object):
"""Get the first feature of each sequence for classification"""
next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
next_sent_feat = fluid.layers.fc(input=next_sent_feat,
                                 size=self._emb_size,
                                 act="tanh",
                                 param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
                                                            initializer=self._param_initializer),
                                 bias_attr="pooled_fc.b_0")
return next_sent_feat
@@ -150,43 +149,45 @@ class BertModel(object):
mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
# transform: fc
mask_trans_feat = fluid.layers.fc(input=mask_feat,
                                  size=self._emb_size,
                                  act=self._hidden_act,
                                  param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
                                                             initializer=self._param_initializer),
                                  bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
# transform: layer norm
mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
                                        initializer=fluid.initializer.Constant(value=0.0))
if self._weight_sharing:
    fc_out = fluid.layers.matmul(x=mask_trans_feat,
                                 y=fluid.default_main_program().global_block().var(self._word_emb_name),
                                 transpose_y=True)
    fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
                                            dtype=self._dtype,
                                            attr=mask_lm_out_bias_attr,
                                            is_bias=True)
else:
    fc_out = fluid.layers.fc(input=mask_trans_feat,
                             size=self._voc_size,
                             param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
                                                        initializer=self._param_initializer),
                             bias_attr=mask_lm_out_bias_attr)
mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
                                   size=2,
                                   param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
                                                              initializer=self._param_initializer),
                                   bias_attr="next_sent_fc.b_0")
next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
                                                                             label=labels,
                                                                             return_softmax=True)
next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
......
@@ -58,8 +58,7 @@ class BertWwm(TransformerModule):
    pooled_output (tensor): sentence-level output for classification task.
    sequence_output (tensor): token-level output for sequence task.
"""
bert = BertModel(src_ids=input_ids,
                 position_ids=position_ids,
                 sentence_ids=segment_ids,
                 input_mask=input_mask,
......
@@ -78,23 +78,23 @@ class ErnieModel(object):
def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
    # padding id in vocabulary must be set to 0
    emb_out = fluid.layers.embedding(input=src_ids,
                                     size=[self._voc_size, self._emb_size],
                                     dtype=self._dtype,
                                     param_attr=fluid.ParamAttr(name=self._word_emb_name,
                                                                initializer=self._param_initializer),
                                     is_sparse=False)
    position_emb_out = fluid.layers.embedding(input=position_ids,
                                              size=[self._max_position_seq_len, self._emb_size],
                                              dtype=self._dtype,
                                              param_attr=fluid.ParamAttr(name=self._pos_emb_name,
                                                                         initializer=self._param_initializer))
    sent_emb_out = fluid.layers.embedding(sentence_ids,
                                          size=[self._sent_types, self._emb_size],
                                          dtype=self._dtype,
                                          param_attr=fluid.ParamAttr(name=self._sent_emb_name,
                                                                     initializer=self._param_initializer))
    emb_out = emb_out + position_emb_out
    emb_out = emb_out + sent_emb_out
@@ -109,8 +109,7 @@ class ErnieModel(object):
n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
n_head_self_attn_mask.stop_gradient = True
self._enc_out = encoder(enc_input=emb_out,
                        attn_bias=n_head_self_attn_mask,
                        n_layer=self._n_layer,
                        n_head=self._n_head,
@@ -133,11 +132,11 @@ class ErnieModel(object):
def get_pooled_output(self):
    """Get the first feature of each sequence for classification"""
    next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
    next_sent_feat = fluid.layers.fc(input=next_sent_feat,
                                     size=self._emb_size,
                                     act="tanh",
                                     param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
                                                                initializer=self._param_initializer),
                                     bias_attr="pooled_fc.b_0")
    return next_sent_feat
@@ -153,43 +152,45 @@ class ErnieModel(object):
mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
# transform: fc
mask_trans_feat = fluid.layers.fc(input=mask_feat,
                                  size=self._emb_size,
                                  act=self._hidden_act,
                                  param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
                                                             initializer=self._param_initializer),
                                  bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
# transform: layer norm
mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
                                        initializer=fluid.initializer.Constant(value=0.0))
if self._weight_sharing:
    fc_out = fluid.layers.matmul(x=mask_trans_feat,
                                 y=fluid.default_main_program().global_block().var(self._word_emb_name),
                                 transpose_y=True)
    fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
                                            dtype=self._dtype,
                                            attr=mask_lm_out_bias_attr,
                                            is_bias=True)
else:
    fc_out = fluid.layers.fc(input=mask_trans_feat,
                             size=self._voc_size,
                             param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
                                                        initializer=self._param_initializer),
                             bias_attr=mask_lm_out_bias_attr)
mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
                                   size=2,
                                   param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
                                                              initializer=self._param_initializer),
                                   bias_attr="next_sent_fc.b_0")
next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
                                                                             label=labels,
                                                                             return_softmax=True)
next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
......
@@ -58,8 +58,7 @@ class Ernie(TransformerModule):
    sequence_output (tensor): token-level output for sequence task.
"""
self.ernie_config._config_dict['use_task_id'] = False
ernie = ErnieModel(src_ids=input_ids,
                   position_ids=position_ids,
                   sentence_ids=segment_ids,
                   input_mask=input_mask,
......
@@ -95,34 +95,34 @@ class ErnieModel(object):
def _build_model(self, src_ids, position_ids, sentence_ids, task_ids, input_mask):
    # padding id in vocabulary must be set to 0
    emb_out = fluid.layers.embedding(input=src_ids,
                                     size=[self._voc_size, self._emb_size],
                                     dtype=self._emb_dtype,
                                     param_attr=fluid.ParamAttr(name=self._word_emb_name,
                                                                initializer=self._param_initializer),
                                     is_sparse=False)
    position_emb_out = fluid.layers.embedding(input=position_ids,
                                              size=[self._max_position_seq_len, self._emb_size],
                                              dtype=self._emb_dtype,
                                              param_attr=fluid.ParamAttr(name=self._pos_emb_name,
                                                                         initializer=self._param_initializer))
    sent_emb_out = fluid.layers.embedding(sentence_ids,
                                          size=[self._sent_types, self._emb_size],
                                          dtype=self._emb_dtype,
                                          param_attr=fluid.ParamAttr(name=self._sent_emb_name,
                                                                     initializer=self._param_initializer))
    emb_out = emb_out + position_emb_out
    emb_out = emb_out + sent_emb_out

    if self._use_task_id:
        task_emb_out = fluid.layers.embedding(task_ids,
                                              size=[self._task_types, self._emb_size],
                                              dtype=self._emb_dtype,
                                              param_attr=fluid.ParamAttr(name=self._task_emb_name,
                                                                         initializer=self._param_initializer))
        emb_out = emb_out + task_emb_out
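Compared with the BERT `_build_model` above, this ERNIE variant can add a fourth, task-type embedding when `use_task_id` is enabled, so the encoder input at position i is (written informally)

$$e_i = e^{word}_i + e^{pos}_i + e^{sent}_i \;[{}+ e^{task}_i],$$

where all lookup tables are created with `fluid.layers.embedding` and share the same `emb_size`.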
@@ -137,8 +137,7 @@ class ErnieModel(object):
n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
n_head_self_attn_mask.stop_gradient = True
self._enc_out = encoder(enc_input=emb_out,
                        attn_bias=n_head_self_attn_mask,
                        n_layer=self._n_layer,
                        n_head=self._n_head,
@@ -163,11 +162,11 @@ class ErnieModel(object):
def get_pooled_output(self):
    """Get the first feature of each sequence for classification"""
    next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
    next_sent_feat = fluid.layers.fc(input=next_sent_feat,
                                     size=self._emb_size,
                                     act="tanh",
                                     param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
                                                                initializer=self._param_initializer),
                                     bias_attr="pooled_fc.b_0")
    return next_sent_feat
@@ -183,39 +182,41 @@ class ErnieModel(object):
mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
# transform: fc
mask_trans_feat = fluid.layers.fc(input=mask_feat,
                                  size=self._emb_size,
                                  act=self._hidden_act,
                                  param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
                                                             initializer=self._param_initializer),
                                  bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
# transform: layer norm
mask_trans_feat = fluid.layers.layer_norm(mask_trans_feat,
                                          begin_norm_axis=len(mask_trans_feat.shape) - 1,
                                          param_attr=fluid.ParamAttr(
                                              name='mask_lm_trans_layer_norm_scale',
                                              initializer=fluid.initializer.Constant(1.)),
                                          bias_attr=fluid.ParamAttr(name='mask_lm_trans_layer_norm_bias',
                                                                    initializer=fluid.initializer.Constant(1.)))
# transform: layer norm
#mask_trans_feat = pre_process_layer(
#    mask_trans_feat, 'n', name='mask_lm_trans')
mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
                                        initializer=fluid.initializer.Constant(value=0.0))
if self._weight_sharing:
    fc_out = fluid.layers.matmul(x=mask_trans_feat,
                                 y=fluid.default_main_program().global_block().var(self._word_emb_name),
                                 transpose_y=True)
    fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
                                            dtype=self._emb_dtype,
                                            attr=mask_lm_out_bias_attr,
                                            is_bias=True)
else:
    fc_out = fluid.layers.fc(input=mask_trans_feat,
                             size=self._voc_size,
                             param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
                                                        initializer=self._param_initializer),
                             bias_attr=mask_lm_out_bias_attr)
mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
@@ -224,13 +225,14 @@ class ErnieModel(object):
    return mean_mask_lm_loss

def get_task_output(self, task, task_labels):
    task_fc_out = fluid.layers.fc(input=self.next_sent_feat,
                                  size=task["num_labels"],
                                  param_attr=fluid.ParamAttr(name=task["task_name"] + "_fc.w_0",
                                                             initializer=self._param_initializer),
                                  bias_attr=task["task_name"] + "_fc.b_0")
    task_loss, task_softmax = fluid.layers.softmax_with_cross_entropy(logits=task_fc_out,
                                                                      label=task_labels,
                                                                      return_softmax=True)
    task_acc = fluid.layers.accuracy(input=task_softmax, label=task_labels)
    mean_task_loss = fluid.layers.mean(task_loss)
    return mean_task_loss, task_acc
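`get_task_output` reads two keys from the `task` dict: `task_name` (used to name the fc parameters) and `num_labels` (the width of the classification head); the labels must be int64 with shape `[batch, 1]` for `softmax_with_cross_entropy`. A hedged usage sketch (the dict values and the pre-built `ernie` instance are illustrative assumptions):

```python
import paddle.fluid as fluid

labels = fluid.data(name="labels", shape=[-1, 1], dtype="int64")
task = {"task_name": "nsp", "num_labels": 2}  # illustrative head description

# `ernie` is assumed to be an ErnieModel whose graph (and pooled feature) was built earlier.
mean_task_loss, task_acc = ernie.get_task_output(task, labels)
```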
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
computing softmax activiation to mask certain selected positions so that
they will not considered in attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of inpunt tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
# permuate the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of inpunt tensor x
so that it becomes one dimension, which is reverse to __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(hidden,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.fc(input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
"""
Add residual connection, layer normalization and droput to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(out,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consists of a multi-head (self) attention sublayer followed by
a position-wise feed-forward network; both components are wrapped
with post_process_layer to add residual connection, layer normalization
and dropout.
"""
attn_output = multi_head_attention(pre_process_layer(enc_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(enc_input,
attn_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
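# --- Illustrative aside (not part of the module above) ----------------------
# A minimal NumPy sketch of the math that scaled_dot_product_attention above
# implements: softmax(Q @ K^T / sqrt(d_key) + attn_bias) @ V. The shapes and the
# all-zero bias below are illustrative assumptions, not values used by ERNIE.
import numpy as np

def np_scaled_dot_product_attention(q, k, v, attn_bias):
    # q, k: [batch, n_head, seq, d_key]; v: [batch, n_head, seq, d_value]
    d_key = q.shape[-1]
    logits = np.matmul(q / np.sqrt(d_key), np.swapaxes(k, -1, -2)) + attn_bias
    logits -= logits.max(axis=-1, keepdims=True)          # numerically stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.matmul(weights, v)

q = np.random.rand(1, 2, 4, 8)
k = np.random.rand(1, 2, 4, 8)
v = np.random.rand(1, 2, 4, 8)
bias = np.zeros((1, 2, 4, 4))   # 0 keeps a position; a large negative value masks it out
print(np_scaled_dot_product_attention(q, k, v, bias).shape)   # (1, 2, 4, 8)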
...@@ -60,8 +60,7 @@ class ErnieTiny(TransformerModule):
            sequence_output (tensor): token-level output for sequence task.
        """
        self.ernie_config._config_dict['use_task_id'] = False
        ernie = ErnieModel(src_ids=input_ids,
                           position_ids=position_ids,
                           sentence_ids=segment_ids,
                           task_ids=None,
......
...@@ -95,34 +95,34 @@ class ErnieModel(object):
    def _build_model(self, src_ids, position_ids, sentence_ids, task_ids, input_mask):
        # padding id in vocabulary must be set to 0
        emb_out = fluid.layers.embedding(input=src_ids,
                                         size=[self._voc_size, self._emb_size],
                                         dtype=self._emb_dtype,
                                         param_attr=fluid.ParamAttr(name=self._word_emb_name,
                                                                    initializer=self._param_initializer),
                                         is_sparse=False)
        position_emb_out = fluid.layers.embedding(input=position_ids,
                                                  size=[self._max_position_seq_len, self._emb_size],
                                                  dtype=self._emb_dtype,
                                                  param_attr=fluid.ParamAttr(name=self._pos_emb_name,
                                                                             initializer=self._param_initializer))
        sent_emb_out = fluid.layers.embedding(sentence_ids,
                                              size=[self._sent_types, self._emb_size],
                                              dtype=self._emb_dtype,
                                              param_attr=fluid.ParamAttr(name=self._sent_emb_name,
                                                                         initializer=self._param_initializer))
        emb_out = emb_out + position_emb_out
        emb_out = emb_out + sent_emb_out
        if self._use_task_id:
            task_emb_out = fluid.layers.embedding(task_ids,
                                                  size=[self._task_types, self._emb_size],
                                                  dtype=self._emb_dtype,
                                                  param_attr=fluid.ParamAttr(name=self._task_emb_name,
                                                                             initializer=self._param_initializer))
            emb_out = emb_out + task_emb_out
...@@ -137,8 +137,7 @@ class ErnieModel(object):
        n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
        n_head_self_attn_mask.stop_gradient = True
        self._enc_out = encoder(enc_input=emb_out,
                                attn_bias=n_head_self_attn_mask,
                                n_layer=self._n_layer,
                                n_head=self._n_head,
...@@ -163,11 +162,11 @@ class ErnieModel(object):
        next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
        if self._dtype == "float16":
            next_sent_feat = fluid.layers.cast(x=next_sent_feat, dtype=self._emb_dtype)
        next_sent_feat = fluid.layers.fc(input=next_sent_feat,
                                         size=self._emb_size,
                                         act="tanh",
                                         param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
                                                                    initializer=self._param_initializer),
                                         bias_attr="pooled_fc.b_0")
        return next_sent_feat
...@@ -185,39 +184,41 @@ class ErnieModel(object):
        mask_feat = fluid.layers.cast(x=mask_feat, dtype=self._emb_dtype)
        # transform: fc
        mask_trans_feat = fluid.layers.fc(input=mask_feat,
                                          size=self._emb_size,
                                          act=self._hidden_act,
                                          param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
                                                                     initializer=self._param_initializer),
                                          bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
        # transform: layer norm
        mask_trans_feat = fluid.layers.layer_norm(mask_trans_feat,
                                                  begin_norm_axis=len(mask_trans_feat.shape) - 1,
                                                  param_attr=fluid.ParamAttr(
                                                      name='mask_lm_trans_layer_norm_scale',
                                                      initializer=fluid.initializer.Constant(1.)),
                                                  bias_attr=fluid.ParamAttr(name='mask_lm_trans_layer_norm_bias',
                                                                            initializer=fluid.initializer.Constant(1.)))
        # transform: layer norm
        #mask_trans_feat = pre_process_layer(
        #    mask_trans_feat, 'n', name='mask_lm_trans')
        mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
                                                initializer=fluid.initializer.Constant(value=0.0))
        if self._weight_sharing:
            fc_out = fluid.layers.matmul(x=mask_trans_feat,
                                         y=fluid.default_main_program().global_block().var(self._word_emb_name),
                                         transpose_y=True)
            fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
                                                    dtype=self._emb_dtype,
                                                    attr=mask_lm_out_bias_attr,
                                                    is_bias=True)
        else:
            fc_out = fluid.layers.fc(input=mask_trans_feat,
                                     size=self._voc_size,
                                     param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
                                                                initializer=self._param_initializer),
                                     bias_attr=mask_lm_out_bias_attr)
        mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
...@@ -226,13 +227,14 @@ class ErnieModel(object):
        return mean_mask_lm_loss
    def get_task_output(self, task, task_labels):
        task_fc_out = fluid.layers.fc(input=self.next_sent_feat,
                                      size=task["num_labels"],
                                      param_attr=fluid.ParamAttr(name=task["task_name"] + "_fc.w_0",
                                                                 initializer=self._param_initializer),
                                      bias_attr=task["task_name"] + "_fc.b_0")
        task_loss, task_softmax = fluid.layers.softmax_with_cross_entropy(logits=task_fc_out,
                                                                          label=task_labels,
                                                                          return_softmax=True)
        task_acc = fluid.layers.accuracy(input=task_softmax, label=task_labels)
        mean_task_loss = fluid.layers.mean(task_loss)
        return mean_task_loss, task_acc
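# --- Illustrative aside (not part of the module above) ----------------------
# Hedged NumPy sketch of the _weight_sharing branch in get_pretraining_output:
# when weights are shared, the masked-LM logits reuse the word embedding table as
# the output projection, so only a [voc_size] bias is newly created. The sizes
# below are toy assumptions, not ERNIE's real dimensions.
import numpy as np

voc_size, emb_size, n_masked = 50, 8, 3
word_embedding = np.random.randn(voc_size, emb_size)     # stands in for the shared embedding table
mask_trans_feat = np.random.randn(n_masked, emb_size)    # transformed features of the masked tokens
mask_lm_out_bias = np.zeros(voc_size)

fc_out = mask_trans_feat @ word_embedding.T + mask_lm_out_bias   # [n_masked, voc_size] logits
print(fc_out.shape)   # (3, 50)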
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
computing the softmax activation to mask certain selected positions so that
they will not be considered in the attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError("Inputs: queries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of input tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
# permute the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of input tensor x
so that it becomes one dimension, which is the reverse of __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(hidden,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.fc(input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
"""
Add residual connection, layer normalization and dropout to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(out,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consists of a multi-head (self) attention sublayer followed by
a position-wise feed-forward network; both components are wrapped
with post_process_layer to add residual connection, layer normalization
and dropout.
"""
attn_output = multi_head_attention(pre_process_layer(enc_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(enc_input,
attn_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
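# --- Illustrative aside (not part of the module above) ----------------------
# Rough pure-NumPy sketch of how the process_cmd strings used by
# pre_post_process_layer are interpreted: "a" = residual add, "n" = layer norm,
# "d" = dropout, so preprocess_cmd="n" with postprocess_cmd="da" yields a
# pre-LayerNorm block. The unparameterized norm and inverted dropout below are
# simplifying assumptions, not the fluid implementation.
import numpy as np

def toy_pre_post_process(prev_out, out, process_cmd, dropout_rate=0.0):
    for cmd in process_cmd:
        if cmd == "a" and prev_out is not None:            # residual connection
            out = out + prev_out
        elif cmd == "n":                                   # layer normalization (no scale/bias)
            mean = out.mean(axis=-1, keepdims=True)
            std = out.std(axis=-1, keepdims=True) + 1e-6
            out = (out - mean) / std
        elif cmd == "d" and dropout_rate:                  # inverted ("upscale_in_train") dropout
            keep = (np.random.rand(*out.shape) >= dropout_rate) / (1.0 - dropout_rate)
            out = out * keep
    return out

x = np.random.randn(2, 4)
normed = toy_pre_post_process(None, x, "n")                        # pre-process before a sublayer
sublayer_out = normed * 0.5                                        # stand-in for attention/FFN output
y = toy_pre_post_process(x, sublayer_out, "da", dropout_rate=0.1)  # dropout, then residual add
print(y.shape)   # (2, 4)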
...@@ -59,8 +59,7 @@ class ErnieV2EngBase(TransformerModule):
            sequence_output (tensor): token-level output for sequence task.
        """
        self.ernie_config._config_dict['use_task_id'] = False
        ernie = ErnieModel(src_ids=input_ids,
                           position_ids=position_ids,
                           sentence_ids=segment_ids,
                           task_ids=None,
......
...@@ -95,34 +95,34 @@ class ErnieModel(object):
    def _build_model(self, src_ids, position_ids, sentence_ids, task_ids, input_mask):
        # padding id in vocabulary must be set to 0
        emb_out = fluid.layers.embedding(input=src_ids,
                                         size=[self._voc_size, self._emb_size],
                                         dtype=self._emb_dtype,
                                         param_attr=fluid.ParamAttr(name=self._word_emb_name,
                                                                    initializer=self._param_initializer),
                                         is_sparse=False)
        position_emb_out = fluid.layers.embedding(input=position_ids,
                                                  size=[self._max_position_seq_len, self._emb_size],
                                                  dtype=self._emb_dtype,
                                                  param_attr=fluid.ParamAttr(name=self._pos_emb_name,
                                                                             initializer=self._param_initializer))
        sent_emb_out = fluid.layers.embedding(sentence_ids,
                                              size=[self._sent_types, self._emb_size],
                                              dtype=self._emb_dtype,
                                              param_attr=fluid.ParamAttr(name=self._sent_emb_name,
                                                                         initializer=self._param_initializer))
        emb_out = emb_out + position_emb_out
        emb_out = emb_out + sent_emb_out
        if self._use_task_id:
            task_emb_out = fluid.layers.embedding(task_ids,
                                                  size=[self._task_types, self._emb_size],
                                                  dtype=self._emb_dtype,
                                                  param_attr=fluid.ParamAttr(name=self._task_emb_name,
                                                                             initializer=self._param_initializer))
            emb_out = emb_out + task_emb_out
...@@ -137,8 +137,7 @@ class ErnieModel(object):
        n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
        n_head_self_attn_mask.stop_gradient = True
        self._enc_out = encoder(enc_input=emb_out,
                                attn_bias=n_head_self_attn_mask,
                                n_layer=self._n_layer,
                                n_head=self._n_head,
...@@ -163,11 +162,11 @@ class ErnieModel(object):
        next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
        if self._dtype == "float16":
            next_sent_feat = fluid.layers.cast(x=next_sent_feat, dtype=self._emb_dtype)
        next_sent_feat = fluid.layers.fc(input=next_sent_feat,
                                         size=self._emb_size,
                                         act="tanh",
                                         param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
                                                                    initializer=self._param_initializer),
                                         bias_attr="pooled_fc.b_0")
        return next_sent_feat
...@@ -185,39 +184,41 @@ class ErnieModel(object):
        mask_feat = fluid.layers.cast(x=mask_feat, dtype=self._emb_dtype)
        # transform: fc
        mask_trans_feat = fluid.layers.fc(input=mask_feat,
                                          size=self._emb_size,
                                          act=self._hidden_act,
                                          param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
                                                                     initializer=self._param_initializer),
                                          bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
        # transform: layer norm
        mask_trans_feat = fluid.layers.layer_norm(mask_trans_feat,
                                                  begin_norm_axis=len(mask_trans_feat.shape) - 1,
                                                  param_attr=fluid.ParamAttr(
                                                      name='mask_lm_trans_layer_norm_scale',
                                                      initializer=fluid.initializer.Constant(1.)),
                                                  bias_attr=fluid.ParamAttr(name='mask_lm_trans_layer_norm_bias',
                                                                            initializer=fluid.initializer.Constant(1.)))
        # transform: layer norm
        #mask_trans_feat = pre_process_layer(
        #    mask_trans_feat, 'n', name='mask_lm_trans')
        mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
                                                initializer=fluid.initializer.Constant(value=0.0))
        if self._weight_sharing:
            fc_out = fluid.layers.matmul(x=mask_trans_feat,
                                         y=fluid.default_main_program().global_block().var(self._word_emb_name),
                                         transpose_y=True)
            fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
                                                    dtype=self._emb_dtype,
                                                    attr=mask_lm_out_bias_attr,
                                                    is_bias=True)
        else:
            fc_out = fluid.layers.fc(input=mask_trans_feat,
                                     size=self._voc_size,
                                     param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
                                                                initializer=self._param_initializer),
                                     bias_attr=mask_lm_out_bias_attr)
        mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
...@@ -226,13 +227,14 @@ class ErnieModel(object):
        return mean_mask_lm_loss
    def get_task_output(self, task, task_labels):
        task_fc_out = fluid.layers.fc(input=self.next_sent_feat,
                                      size=task["num_labels"],
                                      param_attr=fluid.ParamAttr(name=task["task_name"] + "_fc.w_0",
                                                                 initializer=self._param_initializer),
                                      bias_attr=task["task_name"] + "_fc.b_0")
        task_loss, task_softmax = fluid.layers.softmax_with_cross_entropy(logits=task_fc_out,
                                                                          label=task_labels,
                                                                          return_softmax=True)
        task_acc = fluid.layers.accuracy(input=task_softmax, label=task_labels)
        mean_task_loss = fluid.layers.mean(task_loss)
        return mean_task_loss, task_acc
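# --- Illustrative aside (not part of the module above) ----------------------
# The statements that build self_attn_mask are elided ("...") in this diff, so the
# sketch below only shows the usual BERT/ERNIE-style construction, stated as an
# assumption: pairwise products of the padding mask mapped to 0 (visible) or
# -10000 (masked), then replicated per head like n_head_self_attn_mask.
import numpy as np

input_mask = np.array([[1, 1, 1, 0]], dtype="float32")            # [batch, seq], 0 marks padding
pair_mask = input_mask[:, :, None] * input_mask[:, None, :]       # [batch, seq, seq]
attn_bias = (pair_mask - 1.0) * 10000.0                           # 0 where attended, -10000 where padded
n_head = 2
n_head_attn_bias = np.repeat(attn_bias[:, None, :, :], n_head, axis=1)  # [batch, n_head, seq, seq]
print(n_head_attn_bias.shape)   # (1, 2, 4, 4)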
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
computing the softmax activation to mask certain selected positions so that
they will not be considered in the attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError("Inputs: queries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of input tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
# permute the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of input tensor x
so that it becomes one dimension, which is the reverse of __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(hidden,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.fc(input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
"""
Add residual connection, layer normalization and dropout to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(out,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consists of a multi-head (self) attention sublayer followed by
a position-wise feed-forward network; both components are wrapped
with post_process_layer to add residual connection, layer normalization
and dropout.
"""
attn_output = multi_head_attention(pre_process_layer(enc_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(enc_input,
attn_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
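# --- Illustrative aside (not part of the module above) ----------------------
# NumPy restatement (an illustration, not the fluid ops) of what __split_heads and
# __combine_heads do: [bs, seq, n_head * d] -> [bs, n_head, seq, d] and back again.
import numpy as np

bs, seq, n_head, d = 2, 5, 4, 8
x = np.random.randn(bs, seq, n_head * d)

split = x.reshape(bs, seq, n_head, d).transpose(0, 2, 1, 3)           # like __split_heads
combined = split.transpose(0, 2, 1, 3).reshape(bs, seq, n_head * d)   # like __combine_heads
assert np.allclose(combined, x)   # the two transforms are exact inverses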
...@@ -59,8 +59,7 @@ class ErnieV2EngLarge(TransformerModule):
            sequence_output (tensor): token-level output for sequence task.
        """
        self.ernie_config._config_dict['use_task_id'] = False
        ernie = ErnieModel(src_ids=input_ids,
                           position_ids=position_ids,
                           sentence_ids=segment_ids,
                           task_ids=None,
......
...@@ -5,7 +5,6 @@ class Topic(object):
    """Basic data structure of topic, contains topic id and
    corresponding probability.
    """
    def __init__(self, tid, prob):
        self.tid = tid  # topic id
        self.prob = prob  # topic probability
...@@ -15,7 +14,6 @@ class Token(object):
    """Basic storage unit of LDA documents, contains word id
    and corresponding topic.
    """
    def __init__(self, topic, id):
        self.topic = topic
        self.id = id
...@@ -25,7 +23,6 @@ class Sentence(object):
    """Basic storage unit of SentenceLDA documents, contains word ids
    of the sentence and its corresponding topic id.
    """
    def __init__(self, topic, tokens):
        self.topic = topic
        self.tokens = tokens
...@@ -34,7 +31,6 @@ class Sentence(object):
class LDADoc(object):
    """The storage structure of LDA model's inference result.
    """
    def __init__(self):
        self._num_topics = None  # Number of topics.
        self._num_accum = None  # Number of accumulated sample rounds.
...@@ -120,8 +116,8 @@ class LDADoc(object):
        dense_dist = np.zeros(self._num_topics)
        if self.size() == 0:
            return dense_dist
        dense_dist = (self._accum_topic_sum * 1.0 / self._num_accum + self._alpha) / (self.size() +
                                                                                      self._alpha * self._num_topics)
        return dense_dist
    def accumulate_topic_num(self):
...@@ -133,7 +129,6 @@ class SLDADoc(LDADoc):
    """Sentence LDA Document, inherited from LDADoc.
    Add add_sentence interface.
    """
    def __init__(self):
        super().__init__()
        self.__sentences = None
......
...@@ -11,7 +11,6 @@ from lda_news.vocab import Vocab, WordCount
class TopicModel(object):
    """Storage Structure of Topic model, including vocabulary and word topic count.
    """
    def __init__(self, model_dir, config):
        """
        Args:
......
...@@ -105,8 +105,9 @@ class TopicModel(hub.Module):
            wd = WordAndDis()
            wd.word = word
            sm = SemanticMatching()
            wd.distance = sm.likelihood_based_similarity(terms=[word],
                                                         doc_topic_dist=doc_topic_dist,
                                                         model=self.__engine.get_model())
            items.append(wd)
        def take_elem(word_dis):
......
...@@ -5,7 +5,6 @@
class Tokenizer(object):
    """Base tokenizer class.
    """
    def __init__(self):
        pass
...@@ -19,7 +18,6 @@ class SimpleTokenizer(Tokenizer):
    Notes: This tokenizer can only recognize the words in the corresponding vocab file.
    """
    def __init__(self, vocab_path):
        super().__init__()
        self.__max_word_len = 0
......
...@@ -46,7 +46,6 @@ def rand_k(k):
def timeit(f):
    """Return time cost of function f.
    """
    def timed(*args, **kwargs):
        start_time = time.time()
        result = f(*args, **kwargs)
......
...@@ -6,7 +6,6 @@ from lda_news.util import rand, rand_k
class VoseAlias(object):
    """Vose's Alias Method.
    """
    def __init__(self):
        self.__alias = None
        self.__prob = None  # np.array
......
...@@ -5,7 +5,6 @@ class Topic(object):
    """Basic data structure of topic, contains topic id and
    corresponding probability.
    """
    def __init__(self, tid, prob):
        self.tid = tid  # topic id
        self.prob = prob  # topic probability
...@@ -15,7 +14,6 @@ class Token(object):
    """Basic storage unit of LDA documents, contains word id
    and corresponding topic.
    """
    def __init__(self, topic, id):
        self.topic = topic
        self.id = id
...@@ -25,7 +23,6 @@ class Sentence(object):
    """Basic storage unit of SentenceLDA documents, contains word ids
    of the sentence and its corresponding topic id.
    """
    def __init__(self, topic, tokens):
        self.topic = topic
        self.tokens = tokens
...@@ -34,7 +31,6 @@ class Sentence(object):
class LDADoc(object):
    """The storage structure of LDA model's inference result.
    """
    def __init__(self):
        self._num_topics = None  # Number of topics.
        self._num_accum = None  # Number of accumulated sample rounds.
...@@ -120,8 +116,8 @@ class LDADoc(object):
        dense_dist = np.zeros(self._num_topics)
        if self.size() == 0:
            return dense_dist
        dense_dist = (self._accum_topic_sum * 1.0 / self._num_accum + self._alpha) / (self.size() +
                                                                                      self._alpha * self._num_topics)
        return dense_dist
    def accumulate_topic_num(self):
...@@ -133,7 +129,6 @@ class SLDADoc(LDADoc):
    """Sentence LDA Document, inherited from LDADoc.
    Add add_sentence interface.
    """
    def __init__(self):
        super().__init__()
        self.__sentences = None
......
...@@ -11,7 +11,6 @@ from lda_novel.vocab import Vocab, WordCount
class TopicModel(object):
    """Storage Structure of Topic model, including vocabulary and word topic count.
    """
    def __init__(self, model_dir, config):
        """
        Args:
......
...@@ -105,8 +105,9 @@ class TopicModel(hub.Module):
            wd = WordAndDis()
            wd.word = word
            sm = SemanticMatching()
            wd.distance = sm.likelihood_based_similarity(terms=[word],
                                                         doc_topic_dist=doc_topic_dist,
                                                         model=self.__engine.get_model())
            items.append(wd)
        def take_elem(word_dis):
......
...@@ -7,7 +7,6 @@ from paddlehub.common.logger import logger
class Tokenizer(object):
    """Base tokenizer class.
    """
    def __init__(self):
        pass
...@@ -21,7 +20,6 @@ class SimpleTokenizer(Tokenizer):
    Notes: This tokenizer can only recognize the words in the corresponding vocab file.
    """
    def __init__(self, vocab_path):
        super().__init__()
        self.__max_word_len = 0
......
...@@ -46,7 +46,6 @@ def rand_k(k):
def timeit(f):
    """Return time cost of function f.
    """
    def timed(*args, **kwargs):
        start_time = time.time()
        result = f(*args, **kwargs)
......
...@@ -9,7 +9,6 @@ from lda_novel.util import rand, rand_k
class VoseAlias(object):
    """Vose's Alias Method.
    """
    def __init__(self):
        self.__alias = None
        self.__prob = None  # np.array
......
...@@ -5,7 +5,6 @@ class Topic(object):
    """Basic data structure of topic, contains topic id and
    corresponding probability.
    """
    def __init__(self, tid, prob):
        self.tid = tid  # topic id
        self.prob = prob  # topic probability
...@@ -15,7 +14,6 @@ class Token(object):
    """Basic storage unit of LDA documents, contains word id
    and corresponding topic.
    """
    def __init__(self, topic, id):
        self.topic = topic
        self.id = id
...@@ -25,7 +23,6 @@ class Sentence(object):
    """Basic storage unit of SentenceLDA documents, contains word ids
    of the sentence and its corresponding topic id.
    """
    def __init__(self, topic, tokens):
        self.topic = topic
        self.tokens = tokens
...@@ -34,7 +31,6 @@ class Sentence(object):
class LDADoc(object):
    """The storage structure of LDA model's inference result.
    """
    def __init__(self):
        self._num_topics = None  # Number of topics.
        self._num_accum = None  # Number of accumulated sample rounds.
...@@ -120,8 +116,8 @@ class LDADoc(object):
        dense_dist = np.zeros(self._num_topics)
        if self.size() == 0:
            return dense_dist
        dense_dist = (self._accum_topic_sum * 1.0 / self._num_accum + self._alpha) / (self.size() +
                                                                                      self._alpha * self._num_topics)
        return dense_dist
    def accumulate_topic_num(self):
...@@ -133,7 +129,6 @@ class SLDADoc(LDADoc):
    """Sentence LDA Document, inherited from LDADoc.
    Add add_sentence interface.
    """
    def __init__(self):
        super().__init__()
        self.__sentences = None
......
...@@ -11,7 +11,6 @@ from lda_webpage.vocab import Vocab, WordCount
class TopicModel(object):
    """Storage Structure of Topic model, including vocabulary and word topic count.
    """
    def __init__(self, model_dir, config):
        """
        Args:
......
...@@ -105,8 +105,9 @@ class TopicModel(hub.Module):
            wd = WordAndDis()
            wd.word = word
            sm = SemanticMatching()
            wd.distance = sm.likelihood_based_similarity(terms=[word],
                                                         doc_topic_dist=doc_topic_dist,
                                                         model=self.__engine.get_model())
            items.append(wd)
        def take_elem(word_dis):
......
...@@ -7,7 +7,6 @@ from paddlehub.common.logger import logger
class Tokenizer(object):
    """Base tokenizer class.
    """
    def __init__(self):
        pass
...@@ -21,7 +20,6 @@ class SimpleTokenizer(Tokenizer):
    Notes: This tokenizer can only recognize the words in the corresponding vocab file.
    """
    def __init__(self, vocab_path):
        super().__init__()
        self.__max_word_len = 0
......
...@@ -46,7 +46,6 @@ def rand_k(k):
def timeit(f):
    """Return time cost of function f.
    """
    def timed(*args, **kwargs):
        start_time = time.time()
        result = f(*args, **kwargs)
......
...@@ -9,7 +9,6 @@ from lda_webpage.util import rand, rand_k
class VoseAlias(object):
    """Vose's Alias Method.
    """
    def __init__(self):
        self.__alias = None
        self.__prob = None  # np.array
......
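# --- Illustrative aside (not part of the modules above) ---------------------
# Hedged NumPy restatement of the smoothed document-topic distribution computed in
# LDADoc.dense_topic_dist above: average the accumulated topic counts over the
# sampling rounds, add the Dirichlet prior alpha, and normalize by
# (document size + alpha * num_topics). All numbers below are toy assumptions.
import numpy as np

num_topics, num_accum, alpha = 5, 10, 0.1
accum_topic_sum = np.array([12., 0., 25., 3., 0.])   # topic counts summed over 10 sampling rounds
doc_size = 4                                         # tokens (sentences for SLDADoc) in the document

dense_dist = (accum_topic_sum / num_accum + alpha) / (doc_size + alpha * num_topics)
print(dense_dist, dense_dist.sum())   # a proper distribution: the entries sum to 1.0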
...@@ -74,23 +74,23 @@ class BertModel(object): ...@@ -74,23 +74,23 @@ class BertModel(object):
def _build_model(self, src_ids, position_ids, sentence_ids, input_mask): def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
# padding id in vocabulary must be set to 0 # padding id in vocabulary must be set to 0
emb_out = fluid.layers.embedding( emb_out = fluid.layers.embedding(input=src_ids,
input=src_ids,
size=[self._voc_size, self._emb_size], size=[self._voc_size, self._emb_size],
dtype=self._dtype, dtype=self._dtype,
param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), param_attr=fluid.ParamAttr(name=self._word_emb_name,
initializer=self._param_initializer),
is_sparse=False) is_sparse=False)
position_emb_out = fluid.layers.embedding( position_emb_out = fluid.layers.embedding(input=position_ids,
input=position_ids,
size=[self._max_position_seq_len, self._emb_size], size=[self._max_position_seq_len, self._emb_size],
dtype=self._dtype, dtype=self._dtype,
param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) param_attr=fluid.ParamAttr(name=self._pos_emb_name,
initializer=self._param_initializer))
sent_emb_out = fluid.layers.embedding( sent_emb_out = fluid.layers.embedding(sentence_ids,
sentence_ids,
size=[self._sent_types, self._emb_size], size=[self._sent_types, self._emb_size],
dtype=self._dtype, dtype=self._dtype,
param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) param_attr=fluid.ParamAttr(name=self._sent_emb_name,
initializer=self._param_initializer))
emb_out = emb_out + position_emb_out emb_out = emb_out + position_emb_out
emb_out = emb_out + sent_emb_out emb_out = emb_out + sent_emb_out
...@@ -105,8 +105,7 @@ class BertModel(object): ...@@ -105,8 +105,7 @@ class BertModel(object):
n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
n_head_self_attn_mask.stop_gradient = True n_head_self_attn_mask.stop_gradient = True
self._enc_out = encoder( self._enc_out = encoder(enc_input=emb_out,
enc_input=emb_out,
attn_bias=n_head_self_attn_mask, attn_bias=n_head_self_attn_mask,
n_layer=self._n_layer, n_layer=self._n_layer,
n_head=self._n_head, n_head=self._n_head,
...@@ -130,11 +129,11 @@ class BertModel(object): ...@@ -130,11 +129,11 @@ class BertModel(object):
"""Get the first feature of each sequence for classification""" """Get the first feature of each sequence for classification"""
next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
next_sent_feat = fluid.layers.fc( next_sent_feat = fluid.layers.fc(input=next_sent_feat,
input=next_sent_feat,
size=self._emb_size, size=self._emb_size,
act="tanh", act="tanh",
param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
initializer=self._param_initializer),
bias_attr="pooled_fc.b_0") bias_attr="pooled_fc.b_0")
return next_sent_feat return next_sent_feat
...@@ -150,43 +149,45 @@ class BertModel(object): ...@@ -150,43 +149,45 @@ class BertModel(object):
mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
# transform: fc # transform: fc
mask_trans_feat = fluid.layers.fc( mask_trans_feat = fluid.layers.fc(input=mask_feat,
input=mask_feat,
size=self._emb_size, size=self._emb_size,
act=self._hidden_act, act=self._hidden_act,
param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer), param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
initializer=self._param_initializer),
bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
# transform: layer norm # transform: layer norm
mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans') mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
mask_lm_out_bias_attr = fluid.ParamAttr( mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0)) initializer=fluid.initializer.Constant(value=0.0))
if self._weight_sharing: if self._weight_sharing:
fc_out = fluid.layers.matmul( fc_out = fluid.layers.matmul(x=mask_trans_feat,
x=mask_trans_feat,
y=fluid.default_main_program().global_block().var(self._word_emb_name), y=fluid.default_main_program().global_block().var(self._word_emb_name),
transpose_y=True) transpose_y=True)
fc_out += fluid.layers.create_parameter( fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
shape=[self._voc_size], dtype=self._dtype, attr=mask_lm_out_bias_attr, is_bias=True) dtype=self._dtype,
attr=mask_lm_out_bias_attr,
is_bias=True)
else: else:
fc_out = fluid.layers.fc( fc_out = fluid.layers.fc(input=mask_trans_feat,
input=mask_trans_feat,
size=self._voc_size, size=self._voc_size,
param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
initializer=self._param_initializer),
bias_attr=mask_lm_out_bias_attr) bias_attr=mask_lm_out_bias_attr)
mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
next_sent_fc_out = fluid.layers.fc( next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
input=next_sent_feat,
size=2, size=2,
param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", initializer=self._param_initializer), param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
initializer=self._param_initializer),
bias_attr="next_sent_fc.b_0") bias_attr="next_sent_fc.b_0")
next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy( next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
logits=next_sent_fc_out, label=labels, return_softmax=True) label=labels,
return_softmax=True)
next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels) next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
......
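When `_weight_sharing` is enabled above, the masked-LM head reuses the word embedding table as its output projection: the transformed mask features are multiplied against the transposed embedding matrix and a separate bias is added. A minimal NumPy sketch of that weight-tying step (shapes and names here are illustrative, not taken from the module):

```python
import numpy as np

# Hypothetical sizes; the real values come from the module config.
voc_size, emb_size, num_masked = 1000, 64, 5

word_emb = np.random.randn(voc_size, emb_size).astype("float32")        # shared embedding table
mask_trans_feat = np.random.randn(num_masked, emb_size).astype("float32")
out_bias = np.zeros(voc_size, dtype="float32")                           # plays the role of mask_lm_out_fc.b_0

# Tied projection: logits over the vocabulary without a second weight matrix.
fc_out = mask_trans_feat @ word_emb.T + out_bias
print(fc_out.shape)  # (5, 1000)
```

Tying the output projection to the embedding table saves a `voc_size × emb_size` parameter matrix, which is why the `else` branch (a plain `fc` layer) is only used when sharing is off.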
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
    computing softmax activation to mask certain selected positions so that
    they will not be considered in attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
        Reshape the last dimension of input tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
        # permute the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
        Transpose and then reshape the last two dimensions of input tensor x
so that it becomes one dimension, which is reverse to __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(hidden,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.fc(input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
"""
    Add residual connection, layer normalization and dropout to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(out,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
    This module consists of a multi-head (self) attention followed by
    position-wise feed-forward networks; both components are wrapped with
    post_process_layer to add residual connection, layer normalization
    and dropout.
"""
attn_output = multi_head_attention(pre_process_layer(enc_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(enc_input,
attn_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
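For reference, the core of `scaled_dot_product_attention` defined above (scale, add bias, softmax, weighted sum) can be written out in plain NumPy. This is only an illustrative sketch with made-up shapes; the real graph is built with `paddle.fluid` ops as shown in the file:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

batch, n_head, seq_len, d_key = 2, 4, 8, 16
q = np.random.randn(batch, n_head, seq_len, d_key)
k = np.random.randn(batch, n_head, seq_len, d_key)
v = np.random.randn(batch, n_head, seq_len, d_key)
# attn_bias plays the role of n_head_self_attn_mask: large negative values on padded positions.
attn_bias = np.zeros((batch, n_head, seq_len, seq_len))

product = (q * d_key ** -0.5) @ k.transpose(0, 1, 3, 2) + attn_bias
weights = softmax(product)   # attention weights per head
out = weights @ v            # context vectors, shape (batch, n_head, seq_len, d_key)
print(out.shape)
```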
@@ -58,8 +58,7 @@ class BertWwm(TransformerModule):
            pooled_output (tensor): sentence-level output for classification task.
            sequence_output (tensor): token-level output for sequence task.
        """
        bert = BertModel(src_ids=input_ids,
                         position_ids=position_ids,
                         sentence_ids=segment_ids,
                         input_mask=input_mask,
...
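The docstring above distinguishes `pooled_output` (sentence-level) from `sequence_output` (token-level). Under PaddleHub 1.x these are normally retrieved through a transformer module's `context` interface; the module name below is a placeholder and may differ from the actual released name:

```python
import paddlehub as hub

# Placeholder module name; use the bert_wwm module actually installed locally.
module = hub.Module(name="chinese-bert-wwm")
inputs, outputs, program = module.context(trainable=True, max_seq_len=128)

pooled_output = outputs["pooled_output"]      # sentence-level feature, e.g. for classification
sequence_output = outputs["sequence_output"]  # token-level features, e.g. for sequence labeling
```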
@@ -74,23 +74,23 @@ class BertModel(object):
    def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
        # padding id in vocabulary must be set to 0
        emb_out = fluid.layers.embedding(input=src_ids,
                                         size=[self._voc_size, self._emb_size],
                                         dtype=self._dtype,
                                         param_attr=fluid.ParamAttr(name=self._word_emb_name,
                                                                    initializer=self._param_initializer),
                                         is_sparse=False)
        position_emb_out = fluid.layers.embedding(input=position_ids,
                                                  size=[self._max_position_seq_len, self._emb_size],
                                                  dtype=self._dtype,
                                                  param_attr=fluid.ParamAttr(name=self._pos_emb_name,
                                                                             initializer=self._param_initializer))
        sent_emb_out = fluid.layers.embedding(sentence_ids,
                                              size=[self._sent_types, self._emb_size],
                                              dtype=self._dtype,
                                              param_attr=fluid.ParamAttr(name=self._sent_emb_name,
                                                                         initializer=self._param_initializer))
        emb_out = emb_out + position_emb_out
        emb_out = emb_out + sent_emb_out
@@ -105,8 +105,7 @@ class BertModel(object):
        n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
        n_head_self_attn_mask.stop_gradient = True
        self._enc_out = encoder(enc_input=emb_out,
                                attn_bias=n_head_self_attn_mask,
                                n_layer=self._n_layer,
                                n_head=self._n_head,
@@ -130,11 +129,11 @@ class BertModel(object):
        """Get the first feature of each sequence for classification"""
        next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
        next_sent_feat = fluid.layers.fc(input=next_sent_feat,
                                         size=self._emb_size,
                                         act="tanh",
                                         param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
                                                                    initializer=self._param_initializer),
                                         bias_attr="pooled_fc.b_0")
        return next_sent_feat
@@ -150,43 +149,45 @@ class BertModel(object):
        mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
        # transform: fc
        mask_trans_feat = fluid.layers.fc(input=mask_feat,
                                          size=self._emb_size,
                                          act=self._hidden_act,
                                          param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
                                                                     initializer=self._param_initializer),
                                          bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
        # transform: layer norm
        mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
        mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
                                                initializer=fluid.initializer.Constant(value=0.0))
        if self._weight_sharing:
            fc_out = fluid.layers.matmul(x=mask_trans_feat,
                                         y=fluid.default_main_program().global_block().var(self._word_emb_name),
                                         transpose_y=True)
            fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
                                                    dtype=self._dtype,
                                                    attr=mask_lm_out_bias_attr,
                                                    is_bias=True)
        else:
            fc_out = fluid.layers.fc(input=mask_trans_feat,
                                     size=self._voc_size,
                                     param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
                                                                initializer=self._param_initializer),
                                     bias_attr=mask_lm_out_bias_attr)
        mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
        mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
        next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
                                           size=2,
                                           param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
                                                                      initializer=self._param_initializer),
                                           bias_attr="next_sent_fc.b_0")
        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
                                                                                    label=labels,
                                                                                    return_softmax=True)
        next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
...
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
    computing softmax activation to mask certain selected positions so that
    they will not be considered in attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
        Reshape the last dimension of input tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
        # permute the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
        Transpose and then reshape the last two dimensions of input tensor x
so that it becomes one dimension, which is reverse to __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(hidden,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.fc(input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
"""
    Add residual connection, layer normalization and dropout to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(out,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
    This module consists of a multi-head (self) attention followed by
    position-wise feed-forward networks; both components are wrapped with
    post_process_layer to add residual connection, layer normalization
    and dropout.
"""
attn_output = multi_head_attention(pre_process_layer(enc_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(enc_input,
attn_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
@@ -58,8 +58,7 @@ class BertWwm(TransformerModule):
            pooled_output (tensor): sentence-level output for classification task.
            sequence_output (tensor): token-level output for sequence task.
        """
        bert = BertModel(src_ids=input_ids,
                         position_ids=position_ids,
                         sentence_ids=segment_ids,
                         input_mask=input_mask,
...
@@ -29,8 +29,7 @@ class DataFormatError(Exception):
        self.args = args
@moduleinfo(name="simnet_bow",
            version="1.2.0",
            summary="Baidu's open-source similarity network model based on bow_pairwise.",
            author="baidu-nlp",
@@ -107,11 +106,11 @@ class SimnetBow(hub.Module):
        seq_len_used = fluid.layers.squeeze(seq_len, axes=[1])
        # Add embedding layer.
        w_param_attrs = fluid.ParamAttr(name="emb",
                                        initializer=fluid.initializer.TruncatedNormal(scale=0.02),
                                        trainable=trainable)
        dict_dim = 500002
        emb_1 = fluid.layers.embedding(input=text_1,
                                       size=[dict_dim, 128],
                                       is_sparse=True,
                                       padding_idx=dict_dim - 1,
@@ -123,8 +122,7 @@ class SimnetBow(hub.Module):
        if num_slots > 1:
            text_2 = fluid.data(name='text_2', shape=[-1, max_seq_len], dtype='int64', lod_level=0)
            emb_2 = fluid.embedding(input=text_2,
                                    size=[dict_dim, 128],
                                    is_sparse=True,
                                    padding_idx=dict_dim - 1,
@@ -136,8 +134,7 @@ class SimnetBow(hub.Module):
        if num_slots > 2:
            text_3 = fluid.data(name='text_3', shape=[-1, max_seq_len], dtype='int64', lod_level=0)
            emb_3 = fluid.embedding(input=text_3,
                                    size=[dict_dim, 128],
                                    is_sparse=True,
                                    padding_idx=dict_dim - 1,
@@ -298,8 +295,10 @@ class SimnetBow(hub.Module):
        """
        Run as a command
        """
        self.parser = argparse.ArgumentParser(description="Run the simnet_bow module.",
                                              prog='hub run simnet_bow',
                                              usage='%(prog)s',
                                              add_help=True)
        self.arg_input_group = self.parser.add_argument_group(title="Input options", description="Input data. Required")
        self.arg_config_group = self.parser.add_argument_group(
@@ -324,8 +323,10 @@ class SimnetBow(hub.Module):
        """
        Add the command config options
        """
        self.arg_config_group.add_argument('--use_gpu',
                                           type=ast.literal_eval,
                                           default=False,
                                           help="whether use GPU for prediction")
        self.arg_config_group.add_argument('--batch_size', type=int, default=1, help="batch size for prediction")
...
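The argparse setup above backs the `hub run simnet_bow` command line. A hedged sketch of the equivalent Python-side call is shown below; the `similarity` method and its argument names follow the module's documented interface, but may vary across versions:

```python
import paddlehub as hub

simnet_bow = hub.Module(name="simnet_bow")

# Each pair of texts (first list vs. second list, element-wise) is scored for similarity.
test_text = [["这道题太难了", "这道题是上一年的考题"], ["这道题不简单", "这道题不是很难"]]
results = simnet_bow.similarity(texts=test_text, use_gpu=False, batch_size=1)
for item in results:
    print(item)
```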
@@ -5,7 +5,6 @@ class Topic(object):
    """Basic data structure of topic, contains topic id and
    corresponding probability.
    """
    def __init__(self, tid, prob):
        self.tid = tid  # topic id
        self.prob = prob  # topic probability
@@ -15,7 +14,6 @@ class Token(object):
    """Basic storage unit of LDA documents, contains word id
    and corresponding topic.
    """
    def __init__(self, topic, id):
        self.topic = topic
        self.id = id
@@ -25,7 +23,6 @@ class Sentence(object):
    """Basic storage unit of SentenceLDA documents, contains word ids
    of the sentence and its corresponding topic id.
    """
    def __init__(self, topic, tokens):
        self.topic = topic
        self.tokens = tokens
@@ -34,7 +31,6 @@ class Sentence(object):
class LDADoc(object):
    """The storage structure of LDA model's inference result.
    """
    def __init__(self):
        self._num_topics = None  # Number of topics.
        self._num_accum = None  # Number of accumulated sample rounds.
@@ -120,8 +116,8 @@ class LDADoc(object):
        dense_dist = np.zeros(self._num_topics)
        if self.size() == 0:
            return dense_dist
        dense_dist = (self._accum_topic_sum * 1.0 / self._num_accum + self._alpha) / (self.size() +
                                                                                      self._alpha * self._num_topics)
        return dense_dist
    def accumulate_topic_num(self):
@@ -133,7 +129,6 @@ class SLDADoc(LDADoc):
    """Sentence LDA Document, inherited from LDADoc.
    Add add_sentence interface.
    """
    def __init__(self):
        super().__init__()
        self.__sentences = None
...
@@ -11,7 +11,6 @@ from slda_news.vocab import Vocab, WordCount
class TopicModel(object):
    """Storage Structure of Topic model, including vocabulary and word topic count.
    """
    def __init__(self, model_dir, config):
        """
        Args:
...
@@ -7,7 +7,6 @@ from paddlehub.common.logger import logger
class Tokenizer(object):
    """Base tokenizer class.
    """
    def __init__(self):
        pass
@@ -21,7 +20,6 @@ class SimpleTokenizer(Tokenizer):
    Notes: This tokenizer can only recognize the words in the corresponding vocab file.
    """
    def __init__(self, vocab_path):
        super().__init__()
        self.__max_word_len = 0
...
@@ -46,7 +46,6 @@ def rand_k(k):
def timeit(f):
    """Return time cost of function f.
    """
    def timed(*args, **kwargs):
        start_time = time.time()
        result = f(*args, **kwargs)
...
@@ -9,7 +9,6 @@ from slda_news.util import rand, rand_k
class VoseAlias(object):
    """Vose's Alias Method.
    """
    def __init__(self):
        self.__alias = None
        self.__prob = None  # np.array
...
import numpy as np
class Topic(object):
"""Basic data structure of topic, contains topic id and
corresponding probability.
"""
def __init__(self, tid, prob):
self.tid = tid # topic id
self.prob = prob # topic probability
class Token(object):
"""Basic storage unit of LDA documents, contains word id
and corresponding topic.
"""
def __init__(self, topic, id):
self.topic = topic
self.id = id
class Sentence(object):
"""Basic storage unit of SentenceLDA documents, contains word ids
of the sentence and its corresponding topic id.
"""
def __init__(self, topic, tokens):
self.topic = topic
self.tokens = tokens
class LDADoc(object):
"""The storage structure of LDA model's inference result.
"""
def __init__(self):
self._num_topics = None # Number of topics.
self._num_accum = None # Number of accumulated sample rounds.
self._alpha = None # Document prior parameter.
self._tokens = None # Storage structure of inference results.
self._topic_sum = None # Document's topic sum in one round samples.
self._accum_topic_sum = None # Accumulated results of topic sum.
def init(self, num_topics):
"""Initialize the LDADoc according to num_topics.
"""
self._num_topics = num_topics
self._num_accum = 0
self._tokens = []
self._topic_sum = np.zeros(self._num_topics)
self._accum_topic_sum = np.zeros(self._num_topics)
def add_token(self, token):
"""Add new word to current LDADoc.
Arg:
token: Token class object.
"""
assert token.topic >= 0, "Topic %d out of range!" % token.topic
assert token.topic < self._num_topics, "Topic %d out of range!" % token.topic
self._tokens.append(token)
self._topic_sum[token.topic] += 1
def token(self, index):
return self._tokens[index]
def set_topic(self, index, new_topic):
"""Set the index word's topic to new_topic, and update the corresponding
topic distribution.
"""
assert new_topic >= 0, "Topic %d out of range!" % new_topic
assert new_topic < self._num_topics, "Topic %d out of range!" % new_topic
old_topic = self._tokens[index].topic
if new_topic == old_topic:
return
self._tokens[index].topic = new_topic
self._topic_sum[old_topic] -= 1
self._topic_sum[new_topic] += 1
def set_alpha(self, alpha):
self._alpha = alpha
def size(self):
"""Return number of words in LDADoc.
"""
return len(self._tokens)
def topic_sum(self, topic_id):
return self._topic_sum[topic_id]
def sparse_topic_dist(self, sort=True):
"""Return the topic distribution of documents in sparse format.
        By default, it is sorted by topic probability in descending order.
"""
topic_dist = []
sum_ = np.sum(self._accum_topic_sum)
if sum_ == 0:
return topic_dist
for i in range(0, self._num_topics):
if self._accum_topic_sum[i] == 0:
continue
topic_dist.append(Topic(i, self._accum_topic_sum[i] * 1.0 / sum_))
if sort:
def take_elem(topic):
return topic.prob
topic_dist.sort(key=take_elem, reverse=True)
if topic_dist is None:
topic_dist = []
return topic_dist
def dense_topic_dist(self):
"""Return the distribution of document topics in dense format,
taking into account the prior parameter alpha.
"""
dense_dist = np.zeros(self._num_topics)
if self.size() == 0:
return dense_dist
dense_dist = (self._accum_topic_sum * 1.0 / self._num_accum + self._alpha) / (self.size() +
self._alpha * self._num_topics)
return dense_dist
def accumulate_topic_num(self):
self._accum_topic_sum += self._topic_sum
self._num_accum += 1
class SLDADoc(LDADoc):
"""Sentence LDA Document, inherited from LDADoc.
Add add_sentence interface.
"""
def __init__(self):
super().__init__()
self.__sentences = None
def init(self, num_topics):
"""Initialize the SLDADoc according to num_topics.
"""
self._num_topics = num_topics
self.__sentences = []
self._num_accum = 0
self._topic_sum = np.zeros(self._num_topics)
self._accum_topic_sum = np.zeros(self._num_topics)
def add_sentence(self, sent):
"""Add new sentence to current SLDADoc.
Arg:
sent: Sentence class object.
"""
assert sent.topic >= 0, "Topic %d out of range!" % (sent.topic)
assert sent.topic < self._num_topics, "Topic %d out of range!" % (sent.topic)
self.__sentences.append(sent)
self._topic_sum[sent.topic] += 1
def set_topic(self, index, new_topic):
assert new_topic >= 0, "Topic %d out of range!" % (new_topic)
assert new_topic < self._num_topics, "Topic %d out of range!" % (new_topic)
old_topic = self.__sentences[index].topic
if new_topic == old_topic:
return
self.__sentences[index].topic = new_topic
self._topic_sum[old_topic] -= 1
self._topic_sum[new_topic] += 1
def size(self):
"""Return number of sentences in SLDADoc.
"""
return len(self.__sentences)
def sent(self, index):
return self.__sentences[index]
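To make the accumulation logic above concrete, the following sketch (assuming the `LDADoc`/`Token` definitions above are in scope) adds a few tokens with initial topics, folds one sampling round into the accumulated counts, and reads back both topic distributions:

```python
# Assumes LDADoc and Token from the listing above are in scope.
doc = LDADoc()
doc.init(num_topics=3)
doc.set_alpha(0.1)

# Three words with illustrative initial topic assignments.
for topic, word_id in [(0, 11), (2, 42), (2, 7)]:
    doc.add_token(Token(topic, word_id))

doc.accumulate_topic_num()  # fold one sampling round into the accumulated counts

for t in doc.sparse_topic_dist():   # non-zero topics only, sorted by probability
    print(t.tid, t.prob)            # e.g. 2 0.666..., then 0 0.333...
print(doc.dense_topic_dist())       # (accum / num_accum + alpha) / (size + alpha * num_topics)
```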
@@ -11,7 +11,6 @@ from slda_novel.vocab import Vocab, WordCount
class TopicModel(object):
    """Storage Structure of Topic model, including vocabulary and word topic count.
    """
    def __init__(self, model_dir, config):
        """
        Args:
...
@@ -7,7 +7,6 @@ from paddlehub.common.logger import logger
class Tokenizer(object):
    """Base tokenizer class.
    """
    def __init__(self):
        pass
@@ -21,7 +20,6 @@ class SimpleTokenizer(Tokenizer):
    Notes: This tokenizer can only recognize the words in the corresponding vocab file.
    """
    def __init__(self, vocab_path):
        super().__init__()
        self.__max_word_len = 0
...
@@ -46,7 +46,6 @@ def rand_k(k):
def timeit(f):
    """Return time cost of function f.
    """
    def timed(*args, **kwargs):
        start_time = time.time()
        result = f(*args, **kwargs)
...
@@ -9,7 +9,6 @@ from slda_novel.util import rand, rand_k
class VoseAlias(object):
    """Vose's Alias Method.
    """
    def __init__(self):
        self.__alias = None
        self.__prob = None  # np.array
...
import numpy as np
class Topic(object):
"""Basic data structure of topic, contains topic id and
corresponding probability.
"""
def __init__(self, tid, prob):
self.tid = tid # topic id
self.prob = prob # topic probability
class Token(object):
"""Basic storage unit of LDA documents, contains word id
and corresponding topic.
"""
def __init__(self, topic, id):
self.topic = topic
self.id = id
class Sentence(object):
"""Basic storage unit of SentenceLDA documents, contains word ids
of the sentence and its corresponding topic id.
"""
def __init__(self, topic, tokens):
self.topic = topic
self.tokens = tokens
class LDADoc(object):
"""The storage structure of LDA model's inference result.
"""
def __init__(self):
self._num_topics = None # Number of topics.
self._num_accum = None # Number of accumulated sample rounds.
self._alpha = None # Document prior parameter.
self._tokens = None # Storage structure of inference results.
self._topic_sum = None # Document's topic sum in one round samples.
self._accum_topic_sum = None # Accumulated results of topic sum.
def init(self, num_topics):
"""Initialize the LDADoc according to num_topics.
"""
self._num_topics = num_topics
self._num_accum = 0
self._tokens = []
self._topic_sum = np.zeros(self._num_topics)
self._accum_topic_sum = np.zeros(self._num_topics)
def add_token(self, token):
"""Add new word to current LDADoc.
Arg:
token: Token class object.
"""
assert token.topic >= 0, "Topic %d out of range!" % token.topic
assert token.topic < self._num_topics, "Topic %d out of range!" % token.topic
self._tokens.append(token)
self._topic_sum[token.topic] += 1
def token(self, index):
return self._tokens[index]
def set_topic(self, index, new_topic):
"""Set the index word's topic to new_topic, and update the corresponding
topic distribution.
"""
assert new_topic >= 0, "Topic %d out of range!" % new_topic
assert new_topic < self._num_topics, "Topic %d out of range!" % new_topic
old_topic = self._tokens[index].topic
if new_topic == old_topic:
return
self._tokens[index].topic = new_topic
self._topic_sum[old_topic] -= 1
self._topic_sum[new_topic] += 1
def set_alpha(self, alpha):
self._alpha = alpha
def size(self):
"""Return number of words in LDADoc.
"""
return len(self._tokens)
def topic_sum(self, topic_id):
return self._topic_sum[topic_id]
def sparse_topic_dist(self, sort=True):
"""Return the topic distribution of documents in sparse format.
        By default, it is sorted by topic probability in descending order.
"""
topic_dist = []
sum_ = np.sum(self._accum_topic_sum)
if sum_ == 0:
return topic_dist
for i in range(0, self._num_topics):
if self._accum_topic_sum[i] == 0:
continue
topic_dist.append(Topic(i, self._accum_topic_sum[i] * 1.0 / sum_))
if sort:
def take_elem(topic):
return topic.prob
topic_dist.sort(key=take_elem, reverse=True)
if topic_dist is None:
topic_dist = []
return topic_dist
def dense_topic_dist(self):
"""Return the distribution of document topics in dense format,
taking into account the prior parameter alpha.
"""
dense_dist = np.zeros(self._num_topics)
if self.size() == 0:
return dense_dist
dense_dist = (self._accum_topic_sum * 1.0 / self._num_accum + self._alpha) / (self.size() +
self._alpha * self._num_topics)
return dense_dist
def accumulate_topic_num(self):
self._accum_topic_sum += self._topic_sum
self._num_accum += 1
class SLDADoc(LDADoc):
"""Sentence LDA Document, inherited from LDADoc.
Add add_sentence interface.
"""
def __init__(self):
super().__init__()
self.__sentences = None
def init(self, num_topics):
"""Initialize the SLDADoc according to num_topics.
"""
self._num_topics = num_topics
self.__sentences = []
self._num_accum = 0
self._topic_sum = np.zeros(self._num_topics)
self._accum_topic_sum = np.zeros(self._num_topics)
def add_sentence(self, sent):
"""Add new sentence to current SLDADoc.
Arg:
sent: Sentence class object.
"""
assert sent.topic >= 0, "Topic %d out of range!" % (sent.topic)
assert sent.topic < self._num_topics, "Topic %d out of range!" % (sent.topic)
self.__sentences.append(sent)
self._topic_sum[sent.topic] += 1
def set_topic(self, index, new_topic):
assert new_topic >= 0, "Topic %d out of range!" % (new_topic)
assert new_topic < self._num_topics, "Topic %d out of range!" % (new_topic)
old_topic = self.__sentences[index].topic
if new_topic == old_topic:
return
self.__sentences[index].topic = new_topic
self._topic_sum[old_topic] -= 1
self._topic_sum[new_topic] += 1
def size(self):
"""Return number of sentences in SLDADoc.
"""
return len(self.__sentences)
def sent(self, index):
return self.__sentences[index]
@@ -11,7 +11,6 @@ from slda_webpage.vocab import Vocab, WordCount
class TopicModel(object):
    """Storage Structure of Topic model, including vocabulary and word topic count.
    """
    def __init__(self, model_dir, config):
        """
        Args:
...
import os
import numpy as np
from paddlehub.common.logger import logger
class Tokenizer(object):
"""Base tokenizer class.
"""
def __init__(self):
pass
def tokenize(self, text):
raise NotImplementedError
class SimpleTokenizer(Tokenizer):
"""Simple version FMM(Forward Maximun Matching) word tokenizer. This tokenizer can only
be used in topic model demo, but not in real business application scenarios.
Notes: This tokenizer can only recognize the words in the corresponding vocab file.
"""
def __init__(self, vocab_path):
super().__init__()
self.__max_word_len = 0
self.__vocab = set()
self.__load_vocab(vocab_path)
def tokenize(self, text):
"""Tokenize the input string `text`, and return the tokenize result.
"""
text_len = len(text)
result = []
i = 0
while i < text_len:
word = found_word = ""
# Deal with English characters.
if self.__is_eng_char(text[i]):
for j in range(i, text_len + 1):
if j < text_len and self.__is_eng_char(text[j]):
word += self.__tolower(text[j])
else:
# Forward matching by character granularity.
if word in self.__vocab:
result.append(word)
i = j - 1
break
else:
for j in range(i, min(i + self.__max_word_len, text_len)):
word += text[j]
if word in self.__vocab:
found_word = word
if len(found_word) > 0:
result.append(found_word)
i += len(found_word) - 1
i += 1
return result
def contains(self, word):
"""Check whether the word is in the vocabulary.
"""
return word in self.__vocab
def __load_vocab(self, vocab_path):
"""Load the word dictionary.
"""
with open(vocab_path, 'r', encoding='utf-8') as fin:
vocab_size = 0
for line in fin.readlines():
fields = line.strip().split('\t')
assert len(fields) >= 2
word = fields[1]
self.__max_word_len = max(self.__max_word_len, len(word))
self.__vocab.add(word)
vocab_size += 1
def __is_eng_char(self, c):
"""Check whether char c is an English character.
"""
return (c >= 'A' and c <= 'Z') or (c >= 'a' and c <= 'z')
def __tolower(self, c):
"""Return the lowercase character of the corresponding character, or return
the original character if there is no corresponding lowercase character.
"""
return c.lower()
class LACTokenizer(Tokenizer):
def __init__(self, vocab_path, lac):
super().__init__()
self.__max_word_len = 0
self.__vocab = set()
self.__lac = lac
self.__load_vocab(vocab_path)
def __load_vocab(self, vocab_path):
"""Load the word dictionary.
"""
with open(vocab_path, 'r', encoding='utf-8') as fin:
vocab_size = 0
for line in fin.readlines():
fields = line.strip().split('\t')
assert len(fields) >= 2
word = fields[1]
self.__max_word_len = max(self.__max_word_len, len(word))
self.__vocab.add(word)
vocab_size += 1
def tokenize(self, text):
results = self.__lac.lexical_analysis(texts=[text], use_gpu=False, batch_size=1, return_tag=True)
# Change English words to lower case.
# And just preserve the word in vocab.
words = results[0]["word"]
result = []
for word in words:
word = word.lower()
if word in self.__vocab:
result.append(word)
return result
def contains(self, word):
"""Check whether the word is in the vocabulary.
"""
return word in self.__vocab
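As a quick illustration of the forward-maximum-matching logic above, the sketch below writes a tiny tab-separated vocab file in the format `__load_vocab` expects (word in the second column) and tokenizes a mixed Chinese/English string; the file name is arbitrary:

```python
# Assumes SimpleTokenizer from the listing above is in scope.
with open("demo_vocab.txt", "w", encoding="utf-8") as f:        # hypothetical file name
    for i, word in enumerate(["深度", "学习", "深度学习", "paddle"]):
        f.write("%d\t%s\n" % (i, word))                          # "<id>\t<word>" per line

tokenizer = SimpleTokenizer("demo_vocab.txt")
print(tokenizer.tokenize("深度学习Paddle框架"))  # ['深度学习', 'paddle']; out-of-vocab characters are dropped
print(tokenizer.contains("学习"))                # True
```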
@@ -46,7 +46,6 @@ def rand_k(k):
def timeit(f):
    """Return time cost of function f.
    """
    def timed(*args, **kwargs):
        start_time = time.time()
        result = f(*args, **kwargs)
...
@@ -9,7 +9,6 @@ from slda_webpage.util import rand, rand_k
class VoseAlias(object):
    """Vose's Alias Method.
    """
    def __init__(self):
        self.__alias = None
        self.__prob = None  # np.array
...
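`VoseAlias` above only shows its two buffers (`__alias` and `__prob`); the alias method itself is a standard O(1)-per-draw sampler for a discrete distribution, which is why it is used for topic sampling here. A hedged NumPy sketch of building the tables and sampling (not the module's actual implementation):

```python
import numpy as np

def build_alias(probs):
    """Build Vose alias tables (prob, alias) for a discrete distribution."""
    n = len(probs)
    prob = np.zeros(n)
    alias = np.zeros(n, dtype=int)
    scaled = np.array(probs, dtype=float) * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]
        alias[s] = l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:
        prob[i] = 1.0
    return prob, alias

def sample(prob, alias):
    """Draw one index: pick a column uniformly, then keep it or jump to its alias."""
    i = np.random.randint(len(prob))
    return i if np.random.rand() < prob[i] else alias[i]

prob, alias = build_alias([0.1, 0.2, 0.3, 0.4])
counts = np.bincount([sample(prob, alias) for _ in range(10000)], minlength=4)
print(counts / counts.sum())  # roughly [0.1, 0.2, 0.3, 0.4]
```

Building the tables is O(n) once; each subsequent draw costs only one uniform index and one coin flip, which matters when a Gibbs sampler draws topics millions of times.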
import numpy as np
class Topic(object):
"""Basic data structure of topic, contains topic id and
corresponding probability.
"""
def __init__(self, tid, prob):
self.tid = tid # topic id
self.prob = prob # topic probability
class Token(object):
"""Basic storage unit of LDA documents, contains word id
and corresponding topic.
"""
def __init__(self, topic, id):
self.topic = topic
self.id = id
class Sentence(object):
"""Basic storage unit of SentenceLDA documents, contains word ids
of the sentence and its corresponding topic id.
"""
def __init__(self, topic, tokens):
self.topic = topic
self.tokens = tokens
class LDADoc(object):
"""The storage structure of LDA model's inference result.
"""
def __init__(self):
self._num_topics = None # Number of topics.
self._num_accum = None # Number of accumulated sample rounds.
self._alpha = None # Document prior parameter.
self._tokens = None # Storage structure of inference results.
self._topic_sum = None # Document's topic sum in one round samples.
self._accum_topic_sum = None # Accumulated results of topic sum.
def init(self, num_topics):
"""Initialize the LDADoc according to num_topics.
"""
self._num_topics = num_topics
self._num_accum = 0
self._tokens = []
self._topic_sum = np.zeros(self._num_topics)
self._accum_topic_sum = np.zeros(self._num_topics)
def add_token(self, token):
"""Add new word to current LDADoc.
Arg:
token: Token class object.
"""
assert token.topic >= 0, "Topic %d out of range!" % token.topic
assert token.topic < self._num_topics, "Topic %d out of range!" % token.topic
self._tokens.append(token)
self._topic_sum[token.topic] += 1
def token(self, index):
return self._tokens[index]
def set_topic(self, index, new_topic):
"""Set the index word's topic to new_topic, and update the corresponding
topic distribution.
"""
assert new_topic >= 0, "Topic %d out of range!" % new_topic
assert new_topic < self._num_topics, "Topic %d out of range!" % new_topic
old_topic = self._tokens[index].topic
if new_topic == old_topic:
return
self._tokens[index].topic = new_topic
self._topic_sum[old_topic] -= 1
self._topic_sum[new_topic] += 1
def set_alpha(self, alpha):
self._alpha = alpha
def size(self):
"""Return number of words in LDADoc.
"""
return len(self._tokens)
def topic_sum(self, topic_id):
return self._topic_sum[topic_id]
def sparse_topic_dist(self, sort=True):
"""Return the topic distribution of documents in sparse format.
        By default, it is sorted by topic probability in descending order.
"""
topic_dist = []
sum_ = np.sum(self._accum_topic_sum)
if sum_ == 0:
return topic_dist
for i in range(0, self._num_topics):
if self._accum_topic_sum[i] == 0:
continue
topic_dist.append(Topic(i, self._accum_topic_sum[i] * 1.0 / sum_))
if sort:
def take_elem(topic):
return topic.prob
topic_dist.sort(key=take_elem, reverse=True)
if topic_dist is None:
topic_dist = []
return topic_dist
def dense_topic_dist(self):
"""Return the distribution of document topics in dense format,
taking into account the prior parameter alpha.
"""
dense_dist = np.zeros(self._num_topics)
if self.size() == 0:
return dense_dist
dense_dist = (self._accum_topic_sum * 1.0 / self._num_accum + self._alpha) / (self.size() +
self._alpha * self._num_topics)
return dense_dist
def accumulate_topic_num(self):
self._accum_topic_sum += self._topic_sum
self._num_accum += 1
class SLDADoc(LDADoc):
"""Sentence LDA Document, inherited from LDADoc.
Add add_sentence interface.
"""
def __init__(self):
super().__init__()
self.__sentences = None
def init(self, num_topics):
"""Initialize the SLDADoc according to num_topics.
"""
self._num_topics = num_topics
self.__sentences = []
self._num_accum = 0
self._topic_sum = np.zeros(self._num_topics)
self._accum_topic_sum = np.zeros(self._num_topics)
def add_sentence(self, sent):
"""Add new sentence to current SLDADoc.
Arg:
sent: Sentence class object.
"""
assert sent.topic >= 0, "Topic %d out of range!" % (sent.topic)
assert sent.topic < self._num_topics, "Topic %d out of range!" % (sent.topic)
self.__sentences.append(sent)
self._topic_sum[sent.topic] += 1
def set_topic(self, index, new_topic):
assert new_topic >= 0, "Topic %d out of range!" % (new_topic)
assert new_topic < self._num_topics, "Topic %d out of range!" % (new_topic)
old_topic = self.__sentences[index].topic
if new_topic == old_topic:
return
self.__sentences[index].topic = new_topic
self._topic_sum[old_topic] -= 1
self._topic_sum[new_topic] += 1
def size(self):
"""Return number of sentences in SLDADoc.
"""
return len(self.__sentences)
def sent(self, index):
return self.__sentences[index]
@@ -11,7 +11,6 @@ from slda_weibo.vocab import Vocab, WordCount
class TopicModel(object):
    """Storage Structure of Topic model, including vocabulary and word topic count.
    """
    def __init__(self, model_dir, config):
        """
        Args:
...
import os
import numpy as np
from paddlehub.common.logger import logger
class Tokenizer(object):
"""Base tokenizer class.
"""
def __init__(self):
pass
def tokenize(self, text):
raise NotImplementedError
class SimpleTokenizer(Tokenizer):
"""Simple version FMM(Forward Maximun Matching) word tokenizer. This tokenizer can only
be used in topic model demo, but not in real business application scenarios.
Notes: This tokenizer can only recognize the words in the corresponding vocab file.
"""
def __init__(self, vocab_path):
super().__init__()
self.__max_word_len = 0
self.__vocab = set()
self.__load_vocab(vocab_path)
def tokenize(self, text):
"""Tokenize the input string `text`, and return the tokenize result.
"""
text_len = len(text)
result = []
i = 0
while i < text_len:
word = found_word = ""
# Deal with English characters.
if self.__is_eng_char(text[i]):
for j in range(i, text_len + 1):
if j < text_len and self.__is_eng_char(text[j]):
word += self.__tolower(text[j])
else:
# Forward matching by character granularity.
if word in self.__vocab:
result.append(word)
i = j - 1
break
else:
for j in range(i, min(i + self.__max_word_len, text_len)):
word += text[j]
if word in self.__vocab:
found_word = word
if len(found_word) > 0:
result.append(found_word)
i += len(found_word) - 1
i += 1
return result
def contains(self, word):
"""Check whether the word is in the vocabulary.
"""
return word in self.__vocab
def __load_vocab(self, vocab_path):
"""Load the word dictionary.
"""
with open(vocab_path, 'r', encoding='utf-8') as fin:
vocab_size = 0
for line in fin.readlines():
fields = line.strip().split('\t')
assert len(fields) >= 2
word = fields[1]
self.__max_word_len = max(self.__max_word_len, len(word))
self.__vocab.add(word)
vocab_size += 1
def __is_eng_char(self, c):
"""Check whether char c is an English character.
"""
return (c >= 'A' and c <= 'Z') or (c >= 'a' and c <= 'z')
def __tolower(self, c):
"""Return the lowercase character of the corresponding character, or return
the original character if there is no corresponding lowercase character.
"""
return c.lower()
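# A minimal usage sketch (not part of the source; "demo.vocab" is a hypothetical file whose
# lines hold tab-separated fields with the surface word in the second column, matching
# __load_vocab above):
#
#   tokenizer = SimpleTokenizer("demo.vocab")
#   tokenizer.tokenize("paddlehub demo")
#
# Forward maximum matching keeps the longest in-vocab match at each position and silently
# drops characters that never extend to an in-vocab word.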
class LACTokenizer(Tokenizer):
def __init__(self, vocab_path, lac):
super().__init__()
self.__max_word_len = 0
self.__vocab = set()
self.__lac = lac
self.__load_vocab(vocab_path)
def __load_vocab(self, vocab_path):
"""Load the word dictionary.
"""
with open(vocab_path, 'r', encoding='utf-8') as fin:
vocab_size = 0
for line in fin.readlines():
fields = line.strip().split('\t')
assert len(fields) >= 2
word = fields[1]
self.__max_word_len = max(self.__max_word_len, len(word))
self.__vocab.add(word)
vocab_size += 1
def tokenize(self, text):
results = self.__lac.lexical_analysis(texts=[text], use_gpu=False, batch_size=1, return_tag=True)
        # Lower-case English words and keep only the words that appear in the vocab.
words = results[0]["word"]
result = []
for word in words:
word = word.lower()
if word in self.__vocab:
result.append(word)
return result
def contains(self, word):
"""Check whether the word is in the vocabulary.
"""
return word in self.__vocab
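A minimal wiring sketch for LACTokenizer (assumptions: PaddleHub's LAC module is installed, and "demo.vocab" is a hypothetical vocab file in the format __load_vocab expects):

import paddlehub as hub

lac = hub.Module(name="lac")                   # provides the lexical_analysis() call used in tokenize()
tokenizer = LACTokenizer("demo.vocab", lac)
print(tokenizer.tokenize("飞桨深度学习框架"))  # LAC segments the text; only in-vocab words are kept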
...@@ -46,7 +46,6 @@ def rand_k(k): ...@@ -46,7 +46,6 @@ def rand_k(k):
def timeit(f): def timeit(f):
"""Return time cost of function f. """Return time cost of function f.
""" """
def timed(*args, **kwargs): def timed(*args, **kwargs):
start_time = time.time() start_time = time.time()
result = f(*args, **kwargs) result = f(*args, **kwargs)
......
...@@ -9,7 +9,6 @@ from slda_weibo.util import rand, rand_k ...@@ -9,7 +9,6 @@ from slda_weibo.util import rand, rand_k
class VoseAlias(object): class VoseAlias(object):
"""Vose's Alias Method. """Vose's Alias Method.
""" """
def __init__(self): def __init__(self):
self.__alias = None self.__alias = None
self.__prob = None # np.array self.__prob = None # np.array
......
## **For a better user experience, we recommend the official web documentation -> [Lexical Analysis](https://www.paddlepaddle.org.cn/hubdetail)**
### Lexical Analysis
- Recommended models (a usage sketch follows the table)
| Model Name | Description |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| [Lexical Analysis - LAC](https://www.paddlepaddle.org.cn/hubdetail?name=lac&en_category=LexicalAnalysis) | Baidu's self-developed joint lexical analysis model, which performs Chinese word segmentation, part-of-speech tagging and named entity recognition as a single task. Evaluated on Baidu's in-house dataset, LAC reaches Precision = 88.0%, Recall = 88.7%, F1-Score = 88.4%. |
| [Word Segmentation - jieba](https://www.paddlepaddle.org.cn/hubdetail?name=jieba_paddle&en_category=LexicalAnalysis) | This module is jieba's segmentation network (a bidirectional GRU) built on the PaddlePaddle deep learning framework. It also supports jieba's traditional segmentation modes, such as accurate mode, full mode and search engine mode. |
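A quick Python usage sketch for the LAC module recommended above (hedged: the call mirrors the lexical_analysis API that LACTokenizer also relies on; the sample sentence is arbitrary):

import paddlehub as hub

lac = hub.Module(name="lac")
results = lac.lexical_analysis(texts=["今天是个好日子"], use_gpu=False, batch_size=1, return_tag=True)
print(results[0]["word"])   # segmented words
print(results[0]["tag"])    # part-of-speech / entity tag per word (key assumed when return_tag=True)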
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
    Multi-Head Attention. Note that attn_bias is added to the logits before
    the softmax activation is computed, in order to mask selected positions so
    that they are not considered in the attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(
input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(
input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(
input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
        Reshape the last dimension of input tensor x so that it becomes two
        dimensions and then transpose. Specifically, transform an input tensor
        with shape [bs, max_sequence_length, n_head * hidden_dim] into an output
        tensor with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
        # permute the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
        Transpose and then reshape the last two dimensions of input tensor x
        so that they become one dimension, which is the reverse of __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(
weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
out = layers.matmul(weights, v)
return out
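    # Taken together, the helpers above compute, per attention head:
    #   Attention(Q, K, V) = softmax(Q K^T / sqrt(d_key) + attn_bias) V
    # attn_bias carries large negative values at masked positions so that their softmax
    # weights vanish; the code below wires projection, head split, attention and merge.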
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(
input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(
input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(
hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
out = layers.fc(
input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
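# In formula form, the block above computes FFN(x) = act(x W1 + b1) W2 + b2, applied
# independently at every position, with optional dropout between the two projections.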
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
"""
    Add residual connection, layer normalization and dropout to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(
out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(
out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
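# process_cmd is a small command string interpreted left to right:
#   "a" -> add the residual (prev_out), "n" -> layer normalization, "d" -> dropout.
# With the defaults below, preprocess_cmd="n" layer-normalizes each sub-layer input, and
# postprocess_cmd="da" applies dropout to the sub-layer output before adding the residual.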
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
    This module consists of a multi-head (self-)attention sub-layer followed by
    position-wise feed-forward networks, with both components wrapped by
    post_process_layer to add the residual connection, layer normalization
    and dropout.
"""
attn_output = multi_head_attention(
pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(
enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att')
ffd_output = positionwise_feed_forward(
pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(
enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
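A minimal sketch of wiring the encoder above into a Paddle 1.x static-graph program (all shapes and hyper-parameters below are illustrative assumptions, not values taken from this repo):

import paddle.fluid as fluid

def build_toy_encoder():
    # [batch, seq_len, d_model] input embeddings and a pre-computed attention bias
    # (large negative values at padded positions, broadcast over the 12 heads).
    enc_input = fluid.data(name="enc_input", shape=[-1, 128, 768], dtype="float32")
    attn_bias = fluid.data(name="attn_bias", shape=[-1, 12, 128, 128], dtype="float32")
    return encoder(
        enc_input,
        attn_bias,
        n_layer=12,
        n_head=12,
        d_key=64,
        d_value=64,
        d_model=768,
        d_inner_hid=3072,
        prepostprocess_dropout=0.1,
        attention_dropout=0.1,
        relu_dropout=0.1,
        hidden_act="relu",
        name="toy_encoder")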
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
computing softmax activiation to mask certain selected positions so that
they will not considered in attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(
input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(
input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(
input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of inpunt tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
# permuate the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of inpunt tensor x
so that it becomes one dimension, which is reverse to __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(
weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(
input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(
input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(
hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
out = layers.fc(
input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
"""
Add residual connection, layer normalization and droput to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(
out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(
out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consits of a multi-head (self) attention followed by
position-wise feed-forward networks and both the two components companied
with the post_process_layer to add residual connection, layer normalization
and droput.
"""
attn_output = multi_head_attention(
pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(
enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att')
ffd_output = positionwise_feed_forward(
pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(
enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
computing softmax activiation to mask certain selected positions so that
they will not considered in attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(
input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(
input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(
input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of inpunt tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
# permuate the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of inpunt tensor x
so that it becomes one dimension, which is reverse to __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(
weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(
input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(
input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(
hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
out = layers.fc(
input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
"""
Add residual connection, layer normalization and droput to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(
out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(
out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consits of a multi-head (self) attention followed by
position-wise feed-forward networks and both the two components companied
with the post_process_layer to add residual connection, layer normalization
and droput.
"""
attn_output = multi_head_attention(
pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(
enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att')
ffd_output = positionwise_feed_forward(
pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(
enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
computing softmax activiation to mask certain selected positions so that
they will not considered in attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(
input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(
input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(
input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of inpunt tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
# permuate the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of inpunt tensor x
so that it becomes one dimension, which is reverse to __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(
weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(
input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(
input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(
hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
out = layers.fc(
input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
"""
Add residual connection, layer normalization and droput to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(
out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(
out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consits of a multi-head (self) attention followed by
position-wise feed-forward networks and both the two components companied
with the post_process_layer to add residual connection, layer normalization
and droput.
"""
attn_output = multi_head_attention(
pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(
enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att')
ffd_output = positionwise_feed_forward(
pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(
enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
computing softmax activiation to mask certain selected positions so that
they will not considered in attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(
input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(
input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(
input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of inpunt tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
# permuate the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of inpunt tensor x
so that it becomes one dimension, which is reverse to __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(
weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(
input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(
input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(
hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
out = layers.fc(
input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
"""
Add residual connection, layer normalization and droput to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(
out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(
out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consits of a multi-head (self) attention followed by
position-wise feed-forward networks and both the two components companied
with the post_process_layer to add residual connection, layer normalization
and droput.
"""
attn_output = multi_head_attention(
pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(
enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att')
ffd_output = positionwise_feed_forward(
pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(
enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
computing softmax activiation to mask certain selected positions so that
they will not considered in attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(
input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(
input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(
input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of inpunt tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
# permuate the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of inpunt tensor x
so that it becomes one dimension, which is reverse to __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(
weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(
input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(
input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(
hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
out = layers.fc(
input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
"""
Add residual connection, layer normalization and droput to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(
out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(
out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consits of a multi-head (self) attention followed by
position-wise feed-forward networks and both the two components companied
with the post_process_layer to add residual connection, layer normalization
and droput.
"""
attn_output = multi_head_attention(
pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(
enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att')
ffd_output = positionwise_feed_forward(
pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(
enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
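# --- Illustrative sketch (not part of the original module) ----------------
# A hedged example of how this encoder might be wired into a Paddle 1.x
# static graph. The shapes and hyper-parameters below (seq_len=128,
# BERT-base style sizes) are illustrative assumptions, not values taken from
# this repository; running it requires a Paddle release that still provides
# the fluid static-graph API used by this file.
def _demo_build_encoder_program():
    main_prog, start_prog = fluid.Program(), fluid.Program()
    with fluid.program_guard(main_prog, start_prog):
        # [batch, seq_len, d_model] token representations
        enc_input = fluid.layers.data(name="enc_input", shape=[128, 768], dtype="float32")
        # [batch, n_head, seq_len, seq_len] additive attention mask
        attn_bias = fluid.layers.data(name="attn_bias", shape=[12, 128, 128], dtype="float32")
        enc_out = encoder(
            enc_input, attn_bias,
            n_layer=12, n_head=12, d_key=64, d_value=64,
            d_model=768, d_inner_hid=3072,
            prepostprocess_dropout=0.1, attention_dropout=0.1,
            relu_dropout=0.1, hidden_act="relu", name="encoder")
    return enc_out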
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logits before
computing the softmax activation, masking selected positions so that
they are not considered in the attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError("Inputs: queries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(
input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(
input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(
input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of the input tensor x into two dimensions and
then transpose. Specifically, given a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim], output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
# permuate the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of inpunt tensor x
so that it becomes one dimension, which is reverse to __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(
weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(
input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(
input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(
hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
out = layers.fc(
input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
"""
Add residual connection, layer normalization and droput to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(
out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(
out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consits of a multi-head (self) attention followed by
position-wise feed-forward networks and both the two components companied
with the post_process_layer to add residual connection, layer normalization
and droput.
"""
attn_output = multi_head_attention(
pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(
enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att')
ffd_output = positionwise_feed_forward(
pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(
enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
computing softmax activiation to mask certain selected positions so that
they will not considered in attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(
input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(
input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(
input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of inpunt tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
# permuate the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of inpunt tensor x
so that it becomes one dimension, which is reverse to __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(
weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(
input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(
input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(
hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
out = layers.fc(
input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
"""
Add residual connection, layer normalization and droput to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(
out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(
out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consits of a multi-head (self) attention followed by
position-wise feed-forward networks and both the two components companied
with the post_process_layer to add residual connection, layer normalization
and droput.
"""
attn_output = multi_head_attention(
pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(
enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att')
ffd_output = positionwise_feed_forward(
pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(
enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
computing softmax activiation to mask certain selected positions so that
they will not considered in attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(
input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(
input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(
input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of inpunt tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
# permuate the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of inpunt tensor x
so that it becomes one dimension, which is reverse to __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(
weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(
input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(
input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(
hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
out = layers.fc(
input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
"""
Add residual connection, layer normalization and droput to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(
out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(
out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consits of a multi-head (self) attention followed by
position-wise feed-forward networks and both the two components companied
with the post_process_layer to add residual connection, layer normalization
and droput.
"""
attn_output = multi_head_attention(
pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(
enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att')
ffd_output = positionwise_feed_forward(
pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(
enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
import numpy as np
class Topic(object):
"""Basic data structure of topic, contains topic id and
corresponding probability.
"""
def __init__(self, tid, prob):
self.tid = tid # topic id
self.prob = prob # topic probability
class Token(object):
"""Basic storage unit of LDA documents, contains word id
and corresponding topic.
"""
def __init__(self, topic, id):
self.topic = topic
self.id = id
class Sentence(object):
"""Basic storage unit of SentenceLDA documents, contains word ids
of the sentence and its corresponding topic id.
"""
def __init__(self, topic, tokens):
self.topic = topic
self.tokens = tokens
class LDADoc(object):
"""The storage structure of LDA model's inference result.
"""
def __init__(self):
self._num_topics = None # Number of topics.
self._num_accum = None # Number of accumulated sample rounds.
self._alpha = None # Document prior parameter.
self._tokens = None # Storage structure of inference results.
        self._topic_sum = None  # Document's topic sum in one sampling round.
self._accum_topic_sum = None # Accumulated results of topic sum.
def init(self, num_topics):
"""Initialize the LDADoc according to num_topics.
"""
self._num_topics = num_topics
self._num_accum = 0
self._tokens = []
self._topic_sum = np.zeros(self._num_topics)
self._accum_topic_sum = np.zeros(self._num_topics)
def add_token(self, token):
"""Add new word to current LDADoc.
Arg:
token: Token class object.
"""
assert token.topic >= 0, "Topic %d out of range!" % token.topic
assert token.topic < self._num_topics, "Topic %d out of range!" % token.topic
self._tokens.append(token)
self._topic_sum[token.topic] += 1
def token(self, index):
return self._tokens[index]
def set_topic(self, index, new_topic):
"""Set the index word's topic to new_topic, and update the corresponding
topic distribution.
"""
assert new_topic >= 0, "Topic %d out of range!" % new_topic
assert new_topic < self._num_topics, "Topic %d out of range!" % new_topic
old_topic = self._tokens[index].topic
if new_topic == old_topic:
return
self._tokens[index].topic = new_topic
self._topic_sum[old_topic] -= 1
self._topic_sum[new_topic] += 1
def set_alpha(self, alpha):
self._alpha = alpha
def size(self):
"""Return number of words in LDADoc.
"""
return len(self._tokens)
def topic_sum(self, topic_id):
return self._topic_sum[topic_id]
def sparse_topic_dist(self, sort=True):
"""Return the topic distribution of documents in sparse format.
By default, it is sorted according to the topic probability
under the descending order.
"""
topic_dist = []
sum_ = np.sum(self._accum_topic_sum)
if sum_ == 0:
return topic_dist
for i in range(0, self._num_topics):
if self._accum_topic_sum[i] == 0:
continue
topic_dist.append(Topic(i, self._accum_topic_sum[i] * 1.0 / sum_))
if sort:
def take_elem(topic):
return topic.prob
topic_dist.sort(key=take_elem, reverse=True)
if topic_dist is None:
topic_dist = []
return topic_dist
def dense_topic_dist(self):
"""Return the distribution of document topics in dense format,
taking into account the prior parameter alpha.
"""
dense_dist = np.zeros(self._num_topics)
if self.size() == 0:
return dense_dist
dense_dist = (self._accum_topic_sum * 1.0 / self._num_accum + self._alpha) / (
self.size() + self._alpha * self._num_topics)
return dense_dist
def accumulate_topic_num(self):
self._accum_topic_sum += self._topic_sum
self._num_accum += 1
class SLDADoc(LDADoc):
"""Sentence LDA Document, inherited from LDADoc.
Add add_sentence interface.
"""
def __init__(self):
super().__init__()
self.__sentences = None
def init(self, num_topics):
"""Initialize the SLDADoc according to num_topics.
"""
self._num_topics = num_topics
self.__sentences = []
self._num_accum = 0
self._topic_sum = np.zeros(self._num_topics)
self._accum_topic_sum = np.zeros(self._num_topics)
def add_sentence(self, sent):
"""Add new sentence to current SLDADoc.
Arg:
sent: Sentence class object.
"""
assert sent.topic >= 0, "Topic %d out of range!" % (sent.topic)
assert sent.topic < self._num_topics, "Topic %d out of range!" % (sent.topic)
self.__sentences.append(sent)
self._topic_sum[sent.topic] += 1
def set_topic(self, index, new_topic):
assert new_topic >= 0, "Topic %d out of range!" % (new_topic)
assert new_topic < self._num_topics, "Topic %d out of range!" % (new_topic)
old_topic = self.__sentences[index].topic
if new_topic == old_topic:
return
self.__sentences[index].topic = new_topic
self._topic_sum[old_topic] -= 1
self._topic_sum[new_topic] += 1
def size(self):
"""Return number of sentences in SLDADoc.
"""
return len(self.__sentences)
def sent(self, index):
return self.__sentences[index]
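# --- Usage sketch (not part of the original file) ---------------------------
# A minimal, hypothetical example of how a Gibbs-sampling loop typically drives
# the LDADoc container above: initialize it with the topic count, add tokens
# together with their sampled topics, accumulate each sampling round, then read
# the topic distribution. The word ids and topic ids are made up.
def _lda_doc_demo():
    doc = LDADoc()
    doc.init(num_topics=5)
    doc.set_alpha(0.1)
    for word_id, topic_id in [(3, 0), (7, 2), (11, 2)]:
        doc.add_token(Token(topic=topic_id, id=word_id))
    doc.accumulate_topic_num()  # record this sampling round
    for topic in doc.sparse_topic_dist():  # sorted by probability, descending
        print(topic.tid, round(topic.prob, 3))  # 2 0.667, then 0 0.333
    print(doc.dense_topic_dist())  # smoothed towards uniform by alpha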
import os
import numpy as np
from paddlehub.common.logger import logger
class Tokenizer(object):
"""Base tokenizer class.
"""
def __init__(self):
pass
def tokenize(self, text):
raise NotImplementedError
class SimpleTokenizer(Tokenizer):
"""Simple version FMM(Forward Maximun Matching) word tokenizer. This tokenizer can only
be used in topic model demo, but not in real business application scenarios.
Notes: This tokenizer can only recognize the words in the corresponding vocab file.
"""
def __init__(self, vocab_path):
super().__init__()
self.__max_word_len = 0
self.__vocab = set()
self.__load_vocab(vocab_path)
def tokenize(self, text):
"""Tokenize the input string `text`, and return the tokenize result.
"""
text_len = len(text)
result = []
i = 0
while i < text_len:
word = found_word = ""
# Deal with English characters.
if self.__is_eng_char(text[i]):
for j in range(i, text_len + 1):
if j < text_len and self.__is_eng_char(text[j]):
word += self.__tolower(text[j])
else:
# Forward matching by character granularity.
if word in self.__vocab:
result.append(word)
i = j - 1
break
else:
for j in range(i, min(i + self.__max_word_len, text_len)):
word += text[j]
if word in self.__vocab:
found_word = word
if len(found_word) > 0:
result.append(found_word)
i += len(found_word) - 1
i += 1
return result
def contains(self, word):
"""Check whether the word is in the vocabulary.
"""
return word in self.__vocab
def __load_vocab(self, vocab_path):
"""Load the word dictionary.
"""
with open(vocab_path, 'r', encoding='utf-8') as fin:
vocab_size = 0
for line in fin.readlines():
fields = line.strip().split('\t')
assert len(fields) >= 2
word = fields[1]
self.__max_word_len = max(self.__max_word_len, len(word))
self.__vocab.add(word)
vocab_size += 1
def __is_eng_char(self, c):
"""Check whether char c is an English character.
"""
return (c >= 'A' and c <= 'Z') or (c >= 'a' and c <= 'z')
def __tolower(self, c):
"""Return the lowercase character of the corresponding character, or return
the original character if there is no corresponding lowercase character.
"""
return c.lower()
class LACTokenizer(Tokenizer):
def __init__(self, vocab_path, lac):
super().__init__()
self.__max_word_len = 0
self.__vocab = set()
self.__lac = lac
self.__load_vocab(vocab_path)
def __load_vocab(self, vocab_path):
"""Load the word dictionary.
"""
with open(vocab_path, 'r', encoding='utf-8') as fin:
vocab_size = 0
for line in fin.readlines():
fields = line.strip().split('\t')
assert len(fields) >= 2
word = fields[1]
self.__max_word_len = max(self.__max_word_len, len(word))
self.__vocab.add(word)
vocab_size += 1
def tokenize(self, text):
results = self.__lac.lexical_analysis(texts=[text], use_gpu=False, batch_size=1, return_tag=True)
        # Lowercase the words and keep only those present in the vocab.
words = results[0]["word"]
result = []
for word in words:
word = word.lower()
if word in self.__vocab:
result.append(word)
return result
def contains(self, word):
"""Check whether the word is in the vocabulary.
"""
return word in self.__vocab
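# --- Usage sketch (not part of the original file) ---------------------------
# A minimal, hypothetical example of forward maximum matching with
# SimpleTokenizer. The vocab file here is made up; the expected format is one
# entry per line with tab-separated fields, the word being the second field.
def _simple_tokenizer_demo():
    import tempfile

    entries = ["0\t深度", "1\t深度学习", "2\t模型", "3\tpaddle"]
    with tempfile.NamedTemporaryFile('w', suffix='.vocab', delete=False, encoding='utf-8') as f:
        f.write("\n".join(entries))
        vocab_path = f.name

    tokenizer = SimpleTokenizer(vocab_path)
    # The longest match wins, out-of-vocab characters are skipped, and English
    # runs are lowercased before lookup.
    print(tokenizer.tokenize("深度学习模型Paddle"))  # ['深度学习', '模型', 'paddle']
    print(tokenizer.contains("深度"))  # True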
## **For a better experience, see the official web documentation -> [Sentiment Analysis](https://www.paddlepaddle.org.cn/hubdetail)**
### Sentiment Analysis
Sentiment Classification (Senta) analyzes Chinese text containing subjective descriptions, automatically determines its sentiment polarity and gives a corresponding confidence score. It helps businesses understand user consumption habits, analyze trending topics and monitor public-opinion risks, providing useful decision support.
- Recommended models (a usage sketch follows the table)
| Model | Description |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| [Sentiment Analysis - LSTM](https://www.paddlepaddle.org.cn/hubdetail?name=senta_lstm&en_category=SentimentAnalysis) | LSTM implementation of sentiment classification |
| [Sentiment Analysis - GRU](https://www.paddlepaddle.org.cn/hubdetail?name=senta_gru&en_category=SentimentAnalysis) | GRU implementation of sentiment classification |
| [Dialogue Emotion Detection](https://www.paddlepaddle.org.cn/hubdetail?name=emotion_detection_textcnn&en_category=SentimentAnalysis) | For user text in intelligent dialogue scenarios, automatically determines the emotion category (positive, negative or neutral) and gives a corresponding confidence score. |
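
Below is a minimal usage sketch of calling one of the modules above through the PaddleHub Python API. The example sentences are made up, and the exact output fields may differ between module versions:

```python
import paddlehub as hub

# Load the pretrained sentiment model listed above and run prediction on a
# couple of sentences; set use_gpu=True if a GPU build of Paddle is installed.
senta = hub.Module(name="senta_lstm")
results = senta.sentiment_classify(
    texts=["这家餐厅的菜品非常好吃", "物流太慢了,体验很差"],
    use_gpu=False,
    batch_size=1)
for item in results:
    # Each result is a dict containing the input text, the predicted label
    # and the class probabilities.
    print(item)
```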
## **For a better experience, see the official web documentation -> [Syntactic Analysis](https://www.paddlepaddle.org.cn/hubdetail)**
### Syntactic Analysis
- Recommended models
| Model | Description |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| [Syntactic Analysis - DDParser](https://www.paddlepaddle.org.cn/hubdetail?name=ddparser&en_category=SyntacticAnalysis) | DDParser (Baidu Dependency Parser) is a Chinese dependency parsing tool developed by Baidu NLP on large-scale annotated data and the PaddlePaddle deep learning platform; it lets users directly extract related word pairs, long-distance dependency pairs and more from input text. |
## **For a better experience, see the official web documentation -> [Text Generation](https://www.paddlepaddle.org.cn/hubdetail)**
### Text Generation
- Recommended models (a usage sketch follows the table)
| Model | Description |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| [AI Poetry Writing](https://www.paddlepaddle.org.cn/hubdetail?name=ernie_gen_poetry&en_category=TextGeneration) | Automatically generates the following lines of a poem from the given opening line |
| [AI Love Talk](https://www.paddlepaddle.org.cn/hubdetail?name=ernie_gen_lover_words&en_category=TextGeneration) | Automatically generates follow-up lines of sweet talk from the given opening line. |
| [Generative Dialogue](https://www.paddlepaddle.org.cn/hubdetail?name=plato2_en_large&en_category=TextGeneration) | Interactive multi-turn question answering; currently English only |
| [Writing Poems from Pictures](https://www.paddlepaddle.org.cn/hubdetail?name=reading_pictures_writing_poems&en_category=TextGeneration) | Automatically generates classical Chinese poetry from an image |
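
A minimal sketch of calling the poetry module listed above; the `generate` call and its arguments are assumptions based on the ERNIE-GEN modules' usual style, so check the module page for the exact interface:

```python
import paddlehub as hub

# Load the ERNIE-GEN poetry module and continue a poem from an opening line.
module = hub.Module(name="ernie_gen_poetry")
results = module.generate(texts=["昔年旅南服,始识王荆州。"], use_gpu=False, beam_width=5)
for candidates in results:
    # One list of candidate continuations per input line.
    print(candidates)
```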
## **For a better experience, see the official web documentation -> [Text Censorship](https://www.paddlepaddle.org.cn/hubdetail)**
### Text Censorship
- Recommended models (a usage sketch follows the table)
| Model | Description |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| [Porn Detection - LSTM](https://www.paddlepaddle.org.cn/hubdetail?name=porn_detection_lstm&en_category=TextCensorship) | Automatically determines whether text is pornographic and gives a confidence score, identifying pornographic descriptions, vulgar dating and obscene content; built on an LSTM network with character-level tokenization. |
| [Porn Detection - GRU](https://www.paddlepaddle.org.cn/hubdetail?name=porn_detection_gru&en_category=TextCensorship) | Automatically determines whether text is pornographic and gives a confidence score, identifying pornographic descriptions, vulgar dating and obscene content; built on a GRU network with character-level tokenization. |
| [Porn Detection - CNN](https://www.paddlepaddle.org.cn/hubdetail?name=porn_detection_cnn&en_category=TextCensorship) | Automatically determines whether text is pornographic and gives a confidence score, identifying pornographic descriptions, vulgar dating and obscene content; built on a CNN network with character-level tokenization. |
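
A minimal, hypothetical sketch of running the LSTM-based module listed above; the `detection` call mirrors the other PaddleHub text classifiers, so verify the exact interface and output fields on the module page:

```python
import paddlehub as hub

# Load the porn detection module and classify a piece of text.
porn_detector = hub.Module(name="porn_detection_lstm")
results = porn_detector.detection(texts=["黄片下载"], use_gpu=False, batch_size=1)
for item in results:
    # Each result is a dict with the input text, the predicted label and the
    # class probabilities.
    print(item)
```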
## **For a better experience, see the official web documentation -> [Video Classification](https://www.paddlepaddle.org.cn/hubdetail)**
### Video Classification
Video data carries several kinds of information, such as audio and images, so understanding video requires not only processing audio and images but also extracting contextual information from the sequence of video frames. Video classification models suit scenarios such as quickly tagging short videos.
- Recommended models (a usage sketch follows the table)
| Model | Description |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| [Video Classification](https://www.paddlepaddle.org.cn/hubdetail?name=videotag_tsn_lstm&en_category=VideoClassification) | Video classification model pretrained on tens of millions of short videos, supporting more than 3,000 short-video tags; it reaches 89.9% Top-1 accuracy in production and generalizes well. |
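
A minimal, hypothetical sketch of tagging a local short video with the module listed above; the `classify` call and the `video.mp4` path are placeholders for illustration, so check the module page for the exact interface:

```python
import paddlehub as hub

# Load the video tagging module and classify a local video file.
videotag = hub.Module(name="videotag_tsn_lstm")
results = videotag.classify(paths=["video.mp4"], use_gpu=False)
for item in results:
    # Each result pairs the input path with its predicted tags and scores.
    print(item)
```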