diff --git a/README.md b/README.md index 879df2056c0c14f3593d89fd6a3ac5d3936ca9fe..a19e9e205309f98b9fa06ce8cba20d41aff0374b 100644 --- a/README.md +++ b/README.md @@ -1,163 +1,236 @@ -

- - -[![Build Status](https://travis-ci.org/PaddlePaddle/PaddleHub.svg?branch=release/v1.8)](https://travis-ci.org/PaddlePaddle/PaddleHub) [![License](https://img.shields.io/badge/license-Apache%202-red.svg)](LICENSE) [![Version](https://img.shields.io/github/release/PaddlePaddle/PaddleHub.svg)](https://github.com/PaddlePaddle/PaddleHub/releases) ![python version](https://img.shields.io/badge/python-3.6+-orange.svg) ![support os](https://img.shields.io/badge/os-linux%2C%20win%2C%20mac-yellow.svg) ## 简介 -PaddleHub是飞桨生态的预训练模型应用工具,开发者可以便捷地使用高质量的预训练模型结合Fine-tune API快速完成模型迁移到部署的全流程工作。PaddleHub提供的预训练模型涵盖了图像分类、目标检测、词法分析、语义模型、情感分析、视频分类、图像生成、图像分割、文本审核、关键点检测等主流模型。更多详情可查看官网:https://www.paddlepaddle.org.cn/hub - -## 特性 -- **模型即软件**:通过Python API或命令行实现模型调用,可快速体验或集成飞桨特色预训练模型。[-> 效果展示](#模型即软件) -- **易用的迁移学习**:通过Fine-tune API,只需少量代码即可完成预训练模型的Fine-tuning。[-> 效果展示](#易用的迁移学习) -- **一键模型转服务**:简单一行命令即可搭建属于自己的深度学习模型API服务完成部署。[-> 效果展示](#一键模型转服务) - - -## 文档教程 [[readthedocs]](https://paddlehub.readthedocs.io/zh_CN/develop/index.html) - -- [概述](./docs/overview.md) -- [PIP安装](./docs/installation.md) -- [快速体验](./docs/quickstart.md) -- [丰富的预训练模型](./docs/pretrained_models.md) - - [飞桨优势特色模型](./docs/pretrained_models.md) - - [计算机视觉](./docs/pretrained_models.md) - - [图像分类](./docs/pretrained_models.md) - - [目标检测](./docs/pretrained_models.md) - - [图像分割](./docs/pretrained_models.md) - - [关键点检测](./docs/pretrained_models.md) - - [图像生成](./docs/pretrained_models.md) - - [自然语言处理](./docs/pretrained_models.md) - - [中文词法分析与词向量](./docs/pretrained_models.md) - - [情感分析](./docs/pretrained_models.md) - - [文本相似度计算](./docs/pretrained_models.md) - - [文本生成](./docs/pretrained_models.md) - - [语义表示](./docs/pretrained_models.md) - - [视频](./docs/pretrained_models.md) -- 使用教程 - - [命令行工具](./docs/tutorial/cmdintro.md) - - [自定义数据](./docs/tutorial/how_to_load_data.md) - - [服务化部署](./docs/tutorial/serving.md) -- 进阶指南 - - [文本Embedding服务](./docs/tutorial/bert_service.md) - - [语义相似度计算](./docs/tutorial/sentence_sim.md) -- API - - [hub.datasets](./docs/reference/datasets.md) - - [hub.finetune](./docs/reference/finetune.md) - - [hub.Module](./docs/reference/module.md) - - [hub.vision.transforms](./docs/reference/vision.md) -- [FAQ](./docs/faq.md) -- 社区交流 - - [加入技术交流群](#欢迎加入PaddleHub技术交流群) - - [贡献预训练模型](./docs/contribution/contri_pretrained_model.md) - - [贡献代码](./docs/contribution/contri_pr.md) -- [更新历史](./docs/release.md) -- [许可证书](#许可证书) -- [致谢](#致谢) +- PaddleHub旨在为开发者提供丰富的、高质量的、直接可用的预训练模型,**【无需深度学习背景、无需数据与训练过程】**,也可快速使用AI模型。 +- 涵盖CV、NLP、Audio、Video主流四大品类,支持**一键预测**、**一键服务化部署**和**快速迁移学习** +- 全部模型开源下载,**离线可运行**。 -## 效果展示 - -### 1、模型即软件 +## 近期更新 +- **2020.11.20**,发布2.0-beta版本,全面迁移动态图编程模式,服务化部署Serving能力升级;新增手部关键点检测1个、图像动漫化类12个、图片编辑类3个,语音合成类3个,句法分析1个,预训练模型总量到达 **【182】** 个。 +- **2020.10.09**,新增OCR多语言系列模型4个,图像编辑模型4个,预训练模型总量到达 **【162】** 个。 +- **2020.09.27**,新增文本生成模型6个,图像分割模型1个,预训练模型总量到达 **【154】** 个。 +- **2020.08.13**,发布v1.8.1,新增人像分割模型Humanseg,支持EMNLP2019-Sentence-BERT作为文本匹配任务网络,预训练模型总量到达 **【147】** 个。 +- **2020.07.29**,发布v1.8.0,新增AI对联和AI写诗、jieba切词,文本数据LDA、语义相似度计算,新增目标检测,短视频分类模型,超轻量中英文OCR,新增行人检测、车辆检测、动物识别等工业级模型,支持VisualDL可视化训练,预训练模型总量到达 **【135】** 个。 +- [More](./docs/release.md) -PaddleHub采用模型即软件的设计理念,所有的预训练模型与Python软件包类似,具备版本的概念,通过`hub install/uninstall` 可以便捷完成模型的升级和卸载。还可以通过Python的API或命令行实现快速预测的软件集成,更方便地应用和集成深度学习模型。 -安装PaddleHub后,执行命令[hub run](./docs/tutorial/cmdintro.md),即可快速体验无需代码、一键预测的功能: +## [特性](./docs/figures.md) +- **【丰富的预训练模型】**:涵盖CV、NLP、Audio、Video主流四大品类的 180+ 预训练模型,全部开源下载,离线可运行。 +- **【一键模型快速预测】**:通过一行命令行或者极简的Python API实现模型调用,可快速体验模型效果。 +- 
**【一键模型转服务化】**:一行命令即可实现深度学习模型的API服务化部署。 +- **【十行代码迁移学习】**:十行代码完成图片分类、文本分类的迁移学习任务。 +- **【PIP安装便捷】**:支持PIP快速安装使用。 +- **【跨平台兼容性】**:可运行于Linux、Windows、MacOS等多种操作系统。 -* 使用[文字识别](https://www.paddlepaddle.org.cn/hublist?filter=en_category&value=TextRecognition)轻量级中文OCR模型chinese_ocr_db_crnn_mobile即可一键快速识别图片中的文字。 -```shell -$ wget https://paddlehub.bj.bcebos.com/model/image/ocr/test_ocr.jpg -$ hub run chinese_ocr_db_crnn_mobile --input_path test_ocr.jpg --visualization=True -``` -预测结果图片保存在当前运行路径下ocr_result文件夹中,如下图所示。 - -

- - -* 使用[目标检测](https://www.paddlepaddle.org.cn/hublist?filter=en_category&value=ObjectDetection)模型pyramidbox_lite_mobile_mask对图片进行口罩检测 -```shell -$ wget https://paddlehub.bj.bcebos.com/resources/test_mask_detection.jpg -$ hub run pyramidbox_lite_mobile_mask --input_path test_mask_detection.jpg -``` -

- - -* 使用[词法分析](https://www.paddlepaddle.org.cn/hublist?filter=en_category&value=LexicalAnalysis)模型LAC进行分词 -```shell -$ hub run lac --input_text "现在,慕尼黑再保险公司不仅是此类行动的倡议者,更是将其大量气候数据整合进保险产品中,并与公众共享大量天气信息,参与到新能源领域的保障中。" -[{ - 'word': ['现在', ',', '慕尼黑再保险公司', '不仅', '是', '此类', '行动', '的', '倡议者', ',', '更是', '将', '其', '大量', '气候', '数据', '整合', '进', '保险', '产品', '中', ',', '并', '与', '公众', '共享', '大量', '天气', '信息', ',', '参与', '到', '新能源', '领域', '的', '保障', '中', '。'], - 'tag': ['TIME', 'w', 'ORG', 'c', 'v', 'r', 'n', 'u', 'n', 'w', 'd', 'p', 'r', 'a', 'n', 'n', 'v', 'v', 'n', 'n', 'f', 'w', 'c', 'p', 'n', 'v', 'a', 'n', 'n', 'w', 'v', 'v', 'n', 'n', 'u', 'vn', 'f', 'w'] -}] -``` - -PaddleHub还提供图像分类、语义模型、视频分类、图像生成、图像分割、文本审核、关键点检测等主流模型,更多模型介绍,请前往[预训练模型介绍](./docs/pretrained_models.md)或者PaddleHub官网[https://www.paddlepaddle.org.cn/hub](https://www.paddlepaddle.org.cn/hub) 查看 - - - -### 2、易用的迁移学习 - -通过Fine-tune API,只需要少量代码即可完成深度学习模型在计算机视觉场景下的迁移学习。 - -* [Demo示例](./demo)提供丰富的Fine-tune API的使用代码,包括[图像分类](./demo/image_classification)、[图像着色](./demo/colorization)、[风格迁移](./demo/style_transfer)、等场景的模型迁移示例。 - -

- - -

- 十行代码完成图像风格迁移 -

- -* 如需在线快速体验,请点击[PaddleHub教程合集](https://aistudio.baidu.com/aistudio/projectdetail/231146),可使用AI Studio平台提供的GPU算力进行快速尝试。 +## 精品模型效果展示 +### 文本识别 +- 包含超轻量中英文OCR模型,高精度中英文、多语种德语、法语、日语、韩语OCR识别。 +
+ +
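+
+- 以下为通过 Python API 调用该模型的示意代码(接口与参数以 chinese_ocr_db_crnn_mobile 的模型文档为准,test_ocr.jpg 为示例图片路径):
+
+```python
+import cv2
+import paddlehub as hub
+
+# 加载超轻量中英文OCR模型(需先执行 hub install chinese_ocr_db_crnn_mobile)
+ocr = hub.Module(name="chinese_ocr_db_crnn_mobile")
+
+# 识别图片中的文字,visualization=True 时会将可视化结果保存到输出目录
+results = ocr.recognize_text(images=[cv2.imread("test_ocr.jpg")], visualization=True)
+print(results)
+```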
- -### 3、一键模型转服务 +### 人脸检测 +- 包含人脸检测,口罩人脸检测,多种算法可选。 +
+ +
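+
+- 以下为通过 Python API 调用口罩人脸检测模型的示意代码(接口与参数以 pyramidbox_lite_mobile_mask 的模型文档为准,test_mask_detection.jpg 为示例图片路径):
+
+```python
+import paddlehub as hub
+
+# 加载口罩人脸检测模型(需先执行 hub install pyramidbox_lite_mobile_mask)
+mask_detector = hub.Module(name="pyramidbox_lite_mobile_mask")
+
+# 检测图片中的人脸位置,并判断是否佩戴口罩
+results = mask_detector.face_detection(paths=["test_mask_detection.jpg"], visualization=True)
+print(results)
+```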
-PaddleHub提供便捷的模型转服务的能力,只需简单一行命令即可完成模型的HTTP服务部署。通过以下命令即可快速启动LAC词法分析服务: +### 图像编辑 +- 4倍超分效果,多种超分算法可选。 +- 黑白图片上色,可用于老旧照片修复。 +
+ + + + + + + + + + + + + +
图像超分辨率 黑白图片上色
+ +
+
+ +
+
+
+ + +### 目标检测 +- 包含行人检测、车辆检测,更有工业级超大规模预训练模型可选。 +
+ +
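+
+- 以下为通过 Python API 调用行人检测模型的示意代码(接口与参数以 yolov3_darknet53_pedestrian 的模型文档为准,test_image.jpg 为示例图片路径):
+
+```python
+import paddlehub as hub
+
+# 加载行人检测模型(需先执行 hub install yolov3_darknet53_pedestrian)
+detector = hub.Module(name="yolov3_darknet53_pedestrian")
+
+# 返回结果包含类别、置信度与检测框坐标,visualization=True 时保存可视化图片
+results = detector.object_detection(paths=["test_image.jpg"], visualization=True)
+print(results)
+```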
-```shell -$ hub serving start --modules lac -``` +### 关键点检测 +- 包含单人、多人身体关键点检测、面部关键点检测、手部关键点检测。 +
+ +
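+
+- 以多人身体关键点检测模型 openpose_body_estimation 为例,Python 调用方式示意如下(与本仓库“如何编写一个PaddleHub Module”教程中的用法一致,demo.jpg 为示例图片路径):
+
+```python
+import paddlehub as hub
+
+# 加载多人身体关键点检测模型(需先执行 hub install openpose_body_estimation)
+model = hub.Module(name="openpose_body_estimation")
+
+# 对示例图片进行关键点预测,visualization=True 时保存可视化结果
+result = model.predict("demo.jpg", visualization=True)
+```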
-更多关于模型服务化使用说明参见[PaddleHub模型一键服务化部署](./docs/tutorial/serving.md)。 +### 图像分割 +- 包含效果卓越的人像抠图模型、ACE2P人体解析世界冠军模型。 +
+ +
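+
+- 以下为通过 Python API 调用人像分割模型的示意代码(接口与参数以 deeplabv3p_xception65_humanseg 的模型文档为准,test_image.jpg 为示例图片路径):
+
+```python
+import paddlehub as hub
+
+# 加载人像分割模型(需先执行 hub install deeplabv3p_xception65_humanseg)
+human_seg = hub.Module(name="deeplabv3p_xception65_humanseg")
+
+# 对图片进行人像分割,visualization=True 时保存抠图结果
+results = human_seg.segmentation(paths=["test_image.jpg"], visualization=True)
+print(results)
+```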
-## FAQ +### 图像动漫化 +- 包含宫崎骏、新海诚在内的多位漫画家风格迁移,多种算法可选。 +
+ +
-**Q:** 利用PaddleHub Fine-tune如何适配自定义数据集? +### 图像分类 +- 包含动物分类、菜品分类、野生动物制品分类,多种算法可选。 +
+ +
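+
+- 以下为通过 Python API 调用动物识别模型的示意代码(接口与参数以 resnet50_vd_animals 的模型文档为准,test_image.jpg 为示例图片路径):
+
+```python
+import paddlehub as hub
+
+# 加载动物识别模型(需先执行 hub install resnet50_vd_animals)
+classifier = hub.Module(name="resnet50_vd_animals")
+
+# top_k 控制返回的候选类别数量
+results = classifier.classification(paths=["test_image.jpg"], top_k=3)
+print(results)
+```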
-**A:** 参考[PaddleHub使用自定义数据集完成Fine-tune](./docs/tutorial/how_to_load_data.md) +### 词法分析 +- 效果优秀的中文分词、词性标注与命名实体识别的模型。 +
+ +
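+
+- 除命令行外,也可通过 Python API 调用 LAC 模型,示意代码如下(接口与参数以 lac 的模型文档为准):
+
+```python
+import paddlehub as hub
+
+# 加载中文词法分析模型(需先执行 hub install lac)
+lac = hub.Module(name="lac")
+
+# 返回分词结果以及对应的词性/实体标签
+results = lac.lexical_analysis(texts=["今天是个好日子", "天气预报说今天要下雨"])
+print(results)
+```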
+### 文本生成 +- 包含AI写诗、AI对联、AI情话、AI藏头诗,多种算法可选。 +
+ +
-**Q:** 使用PaddleHub时,无法下载预置数据集、Module的等现象。 +### 句法分析 +- 效果领先的中文句法分析模型。 +
+ +
-**A:** 下载数据集、module等,PaddleHub要求机器可以访问外网。可以使用server_check()可以检查本地与远端PaddleHub-Server的连接状态,使用方法如下: +### 情感分析 +- 支持中文的评论情感分析。 +
+ +
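+
+- 以下为通过 Python API 调用中文情感分析模型的示意代码(接口与参数以 senta_bilstm 的模型文档为准):
+
+```python
+import paddlehub as hub
+
+# 加载中文情感分析模型(需先执行 hub install senta_bilstm)
+senta = hub.Module(name="senta_bilstm")
+
+# 返回每条评论的情感倾向(正向/负向)及对应概率
+results = senta.sentiment_classify(texts=["这家餐厅的菜品很棒,服务也很贴心"])
+print(results)
+```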
-```python -import paddlehub -paddlehub.server_check() -# 如果可以连接远端PaddleHub-Server,则显示Request Hub-Server successfully。 -# 如果无法连接远端PaddleHub-Server,则显示Request Hub-Server unsuccessfully。 -``` +### 文本审核 +- 包含中文色情文本的审核,多种算法可选。 +
+ +
-**[More](./docs/faq.md)** +### 语音合成 +- TTS语音合成算法,多种算法可选。 +- 输入:`Life was like a box of chocolates, you never know what you're gonna get.` +- 合成效果如下: +
+ + + + + + + + + + + + + + + +
deepvoice3 fastspeech transformer
+ +
+
+ +
+
+ +
+
+
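+
+- 以下为调用语音合成模型的示意代码,以 deepvoice3_ljspeech 为例(synthesize 接口及其返回值以该模型的官方文档为准,soundfile 为示例中用于保存音频的第三方库):
+
+```python
+import soundfile as sf
+import paddlehub as hub
+
+# 加载英文TTS模型(需先执行 hub install deepvoice3_ljspeech)
+tts = hub.Module(name="deepvoice3_ljspeech")
+
+# 将文本合成为语音,返回波形数据与采样率
+wavs, sample_rate = tts.synthesize(
+    texts=["Life was like a box of chocolates, you never know what you're gonna get."])
+
+# 将合成结果保存为 wav 文件
+for i, wav in enumerate(wavs):
+    sf.write("tts_output_{}.wav".format(i), wav, sample_rate)
+```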
+ +### 视频分类 +- 包含短视频分类,支持3000+标签种类,可输出TOP-K标签,多种算法可选。 +- `举例:输入一段游泳的短视频,算法可以输出"游泳"结果` +
+ +
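+
+- 以下为调用短视频分类模型的示意代码,以 videotag_tsn_lstm 为例(classify 接口与参数以该模型的官方文档为准,swim.mp4 为示例视频路径):
+
+```python
+import paddlehub as hub
+
+# 加载短视频分类模型(需先执行 hub install videotag_tsn_lstm)
+videotag = hub.Module(name="videotag_tsn_lstm")
+
+# 对本地视频进行分类,top_k 控制返回的标签数量
+results = videotag.classify(paths=["swim.mp4"], top_k=5)
+print(results)
+```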
-当您安装或者使用遇到问题时,如果在FAQ中没有找到解决方案,欢迎您将问题以[Github Issues](https://github.com/PaddlePaddle/PaddleHub/issues)的形式提交给我们,我们会第一时间进行跟进。 +## ===划重点=== +- 以上所有预训练模型全部开源,模型数量持续更新,欢迎Star关注。 +
+ + +
-## 微信扫描二维码,欢迎加入PaddleHub技术交流群 - +## 欢迎加入PaddleHub技术交流群 +- 在使用模型过程中有任何问题,可以加入官方微信群,获得更高效的问题答疑,与各行各业开发者充分交流,期待您的加入。
- +
如扫码失败,请添加微信15711058002,并备注“Hub”,运营同学会邀请您入群。 + +## 文档教程 + +- [PIP安装](./docs/installation.md) +- 快速开始 + - [命令行调用](./docs/quick_experience/cmd_quick_run.md) + - [Python API调用](./docs/quick_experience/python_use_hub.md) + - [在线运行体验demo【Official】](https://github.com/PaddlePaddle/PaddleHub/tree/release/v1.8/demo) + - [生态趣味项目demo【ThirdParty】](./docs/quick_experience/more_demos.md) +- 丰富的预训练模型 182 个 + - [精品特色模型](./docs/figure.md) + - 计算机视觉 126 个 + - [图像分类 64 个](./modules/image/classification/README.md) + - [目标检测 13 个](./modules/image/object_detection/README.md) + - [人脸检测 7 个](./modules/image/face_detection/README.md) + - [关键点检测 3 个](./modules/image/keypoint_detection/README.md) + - [图像分割 7 个](./modules/image/semantic_segmentation/README.md) + - [文本识别 8 个](./modules/image/text_recognition/README.md) + - [图像生成 17 个](./modules/image/Image_gan/README.md) + - [图像编辑 7 个](./modules/image/Image_editing/README.md) + - 自然语言处理 48 个 + - [词法分析 2 个](./modules/text/lexical_analysis/README.md) + - [句法分析 1 个](./modules/text/syntactic_analysis/README.md) + - [情感分析 7 个](./modules/text/sentiment_analysis/README.md) + - [文本审核 3 个](./modules/text/text_review/README.md) + - [文本生成 9 个](./modules/text/text_generation/README.md) + - [语义模型 26 个](./modules/text/language_model/README.md) + - 语音 3 个 + - [语音合成 3 个](./modules/audio/README.md) + - 视频5个 + - [视频分类 5 个](./modules/video/README.md) +- 部署 + - [本地Inference部署](./docs/quick_experience/python_use_hub.md) + - [一行代码服务化部署](./docs/tutorial/serving.md) + - [移动端 Lite 部署(跳转Lite教程)](https://paddle-lite.readthedocs.io/zh/latest/quick_start/tutorial.html) +- 进阶文档 + - [命令行工具详解](./docs/tutorial/cmdintro.md) + - [自定义数据迁移学习](./docs/tutorial/how_to_load_data.md) + - [模型转module](./docs/tutorial/contri_pretrained_model.md) + - [文本Embedding任务](./docs/tutorial/bert_service.md) +- 社区交流 + - [加入技术交流群](#欢迎加入PaddleHub技术交流群) + - [贡献预训练模型](./docs/contribution/contri_pretrained_model.md) + - [贡献代码](./docs/contribution/contri_pr.md) +- [FAQ](./docs/faq.md) +- [更新历史](./docs/release.md) +- [许可证书](#许可证书) +- [致谢](#致谢) + + ## 许可证书 本项目的发布受Apache 2.0 license许可认证。 diff --git a/docs/figures.md b/docs/figures.md new file mode 100644 index 0000000000000000000000000000000000000000..0568561ad97ee7749b826a61e423293552f8fa12 --- /dev/null +++ b/docs/figures.md @@ -0,0 +1,98 @@ +## 特性详解 + + +### 1、丰富的预训练模型 + +#### 1.1、图像 + +| | **精品模型举例** | +| ---------- | :----------------------------------------------------------- | +| 图像分类 | [菜品识别](https://www.paddlepaddle.org.cn/hubdetail?name=resnet50_vd_dishes&en_category=ImageClassification)、[动物识别](https://www.paddlepaddle.org.cn/hubdetail?name=resnet50_vd_animals&en_category=ImageClassification)、[动物识别](https://www.paddlepaddle.org.cn/hubdetail?name=resnet50_vd_animals&en_category=ImageClassification)、[-->More](../modules/image/classification/README.md) | +| 目标检测 | [通用检测](https://www.paddlepaddle.org.cn/hubdetail?name=yolov3_darknet53_coco2017&en_category=ObjectDetection)、[行人检测](https://www.paddlepaddle.org.cn/hubdetail?name=yolov3_darknet53_pedestrian&en_category=ObjectDetection)、[车辆检测](https://www.paddlepaddle.org.cn/hubdetail?name=yolov3_darknet53_vehicles&en_category=ObjectDetection)、[-->More](../modules/image/object_detection/README.md) | +| 人脸检测 | [人脸检测](https://www.paddlepaddle.org.cn/hubdetail?name=pyramidbox_lite_server&en_category=FaceDetection)、[口罩检测](https://www.paddlepaddle.org.cn/hubdetail?name=pyramidbox_lite_server_mask&en_category=FaceDetection)、[-->More](../modules/image/face_detection/README.md) | +| 图像分割 | 
[人像分割](https://www.paddlepaddle.org.cn/hubdetail?name=deeplabv3p_xception65_humanseg&en_category=ImageSegmentation)、[人体解析](https://www.paddlepaddle.org.cn/hubdetail?name=ace2p&en_category=ImageSegmentation)、[肺炎CT影像分析](https://www.paddlepaddle.org.cn/hubdetail?name=Pneumonia_CT_LKM_PP&en_category=ImageSegmentation)、[-->More](../modules/image/semantic_segmentation/README.md) | +| 关键点检测 | [人体关键点](https://www.paddlepaddle.org.cn/hubdetail?name=human_pose_estimation_resnet50_mpii&en_category=KeyPointDetection)、[人脸关键点](https://www.paddlepaddle.org.cn/hubdetail?name=face_landmark_localization&en_category=KeyPointDetection)、[手部关键点](https://www.paddlepaddle.org.cn/hubdetail?name=hand_pose_localization&en_category=KeyPointDetection)、[-->More](./modules/image/keypoint_detection/README.md) | +| 文本识别 | [超轻量中英文OCR文字识别](https://www.paddlepaddle.org.cn/hubdetail?name=chinese_ocr_db_crnn_mobile&en_category=TextRecognition)、[-->More](../modules/image/text_recognition/README.md) | +| 图像生成 | [风格迁移](https://www.paddlepaddle.org.cn/hubdetail?name=stylepro_artistic&en_category=GANs)、[街景动漫画]()、[-->More](../modules/image/Image_gan/README.md) | +| 图像编辑 | [超分辨率](https://www.paddlepaddle.org.cn/hubdetail?name=realsr&en_category=ImageEditing)、[黑白上色](https://www.paddlepaddle.org.cn/hubdetail?name=deoldify&en_category=ImageEditing)、[-->More](../modules/image/Image_editing/README.md) | +#### 1.2、文本 +| | **精品模型举例** | +| ---------- | :----------------------------------------------------------- | +| 词句分析 | [词法分析 ](https://www.paddlepaddle.org.cn/hubdetail?name=lac&en_category=LexicalAnalysis)、[句法分析](https://www.paddlepaddle.org.cn/hubdetail?name=ddparser&en_category=SyntacticAnalysis)、[-->More](../modules/text/lexical_analysis/README.md) | +| 情感分析 | [情感判断](https://www.paddlepaddle.org.cn/hubdetail?name=lac&en_category=LexicalAnalysis)、[情绪分析](https://www.paddlepaddle.org.cn/hubdetail?name=emotion_detection_textcnn&en_category=SentimentAnalysis) 、[-->More](../modules/text/sentiment_analysis/README.md)| +| 文本审核 | [色情审核](https://www.paddlepaddle.org.cn/hubdetail?name=porn_detection_gru&en_category=TextCensorship)、[-->More](../modules/text/text_review/README.md) | +| 文本生成 | [对联生成]()、[情话生成]()、[藏图诗生成]()、[土味情话]() 、[-->More](../modules/text/text_generation/README.md)| +| 语义模型 | [ERNIE](https://www.paddlepaddle.org.cn/hubdetail?name=ERNIE&en_category=SemanticModel)、[文本相似度](https://www.paddlepaddle.org.cn/hubdetail?name=simnet_bow&en_category=SemanticModel)、[-->More](../modules/text/language_model/README.md) | + +#### 1.3、语音 +| | **精品模型举例** | +| ---------- | :----------------------------------------------------------- | +| 语音合成 | [语音合成]() 、[-->More](../modules/audio/README.md) | +#### 1.4、视频 +| | **精品模型举例** | +| ---------- | :----------------------------------------------------------- | +| 视频分类 | [视频分类]()、[-->More](../modules/video/README.md) | + + + +### 2、一键模型预测 + + +* 举例,假如考虑使用文字识别轻量级中文OCR模型chinese_ocr_db_crnn_mobile即可一键快速识别图片中的文字。 +```shell +$ pip install paddlehub +$ wget https://paddlehub.bj.bcebos.com/model/image/ocr/test_ocr.jpg +$ hub run chinese_ocr_db_crnn_mobile --input_path test_ocr.jpg --visualization=True +``` + +* 预测结果图片保存在当前运行路径下ocr_result文件夹中,如下图所示。 + +

+ + +* 使用词法分析模型LAC进行分词 +```shell +$ hub run lac --input_text "现在,慕尼黑再保险公司不仅是此类行动的倡议者,更是将其大量气候数据整合进保险产品中,并与公众共享大量天气信息,参与到新能源领域的保障中。" +[{ + 'word': ['现在', ',', '慕尼黑再保险公司', '不仅', '是', '此类', '行动', '的', '倡议者', ',', '更是', '将', '其', '大量', '气候', '数据', '整合', '进', '保险', '产品', '中', ',', '并', '与', '公众', '共享', '大量', '天气', '信息', ',', '参与', '到', '新能源', '领域', '的', '保障', '中', '。'], + 'tag': ['TIME', 'w', 'ORG', 'c', 'v', 'r', 'n', 'u', 'n', 'w', 'd', 'p', 'r', 'a', 'n', 'n', 'v', 'v', 'n', 'n', 'f', 'w', 'c', 'p', 'n', 'v', 'a', 'n', 'n', 'w', 'v', 'v', 'n', 'n', 'u', 'vn', 'f', 'w'] +}] +``` + +除了一行代码预测之外,PaddleHub也支持使用API调用模型的方式,可以参考每个模型的详细文档。 + + + +### 3、一键模型转服务 + +PaddleHub提供便捷的模型转服务的能力,只需简单一行命令即可完成模型的HTTP服务部署。通过以下命令即可快速启动OCR文字识别服务: + +```shell +$ hub serving start -m chinese_ocr_db_crnn_mobile +``` + +更多关于模型服务化使用说明参见[PaddleHub模型一键服务化部署](./tutorial/serving.md)。 + + + + + +### 4、十行代码迁移学习 + +通过Fine-tune API,只需要少量代码即可完成深度学习模型在计算机视觉场景下的迁移学习。 + +* [Demo示例](../demo)提供丰富的Fine-tune API的使用代码,包括[图像分类](../demo/image_classification)、[图像着色](../demo/colorization)、[风格迁移](../demo/style_transfer)等场景的模型迁移示例。 + +

+ + +

+ 十行代码完成工业级文本分类 +

+ +* 如需在线快速体验,请点击[PaddleHub教程合集](https://aistudio.baidu.com/aistudio/projectdetail/231146),可使用AI Studio平台提供的GPU算力进行快速尝试。 + + + diff --git a/docs/imgs/Readme_Related/ImageClas_animal_dish_wild.gif b/docs/imgs/Readme_Related/ImageClas_animal_dish_wild.gif new file mode 100644 index 0000000000000000000000000000000000000000..6b7e8eeda85c0d7dad9ea6850f4d50f5f2da41d8 Binary files /dev/null and b/docs/imgs/Readme_Related/ImageClas_animal_dish_wild.gif differ diff --git a/docs/imgs/Readme_Related/ImageEdit_Restoration.gif b/docs/imgs/Readme_Related/ImageEdit_Restoration.gif new file mode 100644 index 0000000000000000000000000000000000000000..8951bcaeda6b2973b2136c6ec31569efde47539c Binary files /dev/null and b/docs/imgs/Readme_Related/ImageEdit_Restoration.gif differ diff --git a/docs/imgs/Readme_Related/ImageEdit_SuperResolution.gif b/docs/imgs/Readme_Related/ImageEdit_SuperResolution.gif new file mode 100644 index 0000000000000000000000000000000000000000..5a94561a1b4658270a348dbf7e9d2e874a0ca1f4 Binary files /dev/null and b/docs/imgs/Readme_Related/ImageEdit_SuperResolution.gif differ diff --git a/docs/imgs/Readme_Related/ImageGan_Anime.gif b/docs/imgs/Readme_Related/ImageGan_Anime.gif new file mode 100644 index 0000000000000000000000000000000000000000..557b5eea30c26543a55162e379c661d53789b0b1 Binary files /dev/null and b/docs/imgs/Readme_Related/ImageGan_Anime.gif differ diff --git a/docs/imgs/Readme_Related/ImageSeg_Human.gif b/docs/imgs/Readme_Related/ImageSeg_Human.gif new file mode 100644 index 0000000000000000000000000000000000000000..a5dd46beb76fca0296da53cd83f6e8b913b88e93 Binary files /dev/null and b/docs/imgs/Readme_Related/ImageSeg_Human.gif differ diff --git a/docs/imgs/Readme_Related/Image_ObjectDetection_Face_Mask.gif b/docs/imgs/Readme_Related/Image_ObjectDetection_Face_Mask.gif new file mode 100644 index 0000000000000000000000000000000000000000..f03184448183b86a48888553184a908976c83604 Binary files /dev/null and b/docs/imgs/Readme_Related/Image_ObjectDetection_Face_Mask.gif differ diff --git a/docs/imgs/Readme_Related/Image_ObjectDetection_Pedestrian_Vehicle.gif b/docs/imgs/Readme_Related/Image_ObjectDetection_Pedestrian_Vehicle.gif new file mode 100644 index 0000000000000000000000000000000000000000..b723becf5b7153938d5323ee9549a40021291d4c Binary files /dev/null and b/docs/imgs/Readme_Related/Image_ObjectDetection_Pedestrian_Vehicle.gif differ diff --git a/docs/imgs/Readme_Related/Image_Ocr.gif b/docs/imgs/Readme_Related/Image_Ocr.gif new file mode 100644 index 0000000000000000000000000000000000000000..3d53afd54242d21ed5d5f09d1c5f699e21afa355 Binary files /dev/null and b/docs/imgs/Readme_Related/Image_Ocr.gif differ diff --git a/docs/imgs/Readme_Related/Image_keypoint.gif b/docs/imgs/Readme_Related/Image_keypoint.gif new file mode 100644 index 0000000000000000000000000000000000000000..9fbffe2096348583aa9a1152cb288420ad2fb61d Binary files /dev/null and b/docs/imgs/Readme_Related/Image_keypoint.gif differ diff --git a/docs/imgs/Readme_Related/Text_Lexical Analysis.png b/docs/imgs/Readme_Related/Text_Lexical Analysis.png new file mode 100644 index 0000000000000000000000000000000000000000..2e16ed3bb188c3f64a41a37cd8688980fb9ad2d6 Binary files /dev/null and b/docs/imgs/Readme_Related/Text_Lexical Analysis.png differ diff --git a/docs/imgs/Readme_Related/Text_SentimentAnalysis.png b/docs/imgs/Readme_Related/Text_SentimentAnalysis.png new file mode 100644 index 0000000000000000000000000000000000000000..7506dc7dadd4ef7e5be8f639511a612a76e314f1 Binary files /dev/null and 
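+
+以图像分类迁移为例,下面给出一个基于 Fine-tune API 的简化示意(数据集、模型名与超参数均为示例,具体写法以 [Demo示例](../demo/image_classification) 中的代码为准):
+
+```python
+import paddle
+import paddlehub as hub
+import paddlehub.vision.transforms as T
+from paddlehub.finetune.trainer import Trainer
+from paddlehub.datasets import Flowers
+
+# 数据预处理与内置的Flowers演示数据集
+transforms = T.Compose(
+    [T.Resize((256, 256)),
+     T.CenterCrop(224),
+     T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])],
+    to_rgb=True)
+train_dataset = Flowers(transforms)
+val_dataset = Flowers(transforms, mode='val')
+
+# 加载预训练模型,并指定迁移任务的分类标签
+model = hub.Module(name='resnet50_vd_imagenet_ssld',
+                   label_list=["roses", "tulips", "daisy", "sunflower", "dandelion"])
+
+# 配置优化器与Trainer,启动Fine-tune
+optimizer = paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters())
+trainer = Trainer(model, optimizer, checkpoint_dir='img_classification_ckpt')
+trainer.train(train_dataset, epochs=10, batch_size=32, eval_dataset=val_dataset, save_interval=1)
+```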
b/docs/imgs/Readme_Related/Text_SentimentAnalysis.png differ diff --git a/docs/imgs/Readme_Related/Text_SyntacticAnalysis.png b/docs/imgs/Readme_Related/Text_SyntacticAnalysis.png new file mode 100644 index 0000000000000000000000000000000000000000..30457688c416829c77a5ae84fc027e489fbb761a Binary files /dev/null and b/docs/imgs/Readme_Related/Text_SyntacticAnalysis.png differ diff --git a/docs/imgs/Readme_Related/Text_Textgen_poetry.gif b/docs/imgs/Readme_Related/Text_Textgen_poetry.gif new file mode 100644 index 0000000000000000000000000000000000000000..9cc2d72f64f9a521498ec1ee2e2309ef58d32808 Binary files /dev/null and b/docs/imgs/Readme_Related/Text_Textgen_poetry.gif differ diff --git a/docs/imgs/Readme_Related/Text_Textreview.png b/docs/imgs/Readme_Related/Text_Textreview.png new file mode 100644 index 0000000000000000000000000000000000000000..05c353a91bd93c98c67ac26a9f77b156249419b3 Binary files /dev/null and b/docs/imgs/Readme_Related/Text_Textreview.png differ diff --git a/docs/imgs/Readme_Related/Text_Video.gif b/docs/imgs/Readme_Related/Text_Video.gif new file mode 100644 index 0000000000000000000000000000000000000000..b1fb6f5cdce000f4cff849064344b4ba3c450178 Binary files /dev/null and b/docs/imgs/Readme_Related/Text_Video.gif differ diff --git a/docs/imgs/Readme_Related/audio_icon.png b/docs/imgs/Readme_Related/audio_icon.png new file mode 100644 index 0000000000000000000000000000000000000000..925ecff7024fcd45b521d665f18075ca16783994 Binary files /dev/null and b/docs/imgs/Readme_Related/audio_icon.png differ diff --git a/docs/imgs/Readme_Related/star.png b/docs/imgs/Readme_Related/star.png new file mode 100644 index 0000000000000000000000000000000000000000..204feeae084d3f32a3df41cbf60e5f9002e8dc83 Binary files /dev/null and b/docs/imgs/Readme_Related/star.png differ diff --git a/docs/imgs/joinus.PNG b/docs/imgs/joinus.PNG new file mode 100644 index 0000000000000000000000000000000000000000..af0eda0f0eff7712f9795b4e5fa61338214c7be7 Binary files /dev/null and b/docs/imgs/joinus.PNG differ diff --git a/docs/tutorial/contri_pretrained_model.md b/docs/tutorial/contri_pretrained_model.md new file mode 100644 index 0000000000000000000000000000000000000000..d73038da6b9dc00aa004cf837239d65088a8c6d7 --- /dev/null +++ b/docs/tutorial/contri_pretrained_model.md @@ -0,0 +1,309 @@ +# 如何编写一个PaddleHub Module + +## 模型基本信息 + +我们准备编写一个PaddleHub Module,Module的基本信息如下: +```yaml +name="openpose_body_estimation", +type="CV/image_editing", +author="paddlepaddle", +author_email="", +summary="Openpose_body_estimation is a body pose estimation model based on Realtime Multi-Person 2D Pose \ + Estimation using Part Affinity Fields.", +version="1.0.0" + +``` + +Module存在一个接口predict,用于接收传入图片,并得到最终输出的结果,支持python接口调用和命令行调用。 +```python +import paddlehub as hub + +model = hub.Module(name="openpose_body_estimation") +result = model.predict("demo.jpg") +``` +```cmd +hub run openpose_body_estimation --input_path demo.jpg +``` + + +## Module创建 + +### step 1. 创建必要的目录与文件 + +创建一个openpose_body_estimation的目录,并在openpose_body_estimation目录下分别创建module.py, processor.py。其中 + +|文件名|用途| +|-|-| +|module.py|主模块,提供Module的实现代码| +|processor.py|辅助模块,提供词表加载的方法| + +```cmd +➜ tree openpose_body_estimation +openpose_body_estimation/ + ├── module.py + └── processor.py +``` +### step 2. 实现辅助模块processor + +在processor.py中实现一些在module.py里面需要调用到的类和函数。例如在processor.py 中实现ResizeScaling类: + +```python +class ResizeScaling: + """Resize images by scaling method. + + Args: + target(int): Target image size. + interpolation(Callable): Interpolation method. 
+ """ + + def __init__(self, target: int = 368, interpolation: Callable = cv2.INTER_CUBIC): + self.target = target + self.interpolation = interpolation + + def __call__(self, img, scale_search): + scale = scale_search * self.target / img.shape[0] + resize_img = cv2.resize(img, (0, 0), fx=scale, fy=scale, interpolation=self.interpolation) + return resize_img +``` + +### step 3. 编写Module处理代码 + +module.py文件为Module的入口代码所在,我们需要在其中实现预测逻辑。 + +#### step 3_1. 引入必要的头文件 +```python +import os +import time +import copy +import base64 +import argparse +from typing import Union +from collections import OrderedDict + +import cv2 +import paddle +import paddle.nn as nn +import numpy as np +from paddlehub.module.module import moduleinfo, runnable, serving +import paddlehub.vision.transforms as T +import openpose_body_estimation.processor as P +``` +**NOTE:** `paddlehub.vision.transforms`有常见的图像处理方法,可以方便调用。 + +#### step 3_2. 定义BodyPoseModel类 +module.py中需要有一个继承了nn.Layer,该类负责实现预测逻辑,并使用moduleinfo填写基本信息。当使用hub.Module(name="openpose_body_estimation")加载Module时,PaddleHub会自动创建openpose_body_estimation的对象并返回。 +```python +@moduleinfo( + name="openpose_body_estimation", + type="CV/image_editing", + author="paddlepaddle", + author_email="", + summary="Openpose_body_estimation is a body pose estimation model based on Realtime Multi-Person 2D Pose \ + Estimation using Part Affinity Fields.", + version="1.0.0") +class BodyPoseModel(nn.Layer): + ... +``` +#### step 3_3. 执行必要的初始化及模型搭建 +模型的初始化主要完成几个功能:待使用的类的声明,模型使用的类的声明及参数加载。 +```python + def __init__(self, load_checkpoint: str = None): + super(BodyPoseModel, self).__init__() + #将会使用到的类的声明 + self.resize_func = P.ResizeScaling() + self.norm_func = T.Normalize(std=[1, 1, 1]) + #模型声明 + self.input_nc = 4 + self.output_nc = 2 + model1 = ( + Conv2D(self.input_nc, 64, 3, 1, 1), + nn.ReLU(), + Conv2D(64, 64, 3, 1, 1), + nn.ReLU(), + nn.BatchNorm(64), + ) + self.model1 = nn.Sequential(*model1) + #参数加载 + if load_checkpoint is not None: + self.model_dict = paddle.load(load_checkpoint) + self.set_dict(self.model_dict) + print("load custom checkpoint success") + else: + checkpoint = os.path.join(self.directory, 'model.pdparams') + self.model_dict = paddle.load(checkpoint) + self.set_dict(self.model_dict) + print("load pretrained checkpoint success") + +``` +模型的搭建主要在`forward`里面实现: +```python +def forward(self, input: paddle.Tensor) -> paddle.Tensor: + result = self.model1(input) + return result + +``` + +#### step 3_4. 完善预测逻辑 +```python +def predict(self, img:Union(np.ndarray,str), visualization: bool = True): + self.eval() + self.visualization = visualization + if isinstance(img, str): + orgImg = cv2.imread(img) + else: + orgImg = img + data = self.resize_func(self.norm_func(orgImg)) + output = self.forward(paddle.to_tensor(data.astype('float32'))) + output = paddle.clip(output[0].transpose((1, 2, 0)), 0, 255).numpy() + output = output.astype(np.uint8) + if self.visualization: + style_name = "body_" + str(time.time()) + ".png" + if not os.path.exists(save_path): + os.mkdir(save_path) + path = os.path.join(save_path, style_name) + cv2.imwrite(path, output) + return output +``` +#### step 3_5. 支持命令行调用 +如果希望Module可以支持命令行调用,则需要提供一个经过runnable修饰的接口,接口负责解析传入数据并进行预测,将结果返回。 + +```python +@runnable +def run_cmd(self, argvs): + """ + Run as a command. 
+ """ + self.parser = argparse.ArgumentParser( + description="Run the {} module.".format(self.name), + prog='hub run {}'.format(self.name), + usage='%(prog)s', + add_help=True) + self.arg_input_group = self.parser.add_argument_group( + title="Input options", description="Input data. Required") + self.arg_config_group = self.parser.add_argument_group( + title="Config options", + description= + "Run configuration for controlling module behavior, not required.") + self.add_module_config_arg() + self.add_module_input_arg() + args = self.parser.parse_args(argvs) + results = self.predict( + img=args.input_path, + save_path=args.output_dir, + visualization=args.visualization) + return results + +def add_module_config_arg(self): + """ + Add the command config options. + """ + + self.arg_config_group.add_argument( + '--output_dir', + type=str, + default='openpose_body', + help="The directory to save output images.") + self.arg_config_group.add_argument( + '--save_dir', + type=str, + default='openpose_model', + help="The directory to save model.") + self.arg_config_group.add_argument( + '--visualization', + type=bool, + default=True, + help="whether to save output as images.") + +def add_module_input_arg(self): + """ + Add the command input options. + """ + self.arg_input_group.add_argument( + '--input_path', type=str, help="path to image.") + +``` +#### step 3_6. 支持serving调用 + +如果希望Module可以支持PaddleHub Serving部署预测服务,则需要提供一个经过serving修饰的接口,接口负责解析传入数据并进行预测,将结果返回。 + +如果不需要提供PaddleHub Serving部署预测服务,则可以不需要加上serving修饰。 + +```python +@serving +def serving_method(self, images, **kwargs): + """ + Run as a service. + """ + images_decode = [base64_to_cv2(image) for image in images] + results = self.predict(img=images_decode[0], **kwargs) + final={} + final['data'] = P.cv2_to_base64(results) + return final +``` + + +## 测试步骤 + +完成Module编写后,我们可以通过以下方式测试该Module + +### 调用方法1 + +将Module安装到本机中,再通过Hub.Module(name=...)加载 +```shell +hub install openpose_body_estimation +``` + +```python +import paddlehub as hub + +if __name__ == "__main__": + + model = hub.Module(name='openpose_hands_estimation') + result = model.predict("demo.jpg") +``` + +### 调用方法2 +将Module安装到本机中,再通过hub run运行 + +```shell +hub install openpose_body_estimation +hub run openpose_body_estimation --input_path demo.jpg +``` +### 测试serving方法 + +运行启动命令: + +```shell +$ hub serving start -m openpose_body_estimation +``` + +发送预测请求,获取预测结果. 
+ +```python +import requests +import json +import cv2 +import base64 + +import numpy as np + + +def cv2_to_base64(image): + data = cv2.imencode('.jpg', image)[1] + return base64.b64encode(data.tostring()).decode('utf8') + +def base64_to_cv2(b64str): + data = base64.b64decode(b64str.encode('utf8')) + data = np.fromstring(data, np.uint8) + data = cv2.imdecode(data, cv2.IMREAD_COLOR) + return data + +# 发送HTTP请求 +org_im = cv2.imread('/PATH/TO/IMAGE') +data = {'images':[cv2_to_base64(org_im)]} +headers = {"Content-type": "application/json"} +url = "http://127.0.0.1:8866/predict/openpose_body_estimation" +r = requests.post(url=url, headers=headers, data=json.dumps(data)) +canvas = base64_to_cv2(r.json()["results"]['data']) +cv2.imwrite('keypoint_body.png', canvas) +``` diff --git a/modules/audio/README.md b/modules/audio/README.md index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..20918f86fa51ed3d2e5490bfefad76ec130afb8c 100644 --- a/modules/audio/README.md +++ b/modules/audio/README.md @@ -0,0 +1,11 @@ +## **更好用户体验,建议参考WEB端官方文档 -> [【语音合成】](https://www.paddlepaddle.org.cn/hublist)** + +### 文字识别 +语音合成(TTS)任务可以实现讲文字转化为语音,已经广泛应用于各种语音交互设备中。 +- 推荐模型 + +| 模型名称 | 模型简介 | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| [语音合成transformer_tts_ljspeech](https://www.paddlepaddle.org.cn/hubdetail?name=transformer_tts_ljspeech&en_category=TextToSpeech) | TansformerTTS 对 Transformer 和 Tacotron2 进行了融合,取得了令人满意的效果,英文TTS模型,仅支持预测。 | +| [语音合成fastspeech_ljspeech](https://www.paddlepaddle.org.cn/hubdetail?name=fastspeech_ljspeech&en_category=TextToSpeech) | FastSpeech是基于encoder-decoder结构的teacher model中提取attention对角线来做发音持续时间预测,英文TTS模型,仅支持预测。 | +| [语音合成deepvoice3_ljspeech](https://www.paddlepaddle.org.cn/hubdetail?name=deepvoice3_ljspeech&en_category=TextToSpeech) | Deep Voice 3是百度研究院2017年发布的端到端的TTS模型(论文录用于ICLR 2018)。它是一个基于卷积神经网络和注意力机制的seq2seq模型,英文TTS模型,仅支持预测。| diff --git a/modules/image/Image_editing/README.md b/modules/image/Image_editing/README.md new file mode 100644 index 0000000000000000000000000000000000000000..83260149039171d28ed189dc4a98d5fa77178036 --- /dev/null +++ b/modules/image/Image_editing/README.md @@ -0,0 +1,15 @@ +## **更好用户体验,建议参考WEB端官方文档 -> [【图像编辑】](https://www.paddlepaddle.org.cn/hublist)** + + + +### 图像编辑 + +图像编辑是指在输入图像的基础上,对图像的像素点进行进一步的编辑和调整,输出新的目标图像,具体的应用场景有:超分辨率、黑白片上色,老照片修复等。 + +- 精选推荐模型 + +| 模型名称 | 模型简介 | +| ------------------------------------------------------------ | ------------------------------------------------------------ | + | [超分辨率](https://www.paddlepaddle.org.cn/hubdetail?name=realsr&en_category=ImageEditing) | 可用于图像和视频超分模型,它能够将输入的图片和视频超分四倍。 | + | [黑白图像上色](https://www.paddlepaddle.org.cn/hubdetail?name=deoldify&en_category=ImageEditing) | deoldify是用于图像和视频的着色渲染模型,该模型能够实现给黑白照片和视频恢复原彩。 | + | [老照片修复](https://www.paddlepaddle.org.cn/hubdetail?name=photo_restoration&en_category=ImageEditing) | 针对老照片修复的模型。它主要由两个部分组成:着色和超分。| diff --git a/modules/image/colorization/user_guided_colorization/data_feed.py b/modules/image/Image_editing/colorization/user_guided_colorization/data_feed.py similarity index 99% rename from modules/image/colorization/user_guided_colorization/data_feed.py rename to modules/image/Image_editing/colorization/user_guided_colorization/data_feed.py index 225d46f88bf16873b6c3c56bebefd46ca0702b42..2efe947873cffe884d278f262b1426cd86c7dcb6 100644 --- a/modules/image/colorization/user_guided_colorization/data_feed.py +++ b/modules/image/Image_editing/colorization/user_guided_colorization/data_feed.py 
@@ -130,4 +130,4 @@ class ColorizePreprocess: data['real_B_enc'] = paddle.to_tensor(data['real_B_enc'].astype(np.int64)) data['hint_B'] = paddle.to_tensor(data['hint_B'].astype(np.float32)) data['mask_B'] = paddle.to_tensor(data['mask_B'].astype(np.float32)) - return data \ No newline at end of file + return data diff --git a/modules/image/colorization/user_guided_colorization/module.py b/modules/image/Image_editing/colorization/user_guided_colorization/module.py similarity index 94% rename from modules/image/colorization/user_guided_colorization/module.py rename to modules/image/Image_editing/colorization/user_guided_colorization/module.py index 184062e4ba481c1bad449b006ac0ee07a1b629e9..13b75c3551e5a3629967be51bb1a2416df8848ee 100644 --- a/modules/image/colorization/user_guided_colorization/module.py +++ b/modules/image/Image_editing/colorization/user_guided_colorization/module.py @@ -40,7 +40,6 @@ class UserGuidedColorization(nn.Layer): load_checkpoint (str): Pretrained checkpoint path. """ - def __init__(self, use_tanh: bool = True, load_checkpoint: str = None): super(UserGuidedColorization, self).__init__() self.input_nc = 4 @@ -119,8 +118,8 @@ class UserGuidedColorization(nn.Layer): ) # Conv8 - model8up = (Conv2DTranspose(512, 256, kernel_size=4, stride=2, padding=1),) - model3short8 = (Conv2D(256, 256, 3, 1, 1),) + model8up = (Conv2DTranspose(512, 256, kernel_size=4, stride=2, padding=1), ) + model3short8 = (Conv2D(256, 256, 3, 1, 1), ) model8 = ( nn.ReLU(), Conv2D(256, 256, 3, 1, 1), @@ -131,20 +130,26 @@ class UserGuidedColorization(nn.Layer): ) # Conv9 - model9up = (Conv2DTranspose(256, 128, kernel_size=4, stride=2, padding=1),) - model2short9 = (Conv2D(128, 128, 3, 1, 1,),) + model9up = (Conv2DTranspose(256, 128, kernel_size=4, stride=2, padding=1), ) + model2short9 = (Conv2D( + 128, + 128, + 3, + 1, + 1, + ), ) model9 = (nn.ReLU(), Conv2D(128, 128, 3, 1, 1), nn.ReLU(), nn.BatchNorm(128)) # Conv10 - model10up = (Conv2DTranspose(128, 128, kernel_size=4, stride=2, padding=1),) - model1short10 = (Conv2D(64, 128, 3, 1, 1),) + model10up = (Conv2DTranspose(128, 128, kernel_size=4, stride=2, padding=1), ) + model1short10 = (Conv2D(64, 128, 3, 1, 1), ) model10 = (nn.ReLU(), Conv2D(128, 128, 3, 1, 1), nn.LeakyReLU(negative_slope=0.2)) - model_class = (Conv2D(256, 529, 1),) + model_class = (Conv2D(256, 529, 1), ) if use_tanh: model_out = (Conv2D(128, 2, 1, 1, 0, 1), nn.Tanh()) else: - model_out = (Conv2D(128, 2, 1, 1, 0, 1),) + model_out = (Conv2D(128, 2, 1, 1, 0, 1), ) self.model1 = nn.Sequential(*model1) self.model2 = nn.Sequential(*model2) @@ -178,10 +183,10 @@ class UserGuidedColorization(nn.Layer): print("load pretrained checkpoint success") def transforms(self, images: str) -> callable: - + transform = T.Compose([T.Resize((256, 256), interpolation='NEAREST'), T.RGB2LAB()], to_rgb=True) return transform(images) - + def set_config(self, classification: bool = True, prob: float = 1., num_point: int = None): self.classification = classification self.pre_func = ColorizePreprocess(ab_thresh=0., p=prob, points=num_point) @@ -221,4 +226,4 @@ class UserGuidedColorization(nn.Layer): conv10_2 = self.model10(conv10_up) out_reg = self.model_out(conv10_2) - return out_class, out_reg \ No newline at end of file + return out_class, out_reg diff --git a/modules/image/Image_gan/README.md b/modules/image/Image_gan/README.md new file mode 100644 index 0000000000000000000000000000000000000000..05f022bc6a3ee3451ebcf5f750b7f7a5dd413c2b --- /dev/null +++ b/modules/image/Image_gan/README.md @@ -0,0 +1,16 @@ 
+## **更好用户体验,建议参考WEB端官方文档 -> [【图像生成】](https://www.paddlepaddle.org.cn/hublist)** + + + +### 图像生成 + +图像生成是指根据输入向量,生成目标图像。这里的输入向量可以是随机的噪声或用户指定的条件向量。具体的应用场景有:风格迁移、图像动漫画等。 + +- 精选推荐模型 + +| 模型名称 | 模型简介 | +| ------------------------------------------------------------ | ------------------------------------------------------------ | + | [艺术风格迁移](https://www.paddlepaddle.org.cn/hubdetail?name=stylepro_artistic&en_category=GANs) | 将给定的图像转换为任意的艺术风格。确保模型高保真还原内容图片的语义细节信息与风格图片的风格信息。 | + | [图像动漫化-新海诚](https://www.paddlepaddle.org.cn/hubdetail?name=animegan_v2_shinkai_53&en_category=GANs) | AnimeGAN V2 图像风格转换模型, 模型可将输入的图像转换成新海诚动漫风格 | + | [图像动漫化-宫崎骏](https://www.paddlepaddle.org.cn/hubdetail?name=animegan_v2_hayao_64&en_category=GANs) | AnimeGAN V2 图像风格转换模型, 模型可将输入的图像转换成宫崎骏动漫风格| + | [图像动漫化-今敏红辣椒](https://www.paddlepaddle.org.cn/hubdetail?name=animegan_v2_paprika_97&en_category=GANs) | AnimeGAN V2 图像风格转换模型, 模型可将输入的图像转换成今敏红辣椒动漫风格。| diff --git a/modules/image/style_transfer/msgnet/module.py b/modules/image/Image_gan/style_transfer/msgnet/module.py similarity index 93% rename from modules/image/style_transfer/msgnet/module.py rename to modules/image/Image_gan/style_transfer/msgnet/module.py index 9953dc586b4ad6f0ab50373c0e47f9ede309c27d..4230d11a7c749ff175324e07310aac93fe090ca3 100644 --- a/modules/image/style_transfer/msgnet/module.py +++ b/modules/image/Image_gan/style_transfer/msgnet/module.py @@ -14,7 +14,6 @@ from paddlehub.module.cv_module import StyleTransferModule class GramMatrix(nn.Layer): """Calculate gram matrix""" - def forward(self, y): (b, ch, h, w) = y.shape features = y.reshape((b, ch, w * h)) @@ -25,7 +24,6 @@ class GramMatrix(nn.Layer): class ConvLayer(nn.Layer): """Basic conv layer with reflection padding layer""" - def __init__(self, in_channels: int, out_channels: int, kernel_size: int, stride: int): super(ConvLayer, self).__init__() pad = int(np.floor(kernel_size / 2)) @@ -53,7 +51,6 @@ class UpsampleConvLayer(nn.Layer): Return: img(paddle.Tensor): UpsampleConvLayer output. """ - def __init__(self, in_channels: int, out_channels: int, kernel_size: int, stride: int, upsample=None): super(UpsampleConvLayer, self).__init__() self.upsample = upsample @@ -88,7 +85,6 @@ class Bottleneck(nn.Layer): Return: img(paddle.Tensor): Bottleneck output. """ - def __init__(self, inplanes: int, planes: int, @@ -102,8 +98,8 @@ class Bottleneck(nn.Layer): self.residual_layer = nn.Conv2D(inplanes, planes * self.expansion, kernel_size=1, stride=stride) conv_block = (norm_layer(inplanes), nn.ReLU(), nn.Conv2D(inplanes, planes, kernel_size=1, stride=1), norm_layer(planes), nn.ReLU(), ConvLayer(planes, planes, kernel_size=3, stride=stride), - norm_layer(planes), nn.ReLU(), nn.Conv2D( - planes, planes * self.expansion, kernel_size=1, stride=1)) + norm_layer(planes), nn.ReLU(), nn.Conv2D(planes, planes * self.expansion, kernel_size=1, + stride=1)) self.conv_block = nn.Sequential(*conv_block) def forward(self, x: paddle.Tensor): @@ -129,12 +125,14 @@ class UpBottleneck(nn.Layer): Return: img(paddle.Tensor): UpBottleneck output. 
""" - def __init__(self, inplanes: int, planes: int, stride: int = 2, norm_layer: nn.Layer = nn.BatchNorm2D): super(UpBottleneck, self).__init__() self.expansion = 4 - self.residual_layer = UpsampleConvLayer( - inplanes, planes * self.expansion, kernel_size=1, stride=1, upsample=stride) + self.residual_layer = UpsampleConvLayer(inplanes, + planes * self.expansion, + kernel_size=1, + stride=1, + upsample=stride) conv_block = [] conv_block += [norm_layer(inplanes), nn.ReLU(), nn.Conv2D(inplanes, planes, kernel_size=1, stride=1)] conv_block += [ @@ -165,7 +163,6 @@ class Inspiration(nn.Layer): Return: img(paddle.Tensor): UpBottleneck output. """ - def __init__(self, C: int, B: int = 1): super(Inspiration, self).__init__() @@ -182,8 +179,8 @@ class Inspiration(nn.Layer): self.P = paddle.bmm(self.weight.expand_as(self.G), self.G) x = paddle.bmm( - self.P.transpose((0, 2, 1)).expand((X.shape[0], self.C, self.C)), X.reshape((X.shape[0], X.shape[1], - -1))).reshape(X.shape) + self.P.transpose((0, 2, 1)).expand((X.shape[0], self.C, self.C)), X.reshape( + (X.shape[0], X.shape[1], -1))).reshape(X.shape) return x def __repr__(self): @@ -193,7 +190,6 @@ class Inspiration(nn.Layer): class Vgg16(nn.Layer): """ First four layers from Vgg16.""" - def __init__(self): super(Vgg16, self).__init__() self.conv1_1 = nn.Conv2D(3, 64, kernel_size=3, stride=1, padding=1) @@ -268,8 +264,12 @@ class MSGNet(nn.Layer): Return: img(paddle.Tensor): MSGNet output. """ - - def __init__(self, input_nc=3, output_nc=3, ngf=128, n_blocks=6, norm_layer=nn.InstanceNorm2D, + def __init__(self, + input_nc=3, + output_nc=3, + ngf=128, + n_blocks=6, + norm_layer=nn.InstanceNorm2D, load_checkpoint=None): super(MSGNet, self).__init__() self.gram = GramMatrix() @@ -341,4 +341,4 @@ class MSGNet(nn.Layer): return self._vgg(input) def forward(self, input: paddle.Tensor): - return self.model(input) \ No newline at end of file + return self.model(input) diff --git a/modules/image/style_transfer/stylepro_artistic/README.md b/modules/image/Image_gan/style_transfer/stylepro_artistic/README.md similarity index 100% rename from modules/image/style_transfer/stylepro_artistic/README.md rename to modules/image/Image_gan/style_transfer/stylepro_artistic/README.md diff --git a/modules/image/style_transfer/stylepro_artistic/__init__.py b/modules/image/Image_gan/style_transfer/stylepro_artistic/__init__.py similarity index 100% rename from modules/image/style_transfer/stylepro_artistic/__init__.py rename to modules/image/Image_gan/style_transfer/stylepro_artistic/__init__.py diff --git a/modules/image/style_transfer/stylepro_artistic/data_feed.py b/modules/image/Image_gan/style_transfer/stylepro_artistic/data_feed.py similarity index 100% rename from modules/image/style_transfer/stylepro_artistic/data_feed.py rename to modules/image/Image_gan/style_transfer/stylepro_artistic/data_feed.py diff --git a/modules/image/Image_gan/style_transfer/stylepro_artistic/decoder_network.py b/modules/image/Image_gan/style_transfer/stylepro_artistic/decoder_network.py new file mode 100644 index 0000000000000000000000000000000000000000..f349f5545265fb646185fc06890eb772c2c75aa1 --- /dev/null +++ b/modules/image/Image_gan/style_transfer/stylepro_artistic/decoder_network.py @@ -0,0 +1,173 @@ +# coding=utf-8 +from paddle.fluid.initializer import Constant +from paddle.fluid.param_attr import ParamAttr +import paddle.fluid as fluid + + +def decoder_net(): + x2paddle_22 = fluid.layers.create_parameter(dtype='float32', + shape=[4], + name='x2paddle_22', + attr='x2paddle_22', + 
default_initializer=Constant(0.0)) + x2paddle_36 = fluid.layers.create_parameter(dtype='float32', + shape=[4], + name='x2paddle_36', + attr='x2paddle_36', + default_initializer=Constant(0.0)) + x2paddle_44 = fluid.layers.create_parameter(dtype='float32', + shape=[4], + name='x2paddle_44', + attr='x2paddle_44', + default_initializer=Constant(0.0)) + x2paddle_input_1 = fluid.layers.data(dtype='float32', + shape=[1, 512, 64, 64], + name='x2paddle_input_1', + append_batch_size=False) + x2paddle_19 = fluid.layers.pad2d(x2paddle_input_1, + pad_value=0.0, + mode='reflect', + paddings=[1, 1, 1, 1], + name='x2paddle_19') + x2paddle_20 = fluid.layers.conv2d(x2paddle_19, + num_filters=256, + filter_size=[3, 3], + stride=[1, 1], + padding=[0, 0], + dilation=[1, 1], + groups=1, + param_attr='x2paddle_1', + name='x2paddle_20', + bias_attr='x2paddle_2') + x2paddle_21 = fluid.layers.relu(x2paddle_20, name='x2paddle_21') + x2paddle_23 = fluid.layers.resize_nearest(x2paddle_21, name='x2paddle_23', out_shape=[128, 128]) + x2paddle_24 = fluid.layers.pad2d(x2paddle_23, + pad_value=0.0, + mode='reflect', + paddings=[1, 1, 1, 1], + name='x2paddle_24') + x2paddle_25 = fluid.layers.conv2d(x2paddle_24, + num_filters=256, + filter_size=[3, 3], + stride=[1, 1], + padding=[0, 0], + dilation=[1, 1], + groups=1, + param_attr='x2paddle_3', + name='x2paddle_25', + bias_attr='x2paddle_4') + x2paddle_26 = fluid.layers.relu(x2paddle_25, name='x2paddle_26') + x2paddle_27 = fluid.layers.pad2d(x2paddle_26, + pad_value=0.0, + mode='reflect', + paddings=[1, 1, 1, 1], + name='x2paddle_27') + x2paddle_28 = fluid.layers.conv2d(x2paddle_27, + num_filters=256, + filter_size=[3, 3], + stride=[1, 1], + padding=[0, 0], + dilation=[1, 1], + groups=1, + param_attr='x2paddle_5', + name='x2paddle_28', + bias_attr='x2paddle_6') + x2paddle_29 = fluid.layers.relu(x2paddle_28, name='x2paddle_29') + x2paddle_30 = fluid.layers.pad2d(x2paddle_29, + pad_value=0.0, + mode='reflect', + paddings=[1, 1, 1, 1], + name='x2paddle_30') + x2paddle_31 = fluid.layers.conv2d(x2paddle_30, + num_filters=256, + filter_size=[3, 3], + stride=[1, 1], + padding=[0, 0], + dilation=[1, 1], + groups=1, + param_attr='x2paddle_7', + name='x2paddle_31', + bias_attr='x2paddle_8') + x2paddle_32 = fluid.layers.relu(x2paddle_31, name='x2paddle_32') + x2paddle_33 = fluid.layers.pad2d(x2paddle_32, + pad_value=0.0, + mode='reflect', + paddings=[1, 1, 1, 1], + name='x2paddle_33') + x2paddle_34 = fluid.layers.conv2d(x2paddle_33, + num_filters=128, + filter_size=[3, 3], + stride=[1, 1], + padding=[0, 0], + dilation=[1, 1], + groups=1, + param_attr='x2paddle_9', + name='x2paddle_34', + bias_attr='x2paddle_10') + x2paddle_35 = fluid.layers.relu(x2paddle_34, name='x2paddle_35') + x2paddle_37 = fluid.layers.resize_nearest(x2paddle_35, name='x2paddle_37', out_shape=[256, 256]) + x2paddle_38 = fluid.layers.pad2d(x2paddle_37, + pad_value=0.0, + mode='reflect', + paddings=[1, 1, 1, 1], + name='x2paddle_38') + x2paddle_39 = fluid.layers.conv2d(x2paddle_38, + num_filters=128, + filter_size=[3, 3], + stride=[1, 1], + padding=[0, 0], + dilation=[1, 1], + groups=1, + param_attr='x2paddle_11', + name='x2paddle_39', + bias_attr='x2paddle_12') + x2paddle_40 = fluid.layers.relu(x2paddle_39, name='x2paddle_40') + x2paddle_41 = fluid.layers.pad2d(x2paddle_40, + pad_value=0.0, + mode='reflect', + paddings=[1, 1, 1, 1], + name='x2paddle_41') + x2paddle_42 = fluid.layers.conv2d(x2paddle_41, + num_filters=64, + filter_size=[3, 3], + stride=[1, 1], + padding=[0, 0], + dilation=[1, 1], + groups=1, + 
param_attr='x2paddle_13', + name='x2paddle_42', + bias_attr='x2paddle_14') + x2paddle_43 = fluid.layers.relu(x2paddle_42, name='x2paddle_43') + x2paddle_45 = fluid.layers.resize_nearest(x2paddle_43, name='x2paddle_45', out_shape=[512, 512]) + x2paddle_46 = fluid.layers.pad2d(x2paddle_45, + pad_value=0.0, + mode='reflect', + paddings=[1, 1, 1, 1], + name='x2paddle_46') + x2paddle_47 = fluid.layers.conv2d(x2paddle_46, + num_filters=64, + filter_size=[3, 3], + stride=[1, 1], + padding=[0, 0], + dilation=[1, 1], + groups=1, + param_attr='x2paddle_15', + name='x2paddle_47', + bias_attr='x2paddle_16') + x2paddle_48 = fluid.layers.relu(x2paddle_47, name='x2paddle_48') + x2paddle_49 = fluid.layers.pad2d(x2paddle_48, + pad_value=0.0, + mode='reflect', + paddings=[1, 1, 1, 1], + name='x2paddle_49') + x2paddle_50 = fluid.layers.conv2d(x2paddle_49, + num_filters=3, + filter_size=[3, 3], + stride=[1, 1], + padding=[0, 0], + dilation=[1, 1], + groups=1, + param_attr='x2paddle_17', + name='x2paddle_50', + bias_attr='x2paddle_18') + return x2paddle_input_1, x2paddle_50 diff --git a/modules/image/Image_gan/style_transfer/stylepro_artistic/encoder_network.py b/modules/image/Image_gan/style_transfer/stylepro_artistic/encoder_network.py new file mode 100644 index 0000000000000000000000000000000000000000..f006b8ccfe23cb3ede6c111d35b0cd2c77ebaa96 --- /dev/null +++ b/modules/image/Image_gan/style_transfer/stylepro_artistic/encoder_network.py @@ -0,0 +1,187 @@ +# coding=utf-8 +from paddle.fluid.initializer import Constant +from paddle.fluid.param_attr import ParamAttr +import paddle.fluid as fluid + + +def encoder_net(): + x2paddle_0 = fluid.layers.data(dtype='float32', shape=[1, 3, 512, 512], name='x2paddle_0', append_batch_size=False) + x2paddle_21 = fluid.layers.conv2d(x2paddle_0, + num_filters=3, + filter_size=[1, 1], + stride=[1, 1], + padding=[0, 0], + dilation=[1, 1], + groups=1, + param_attr='x2paddle_1', + name='x2paddle_21', + bias_attr='x2paddle_2') + x2paddle_22 = fluid.layers.pad2d(x2paddle_21, + pad_value=0.0, + mode='reflect', + paddings=[1, 1, 1, 1], + name='x2paddle_22') + x2paddle_23 = fluid.layers.conv2d(x2paddle_22, + num_filters=64, + filter_size=[3, 3], + stride=[1, 1], + padding=[0, 0], + dilation=[1, 1], + groups=1, + param_attr='x2paddle_3', + name='x2paddle_23', + bias_attr='x2paddle_4') + x2paddle_24 = fluid.layers.relu(x2paddle_23, name='x2paddle_24') + x2paddle_25 = fluid.layers.pad2d(x2paddle_24, + pad_value=0.0, + mode='reflect', + paddings=[1, 1, 1, 1], + name='x2paddle_25') + x2paddle_26 = fluid.layers.conv2d(x2paddle_25, + num_filters=64, + filter_size=[3, 3], + stride=[1, 1], + padding=[0, 0], + dilation=[1, 1], + groups=1, + param_attr='x2paddle_5', + name='x2paddle_26', + bias_attr='x2paddle_6') + x2paddle_27 = fluid.layers.relu(x2paddle_26, name='x2paddle_27') + x2paddle_28 = fluid.layers.pool2d(x2paddle_27, + pool_size=[2, 2], + pool_type='max', + pool_stride=[2, 2], + pool_padding=[0, 0], + ceil_mode=False, + name='x2paddle_28', + exclusive=False) + x2paddle_29 = fluid.layers.pad2d(x2paddle_28, + pad_value=0.0, + mode='reflect', + paddings=[1, 1, 1, 1], + name='x2paddle_29') + x2paddle_30 = fluid.layers.conv2d(x2paddle_29, + num_filters=128, + filter_size=[3, 3], + stride=[1, 1], + padding=[0, 0], + dilation=[1, 1], + groups=1, + param_attr='x2paddle_7', + name='x2paddle_30', + bias_attr='x2paddle_8') + x2paddle_31 = fluid.layers.relu(x2paddle_30, name='x2paddle_31') + x2paddle_32 = fluid.layers.pad2d(x2paddle_31, + pad_value=0.0, + mode='reflect', + paddings=[1, 1, 1, 1], 
+ name='x2paddle_32') + x2paddle_33 = fluid.layers.conv2d(x2paddle_32, + num_filters=128, + filter_size=[3, 3], + stride=[1, 1], + padding=[0, 0], + dilation=[1, 1], + groups=1, + param_attr='x2paddle_9', + name='x2paddle_33', + bias_attr='x2paddle_10') + x2paddle_34 = fluid.layers.relu(x2paddle_33, name='x2paddle_34') + x2paddle_35 = fluid.layers.pool2d(x2paddle_34, + pool_size=[2, 2], + pool_type='max', + pool_stride=[2, 2], + pool_padding=[0, 0], + ceil_mode=False, + name='x2paddle_35', + exclusive=False) + x2paddle_36 = fluid.layers.pad2d(x2paddle_35, + pad_value=0.0, + mode='reflect', + paddings=[1, 1, 1, 1], + name='x2paddle_36') + x2paddle_37 = fluid.layers.conv2d(x2paddle_36, + num_filters=256, + filter_size=[3, 3], + stride=[1, 1], + padding=[0, 0], + dilation=[1, 1], + groups=1, + param_attr='x2paddle_11', + name='x2paddle_37', + bias_attr='x2paddle_12') + x2paddle_38 = fluid.layers.relu(x2paddle_37, name='x2paddle_38') + x2paddle_39 = fluid.layers.pad2d(x2paddle_38, + pad_value=0.0, + mode='reflect', + paddings=[1, 1, 1, 1], + name='x2paddle_39') + x2paddle_40 = fluid.layers.conv2d(x2paddle_39, + num_filters=256, + filter_size=[3, 3], + stride=[1, 1], + padding=[0, 0], + dilation=[1, 1], + groups=1, + param_attr='x2paddle_13', + name='x2paddle_40', + bias_attr='x2paddle_14') + x2paddle_41 = fluid.layers.relu(x2paddle_40, name='x2paddle_41') + x2paddle_42 = fluid.layers.pad2d(x2paddle_41, + pad_value=0.0, + mode='reflect', + paddings=[1, 1, 1, 1], + name='x2paddle_42') + x2paddle_43 = fluid.layers.conv2d(x2paddle_42, + num_filters=256, + filter_size=[3, 3], + stride=[1, 1], + padding=[0, 0], + dilation=[1, 1], + groups=1, + param_attr='x2paddle_15', + name='x2paddle_43', + bias_attr='x2paddle_16') + x2paddle_44 = fluid.layers.relu(x2paddle_43, name='x2paddle_44') + x2paddle_45 = fluid.layers.pad2d(x2paddle_44, + pad_value=0.0, + mode='reflect', + paddings=[1, 1, 1, 1], + name='x2paddle_45') + x2paddle_46 = fluid.layers.conv2d(x2paddle_45, + num_filters=256, + filter_size=[3, 3], + stride=[1, 1], + padding=[0, 0], + dilation=[1, 1], + groups=1, + param_attr='x2paddle_17', + name='x2paddle_46', + bias_attr='x2paddle_18') + x2paddle_47 = fluid.layers.relu(x2paddle_46, name='x2paddle_47') + x2paddle_48 = fluid.layers.pool2d(x2paddle_47, + pool_size=[2, 2], + pool_type='max', + pool_stride=[2, 2], + pool_padding=[0, 0], + ceil_mode=False, + name='x2paddle_48', + exclusive=False) + x2paddle_49 = fluid.layers.pad2d(x2paddle_48, + pad_value=0.0, + mode='reflect', + paddings=[1, 1, 1, 1], + name='x2paddle_49') + x2paddle_50 = fluid.layers.conv2d(x2paddle_49, + num_filters=512, + filter_size=[3, 3], + stride=[1, 1], + padding=[0, 0], + dilation=[1, 1], + groups=1, + param_attr='x2paddle_19', + name='x2paddle_50', + bias_attr='x2paddle_20') + x2paddle_51 = fluid.layers.relu(x2paddle_50, name='x2paddle_51') + return x2paddle_0, x2paddle_51 diff --git a/modules/image/style_transfer/stylepro_artistic/module.py b/modules/image/Image_gan/style_transfer/stylepro_artistic/module.py similarity index 75% rename from modules/image/style_transfer/stylepro_artistic/module.py rename to modules/image/Image_gan/style_transfer/stylepro_artistic/module.py index b6739014d2f92eeabfa99d7993aa44e6dc0e8cf6..95ea138c1b9fa60c7663b6ca152c8123e8a3db3a 100644 --- a/modules/image/style_transfer/stylepro_artistic/module.py +++ b/modules/image/Image_gan/style_transfer/stylepro_artistic/module.py @@ -140,14 +140,13 @@ class StyleProjection(hub.Module): encode_program, encode_feeded_var_names, encode_target_vars = 
fluid.io.load_inference_model( dirname=self.pretrained_encoder_net, executor=exe) - fluid.io.save_inference_model( - dirname=dirname, - main_program=encode_program, - executor=exe, - feeded_var_names=encode_feeded_var_names, - target_vars=encode_target_vars, - model_filename=model_filename, - params_filename=params_filename) + fluid.io.save_inference_model(dirname=dirname, + main_program=encode_program, + executor=exe, + feeded_var_names=encode_feeded_var_names, + target_vars=encode_target_vars, + model_filename=model_filename, + params_filename=params_filename) def _save_decode_model(self, dirname, model_filename=None, params_filename=None, combined=True): if combined: @@ -159,14 +158,13 @@ class StyleProjection(hub.Module): decode_program, decode_feeded_var_names, decode_target_vars = fluid.io.load_inference_model( dirname=self.pretrained_decoder_net, executor=exe) - fluid.io.save_inference_model( - dirname=dirname, - main_program=decode_program, - executor=exe, - feeded_var_names=decode_feeded_var_names, - target_vars=decode_target_vars, - model_filename=model_filename, - params_filename=params_filename) + fluid.io.save_inference_model(dirname=dirname, + main_program=decode_program, + executor=exe, + feeded_var_names=decode_feeded_var_names, + target_vars=decode_target_vars, + model_filename=model_filename, + params_filename=params_filename) @serving def serving_method(self, images, **kwargs): @@ -186,11 +184,10 @@ class StyleProjection(hub.Module): """ Run as a command. """ - self.parser = argparse.ArgumentParser( - description="Run the {} module.".format(self.name), - prog='hub run {}'.format(self.name), - usage='%(prog)s', - add_help=True) + self.parser = argparse.ArgumentParser(description="Run the {} module.".format(self.name), + prog='hub run {}'.format(self.name), + usage='%(prog)s', + add_help=True) self.arg_input_group = self.parser.add_argument_group(title="Input options", description="Input data. Required") self.arg_config_group = self.parser.add_argument_group( @@ -202,20 +199,29 @@ class StyleProjection(hub.Module): paths = [{'content': args.content, 'styles': args.styles.split(',')}] else: paths = [{'content': args.content, 'styles': args.styles.split(','), 'weights': list(args.weights)}] - results = self.style_transfer( - paths=paths, alpha=args.alpha, use_gpu=args.use_gpu, output_dir=args.output_dir, visualization=True) + results = self.style_transfer(paths=paths, + alpha=args.alpha, + use_gpu=args.use_gpu, + output_dir=args.output_dir, + visualization=True) return results def add_module_config_arg(self): """ Add the command config options. 
""" - self.arg_config_group.add_argument( - '--use_gpu', type=ast.literal_eval, default=False, help="whether use GPU or not") - self.arg_config_group.add_argument( - '--output_dir', type=str, default='transfer_result', help="The directory to save output images.") - self.arg_config_group.add_argument( - '--visualization', type=ast.literal_eval, default=True, help="whether to save output as images.") + self.arg_config_group.add_argument('--use_gpu', + type=ast.literal_eval, + default=False, + help="whether use GPU or not") + self.arg_config_group.add_argument('--output_dir', + type=str, + default='transfer_result', + help="The directory to save output images.") + self.arg_config_group.add_argument('--visualization', + type=ast.literal_eval, + default=True, + help="whether to save output as images.") def add_module_input_arg(self): """ @@ -223,7 +229,11 @@ class StyleProjection(hub.Module): """ self.arg_input_group.add_argument('--content', type=str, help="path to content.") self.arg_input_group.add_argument('--styles', type=str, help="path to styles.") - self.arg_input_group.add_argument( - '--weights', type=ast.literal_eval, default=None, help="interpolation weights of styles.") - self.arg_config_group.add_argument( - '--alpha', type=ast.literal_eval, default=1, help="The parameter to control the tranform degree.") + self.arg_input_group.add_argument('--weights', + type=ast.literal_eval, + default=None, + help="interpolation weights of styles.") + self.arg_config_group.add_argument('--alpha', + type=ast.literal_eval, + default=1, + help="The parameter to control the tranform degree.") diff --git a/modules/image/style_transfer/stylepro_artistic/processor.py b/modules/image/Image_gan/style_transfer/stylepro_artistic/processor.py similarity index 100% rename from modules/image/style_transfer/stylepro_artistic/processor.py rename to modules/image/Image_gan/style_transfer/stylepro_artistic/processor.py diff --git a/modules/image/classification/README.md b/modules/image/classification/README.md index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..1793b4bfa76fcb1f42619d27d96583cd7fe94d6a 100644 --- a/modules/image/classification/README.md +++ b/modules/image/classification/README.md @@ -0,0 +1,36 @@ + +## **更好用户体验,建议参考WEB端官方文档 -> [【图像分类】](https://www.paddlepaddle.org.cn/hublist)** + + +### 图像分类 +图像分类是根据图像的语义信息对不同类别图像进行区分,是计算机视觉中重要的基础问题,是物体检测、图像分割、物体跟踪、行为分析、人脸识别等其他高层视觉任务的基础,在许多领域都有着广泛的应用。如:安防领域的人脸识别和智能视频分析等,交通领域的交通场景识别,互联网领域基于内容的图像检索和相册自动归类,医学领域的图像识别等。 + +**注:** **如果你是资深开发者,那可以随意按需使用**,**假如你是新手,服务器端优先选择Resnet50,移动端优先选择MobileNetV3** + +- 精选模型推荐 + +| | **模型名称** | **模型特色** | +| ---------- | :----------------------------------------------------------- | ---------------------------------------------------------- | +| 图像分类 | [菜品识别](https://www.paddlepaddle.org.cn/hubdetail?name=resnet50_vd_dishes&en_category=ImageClassification) | 私有数据集训练,支持8416种菜品的分类识别,适合进一步菜品方向微调 | +| 图像分类 | [动物识别](https://www.paddlepaddle.org.cn/hubdetail?name=resnet50_vd_animals&en_category=ImageClassification) | 私有数据集训练,支持7978种动物的分类识别,适合进一步动物方向微调 | +| 图像分类 | [野生动物制品识别](https://www.paddlepaddle.org.cn/hubdetail?name=resnet50_vd_wildanimals&en_category=ImageClassification) | 支持'象牙制品', '象牙', '大象', '虎皮', '老虎', '虎牙/虎爪/虎骨', '穿山甲甲片', '穿山甲', '穿山甲爪子', '其他' 这十个标签的识别。 | + + +- 更多模型 + +| **模型名称** | **模型简介** | +| - | - | +| [AlexNet](https://www.paddlepaddle.org.cn/hubdetail?name=alexnet_imagenet&en_category=ImageClassification) | 首次在 CNN 中成功的应用了 ReLU, Dropout 和 LRN,并使用 GPU 进行运算加速 | +| 
[VGG19](https://www.paddlepaddle.org.cn/hubdetail?name=vgg19_imagenet&en_category=ImageClassification) | 在 AlexNet 的基础上使用 3*3 小卷积核,增加网络深度,具有很好的泛化能力 | +| [GoogLeNet](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleCV/image_classification) | 在不增加计算负载的前提下增加了网络的深度和宽度,性能更加优越 | +| [ResNet50](https://www.paddlepaddle.org.cn/hubdetail?name=resnet_v2_50_imagenet&en_category=ImageClassification) | Residual Network,引入了新的残差结构,解决了随着网络加深,准确率下降的问题 | +| [Inceptionv4](https://www.paddlepaddle.org.cn/hubdetail?name=inception_v4_imagenet&en_category=ImageClassification) | 将 Inception 模块与 Residual Connection 进行结合,通过ResNet的结构极大地加速训练并获得性能的提升 | +| [MobileNetV2](https://www.paddlepaddle.org.cn/hubdetail?name=mobilenet_v2_imagenet&en_category=ImageClassification) | MobileNet结构的微调,直接在 thinner 的 bottleneck层上进行 skip learning 连接以及对 bottleneck layer 不进行 ReLu 非线性处理可取得更好的结果 | +| [se_resnext50](https://www.paddlepaddle.org.cn/hubdetail?name=se_resnext50_32x4d_imagenet&en_category=ImageClassification) | 在ResNeXt 基础、上加入了 SE(Sequeeze-and-Excitation) 模块,提高了识别准确率,在 ILSVRC 2017 的分类项目中取得了第一名 | +| [ShuffleNetV2](https://www.paddlepaddle.org.cn/hubdetail?name=shufflenet_v2_imagenet&en_category=ImageClassification) | ECCV2018,轻量级 CNN 网络,在速度和准确度之间做了很好地平衡。在同等复杂度下,比 ShuffleNet 和 MobileNetv2 更准确,更适合移动端以及无人车领域 | +| [efficientNetb7](https://www.paddlepaddle.org.cn/hubdetail?name=efficientnetb7_imagenet&en_category=ImageClassification) | 同时对模型的分辨率,通道数和深度进行缩放,用极少的参数就可以达到SOTA的精度。 | +| [xception71](https://www.paddlepaddle.org.cn/hubdetail?name=xception71_imagenet&en_category=ImageClassification) | 对inception-v3的改进,用深度可分离卷积代替普通卷积,降低参数量同时提高了精度。 | +| [dpn107](https://www.paddlepaddle.org.cn/hubdetail?name=dpn107_imagenet&en_category=ImageClassification) | 融合了densenet和resnext的特点。 | +| [DarkNet53](https://www.paddlepaddle.org.cn/hubdetail?name=darknet53_imagenet&en_category=ImageClassification) | 检测框架yolov3使用的backbone,在分类和检测任务上都有不错表现。 | +| [DenseNet161](https://www.paddlepaddle.org.cn/hubdetail?name=densenet161_imagenet&en_category=ImageClassification) | 提出了密集连接的网络结构,更加有利于信息流的传递。 | +| [ResNeXt152_vd](https://www.paddlepaddle.org.cn/hubdetail?name=resnext152_64x4d_imagenet&en_category=ImageClassification) | 提出了cardinatity的概念,用于作为模型复杂度的另外一个度量,有效地提升模型精度。 | diff --git a/modules/image/face_detection/README.md b/modules/image/face_detection/README.md index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..91714e79ebc584b39f863ca1e9d43c596fb1057d 100644 --- a/modules/image/face_detection/README.md +++ b/modules/image/face_detection/README.md @@ -0,0 +1,12 @@ +## **更好用户体验,建议参考WEB端官方文档 -> [【人脸检测】](https://www.paddlepaddle.org.cn/hublist)** + +### 人脸检测 +人脸检测属于目标检测的一个重要分支,由于近年来安防市场、人脸识别、人脸安全方面的原因,成为目标检测中最重要的任务之一。 + +- 推荐模型 + +| 模型名称 | 模型简介 | +| ------------------------------------------------------------ | ------------------------------------------------------------ | + | [人脸检测](https://www.paddlepaddle.org.cn/hubdetail?name=pyramidbox_lite_server&en_category=FaceDetection) | 百度自研,18年3月WIDER Face 数据集**冠军模型**, | +| [超轻量人脸检测](https://www.paddlepaddle.org.cn/hubdetail?name=ultra_light_fast_generic_face_detector_1mb_640&en_category=FaceDetection) | 针对边缘计算设备或低算力设备(如用ARM推理)设计的实时超轻量级通用人脸检测模型,可以在低算力设备中如用ARM进行实时的通用场景的人脸检测推理。 | +| [口罩人脸检测与识别](https://www.paddlepaddle.org.cn/hubdetail?name=pyramidbox_lite_server_mask&en_category=FaceDetection) | 业界**首个开源口罩人脸检测与识别模型**,引起广泛关注。 | diff --git a/modules/image/gan/README.md b/modules/image/gan/README.md deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 
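上面新增的图像分类与人脸检测 README 只列出了推荐模型,未展示调用方式。下面补充一个最小用法示意(仅为草图:假设已执行 `hub install resnet50_vd_dishes` 与 `hub install pyramidbox_lite_server`,示例图片路径为占位,方法名与返回字段请以各模型详情页为准):

```python
# 用法示意:通过 PaddleHub Python API 调用上表中的推荐模型
import cv2
import paddlehub as hub

# 菜品识别(图像分类),images 接收 BGR ndarray 列表
classifier = hub.Module(name="resnet50_vd_dishes")
cls_results = classifier.classification(images=[cv2.imread("dish.jpg")])  # dish.jpg 为占位路径

# 人脸检测
detector = hub.Module(name="pyramidbox_lite_server")
face_results = detector.face_detection(images=[cv2.imread("crowd.jpg")])  # crowd.jpg 为占位路径

print(cls_results)
print(face_results)
```

也可以通过命令行一键预测,例如 `hub run resnet50_vd_dishes --input_path dish.jpg`(具体参数以各模块文档为准)。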
diff --git a/modules/image/keypoint_detection/README.md b/modules/image/keypoint_detection/README.md index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..01021c9b7e1381330b7bd5e26dd52c4cf323494c 100644 --- a/modules/image/keypoint_detection/README.md +++ b/modules/image/keypoint_detection/README.md @@ -0,0 +1,14 @@ +## **更好用户体验,建议参考WEB端官方文档 -> [【关键点检测】](https://www.paddlepaddle.org.cn/hublist)** + +#### 关键点检测 + +人体骨骼关键点检测 (Pose Estimation) 主要检测人体的一些关键点,如关节,五官等,通过关键点描述人体骨骼信息。人体骨骼关键点检测对于描述人体姿态,预测人体行为至关重要。是诸多计算机视觉任务的基础,例如动作分类,异常行为检测,以及自动驾驶等等。 + +- 精选模型 + +| 模型名称 | 模型简介 | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| [单人--人体骨骼关键点检测](https://www.paddlepaddle.org.cn/hubdetail?name=human_pose_estimation_resnet50_mpii&en_category=KeyPointDetection) | 可用于行为识别、人物跟踪、步态识别等相关领域。具体应用主要集中在智能视频监控,病人监护系统,人机交互,虚拟现实,人体动画,智能家居,智能安防,运动员辅助训练等等。 | +| [多人-人体骨骼关键点检测](https://www.paddlepaddle.org.cn/hubdetail?name=openpose_body_estimation&en_category=KeyPointDetection) | 可用于行为识别、人物跟踪、步态识别等相关领域。具体应用主要集中在智能视频监控,病人监护系统,人机交互,虚拟现实,人体动画,智能家居,智能安防,运动员辅助训练等等。 | +| [面部关键点检测](https://www.paddlepaddle.org.cn/hubdetail?name=face_landmark_localization&en_category=KeyPointDetection) |可用于人脸识别、表情分析、三维人脸重建及三维动画等其它人脸相关问题,支持同一张图中的多个人脸检测 | +| [手部关键点检测](https://www.paddlepaddle.org.cn/hubdetail?name=hand_pose_localization&en_category=KeyPointDetection) |可用于手势识别,配合人体骨骼关键点,可用于异常行为检测等多种场景 | diff --git a/modules/image/object_detection/README.md b/modules/image/object_detection/README.md index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..474ad3e9c99bb174f2843fe15cd2f8006c229042 100644 --- a/modules/image/object_detection/README.md +++ b/modules/image/object_detection/README.md @@ -0,0 +1,15 @@ +## **更好用户体验,建议参考WEB端官方文档 -> [【目标检测】](https://www.paddlepaddle.org.cn/hublist)** + + + +### 目标检测 + +目标检测任务的目标是给定一张图像或是一个视频帧,让计算机找出其中所有目标的位置,并给出每个目标的具体类别。对于计算机而言,能够“看到”的是图像被编码之后的数字,但很难解图像或是视频帧中出现了人或是物体这样的高层语义概念,也就更加难以定位目标出现在图像中哪个区域。 + +- 精选推荐模型 + +| 模型名称 | 模型简介 | +| ------------------------------------------------------------ | ------------------------------------------------------------ | + | [YOLOv3](https://www.paddlepaddle.org.cn/hubdetail?name=yolov3_darknet53_coco2017&en_category=ObjectDetection) | 实现精度相比原作者**提高5.9 个绝对百分点**,性能极致优化。 | + | [行人检测](https://www.paddlepaddle.org.cn/hubdetail?name=yolov3_darknet53_pedestrian&en_category=ObjectDetection) | 百度自研模型,海量私有数据集训练,可以应用于智能视频监控,人体行为分析,客流统计系统,智能交通等领域 | + | [车辆检测](https://www.paddlepaddle.org.cn/hubdetail?name=yolov3_darknet53_vehicles&en_category=ObjectDetection) | 百度自研模型,支持car (汽车),truck (卡车),bus (公交车),motorbike (摩托车),tricycle (三轮车)等车型的识别 | diff --git a/modules/image/semantic_segmentation/README.md b/modules/image/semantic_segmentation/README.md index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..8e35bc8ea449795c1369283ce9d28c6eb85785cc 100644 --- a/modules/image/semantic_segmentation/README.md +++ b/modules/image/semantic_segmentation/README.md @@ -0,0 +1,12 @@ +## **更好用户体验,建议参考WEB端官方文档 -> [【图像分割】](https://www.paddlepaddle.org.cn/hublist)** + +### 图像分割 + +图像语义分割顾名思义是将图像像素按照表达的语义含义的不同进行分组/分割,图像语义是指对图像内容的理解,例如,能够描绘出什么物体在哪里做了什么事情等,分割是指对图片中的每个像素点进行标注,标注属于哪一类别。近年来用在无人车驾驶技术中分割街景来避让行人和车辆、医疗影像分析中辅助诊断等。 + +- 精选模型 +| 模型名称 | 模型简介 | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| [人像分割](https://www.paddlepaddle.org.cn/hubdetail?name=deeplabv3p_xception65_humanseg&en_category=ImageSegmentation) | 百度**自建数据集**训练,人像分割效果卓越。 | +| 
[人体解析](https://www.paddlepaddle.org.cn/hubdetail?name=ace2p&en_category=ImageSegmentation) | CVPR2019 LIP挑战赛中**满贯三冠王**。人体解析任务必选。 | +| [肺炎CT影像分析](https://www.paddlepaddle.org.cn/hubdetail?name=Pneumonia_CT_LKM_PP&en_category=ImageSegmentation) | 助力连心医疗开源**业界首个**肺炎CT影像分析模型 diff --git a/modules/image/style_transfer/stylepro_artistic/decoder_network.py b/modules/image/style_transfer/stylepro_artistic/decoder_network.py deleted file mode 100644 index 99a67c0aa44869ff989f7eaa176e42e179120896..0000000000000000000000000000000000000000 --- a/modules/image/style_transfer/stylepro_artistic/decoder_network.py +++ /dev/null @@ -1,144 +0,0 @@ -# coding=utf-8 -from paddle.fluid.initializer import Constant -from paddle.fluid.param_attr import ParamAttr -import paddle.fluid as fluid - - -def decoder_net(): - x2paddle_22 = fluid.layers.create_parameter( - dtype='float32', shape=[4], name='x2paddle_22', attr='x2paddle_22', default_initializer=Constant(0.0)) - x2paddle_36 = fluid.layers.create_parameter( - dtype='float32', shape=[4], name='x2paddle_36', attr='x2paddle_36', default_initializer=Constant(0.0)) - x2paddle_44 = fluid.layers.create_parameter( - dtype='float32', shape=[4], name='x2paddle_44', attr='x2paddle_44', default_initializer=Constant(0.0)) - x2paddle_input_1 = fluid.layers.data( - dtype='float32', shape=[1, 512, 64, 64], name='x2paddle_input_1', append_batch_size=False) - x2paddle_19 = fluid.layers.pad2d( - x2paddle_input_1, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_19') - x2paddle_20 = fluid.layers.conv2d( - x2paddle_19, - num_filters=256, - filter_size=[3, 3], - stride=[1, 1], - padding=[0, 0], - dilation=[1, 1], - groups=1, - param_attr='x2paddle_1', - name='x2paddle_20', - bias_attr='x2paddle_2') - x2paddle_21 = fluid.layers.relu(x2paddle_20, name='x2paddle_21') - x2paddle_23 = fluid.layers.resize_nearest(x2paddle_21, name='x2paddle_23', out_shape=[128, 128]) - x2paddle_24 = fluid.layers.pad2d( - x2paddle_23, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_24') - x2paddle_25 = fluid.layers.conv2d( - x2paddle_24, - num_filters=256, - filter_size=[3, 3], - stride=[1, 1], - padding=[0, 0], - dilation=[1, 1], - groups=1, - param_attr='x2paddle_3', - name='x2paddle_25', - bias_attr='x2paddle_4') - x2paddle_26 = fluid.layers.relu(x2paddle_25, name='x2paddle_26') - x2paddle_27 = fluid.layers.pad2d( - x2paddle_26, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_27') - x2paddle_28 = fluid.layers.conv2d( - x2paddle_27, - num_filters=256, - filter_size=[3, 3], - stride=[1, 1], - padding=[0, 0], - dilation=[1, 1], - groups=1, - param_attr='x2paddle_5', - name='x2paddle_28', - bias_attr='x2paddle_6') - x2paddle_29 = fluid.layers.relu(x2paddle_28, name='x2paddle_29') - x2paddle_30 = fluid.layers.pad2d( - x2paddle_29, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_30') - x2paddle_31 = fluid.layers.conv2d( - x2paddle_30, - num_filters=256, - filter_size=[3, 3], - stride=[1, 1], - padding=[0, 0], - dilation=[1, 1], - groups=1, - param_attr='x2paddle_7', - name='x2paddle_31', - bias_attr='x2paddle_8') - x2paddle_32 = fluid.layers.relu(x2paddle_31, name='x2paddle_32') - x2paddle_33 = fluid.layers.pad2d( - x2paddle_32, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_33') - x2paddle_34 = fluid.layers.conv2d( - x2paddle_33, - num_filters=128, - filter_size=[3, 3], - stride=[1, 1], - padding=[0, 0], - dilation=[1, 1], - groups=1, - param_attr='x2paddle_9', - name='x2paddle_34', - 
bias_attr='x2paddle_10') - x2paddle_35 = fluid.layers.relu(x2paddle_34, name='x2paddle_35') - x2paddle_37 = fluid.layers.resize_nearest(x2paddle_35, name='x2paddle_37', out_shape=[256, 256]) - x2paddle_38 = fluid.layers.pad2d( - x2paddle_37, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_38') - x2paddle_39 = fluid.layers.conv2d( - x2paddle_38, - num_filters=128, - filter_size=[3, 3], - stride=[1, 1], - padding=[0, 0], - dilation=[1, 1], - groups=1, - param_attr='x2paddle_11', - name='x2paddle_39', - bias_attr='x2paddle_12') - x2paddle_40 = fluid.layers.relu(x2paddle_39, name='x2paddle_40') - x2paddle_41 = fluid.layers.pad2d( - x2paddle_40, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_41') - x2paddle_42 = fluid.layers.conv2d( - x2paddle_41, - num_filters=64, - filter_size=[3, 3], - stride=[1, 1], - padding=[0, 0], - dilation=[1, 1], - groups=1, - param_attr='x2paddle_13', - name='x2paddle_42', - bias_attr='x2paddle_14') - x2paddle_43 = fluid.layers.relu(x2paddle_42, name='x2paddle_43') - x2paddle_45 = fluid.layers.resize_nearest(x2paddle_43, name='x2paddle_45', out_shape=[512, 512]) - x2paddle_46 = fluid.layers.pad2d( - x2paddle_45, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_46') - x2paddle_47 = fluid.layers.conv2d( - x2paddle_46, - num_filters=64, - filter_size=[3, 3], - stride=[1, 1], - padding=[0, 0], - dilation=[1, 1], - groups=1, - param_attr='x2paddle_15', - name='x2paddle_47', - bias_attr='x2paddle_16') - x2paddle_48 = fluid.layers.relu(x2paddle_47, name='x2paddle_48') - x2paddle_49 = fluid.layers.pad2d( - x2paddle_48, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_49') - x2paddle_50 = fluid.layers.conv2d( - x2paddle_49, - num_filters=3, - filter_size=[3, 3], - stride=[1, 1], - padding=[0, 0], - dilation=[1, 1], - groups=1, - param_attr='x2paddle_17', - name='x2paddle_50', - bias_attr='x2paddle_18') - return x2paddle_input_1, x2paddle_50 diff --git a/modules/image/style_transfer/stylepro_artistic/encoder_network.py b/modules/image/style_transfer/stylepro_artistic/encoder_network.py deleted file mode 100644 index 0bff785c65e933669fbd790565156da3edaed33d..0000000000000000000000000000000000000000 --- a/modules/image/style_transfer/stylepro_artistic/encoder_network.py +++ /dev/null @@ -1,173 +0,0 @@ -# coding=utf-8 -from paddle.fluid.initializer import Constant -from paddle.fluid.param_attr import ParamAttr -import paddle.fluid as fluid - - -def encoder_net(): - x2paddle_0 = fluid.layers.data(dtype='float32', shape=[1, 3, 512, 512], name='x2paddle_0', append_batch_size=False) - x2paddle_21 = fluid.layers.conv2d( - x2paddle_0, - num_filters=3, - filter_size=[1, 1], - stride=[1, 1], - padding=[0, 0], - dilation=[1, 1], - groups=1, - param_attr='x2paddle_1', - name='x2paddle_21', - bias_attr='x2paddle_2') - x2paddle_22 = fluid.layers.pad2d( - x2paddle_21, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_22') - x2paddle_23 = fluid.layers.conv2d( - x2paddle_22, - num_filters=64, - filter_size=[3, 3], - stride=[1, 1], - padding=[0, 0], - dilation=[1, 1], - groups=1, - param_attr='x2paddle_3', - name='x2paddle_23', - bias_attr='x2paddle_4') - x2paddle_24 = fluid.layers.relu(x2paddle_23, name='x2paddle_24') - x2paddle_25 = fluid.layers.pad2d( - x2paddle_24, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_25') - x2paddle_26 = fluid.layers.conv2d( - x2paddle_25, - num_filters=64, - filter_size=[3, 3], - stride=[1, 1], - padding=[0, 0], - dilation=[1, 1], - 
groups=1, - param_attr='x2paddle_5', - name='x2paddle_26', - bias_attr='x2paddle_6') - x2paddle_27 = fluid.layers.relu(x2paddle_26, name='x2paddle_27') - x2paddle_28 = fluid.layers.pool2d( - x2paddle_27, - pool_size=[2, 2], - pool_type='max', - pool_stride=[2, 2], - pool_padding=[0, 0], - ceil_mode=False, - name='x2paddle_28', - exclusive=False) - x2paddle_29 = fluid.layers.pad2d( - x2paddle_28, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_29') - x2paddle_30 = fluid.layers.conv2d( - x2paddle_29, - num_filters=128, - filter_size=[3, 3], - stride=[1, 1], - padding=[0, 0], - dilation=[1, 1], - groups=1, - param_attr='x2paddle_7', - name='x2paddle_30', - bias_attr='x2paddle_8') - x2paddle_31 = fluid.layers.relu(x2paddle_30, name='x2paddle_31') - x2paddle_32 = fluid.layers.pad2d( - x2paddle_31, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_32') - x2paddle_33 = fluid.layers.conv2d( - x2paddle_32, - num_filters=128, - filter_size=[3, 3], - stride=[1, 1], - padding=[0, 0], - dilation=[1, 1], - groups=1, - param_attr='x2paddle_9', - name='x2paddle_33', - bias_attr='x2paddle_10') - x2paddle_34 = fluid.layers.relu(x2paddle_33, name='x2paddle_34') - x2paddle_35 = fluid.layers.pool2d( - x2paddle_34, - pool_size=[2, 2], - pool_type='max', - pool_stride=[2, 2], - pool_padding=[0, 0], - ceil_mode=False, - name='x2paddle_35', - exclusive=False) - x2paddle_36 = fluid.layers.pad2d( - x2paddle_35, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_36') - x2paddle_37 = fluid.layers.conv2d( - x2paddle_36, - num_filters=256, - filter_size=[3, 3], - stride=[1, 1], - padding=[0, 0], - dilation=[1, 1], - groups=1, - param_attr='x2paddle_11', - name='x2paddle_37', - bias_attr='x2paddle_12') - x2paddle_38 = fluid.layers.relu(x2paddle_37, name='x2paddle_38') - x2paddle_39 = fluid.layers.pad2d( - x2paddle_38, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_39') - x2paddle_40 = fluid.layers.conv2d( - x2paddle_39, - num_filters=256, - filter_size=[3, 3], - stride=[1, 1], - padding=[0, 0], - dilation=[1, 1], - groups=1, - param_attr='x2paddle_13', - name='x2paddle_40', - bias_attr='x2paddle_14') - x2paddle_41 = fluid.layers.relu(x2paddle_40, name='x2paddle_41') - x2paddle_42 = fluid.layers.pad2d( - x2paddle_41, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_42') - x2paddle_43 = fluid.layers.conv2d( - x2paddle_42, - num_filters=256, - filter_size=[3, 3], - stride=[1, 1], - padding=[0, 0], - dilation=[1, 1], - groups=1, - param_attr='x2paddle_15', - name='x2paddle_43', - bias_attr='x2paddle_16') - x2paddle_44 = fluid.layers.relu(x2paddle_43, name='x2paddle_44') - x2paddle_45 = fluid.layers.pad2d( - x2paddle_44, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_45') - x2paddle_46 = fluid.layers.conv2d( - x2paddle_45, - num_filters=256, - filter_size=[3, 3], - stride=[1, 1], - padding=[0, 0], - dilation=[1, 1], - groups=1, - param_attr='x2paddle_17', - name='x2paddle_46', - bias_attr='x2paddle_18') - x2paddle_47 = fluid.layers.relu(x2paddle_46, name='x2paddle_47') - x2paddle_48 = fluid.layers.pool2d( - x2paddle_47, - pool_size=[2, 2], - pool_type='max', - pool_stride=[2, 2], - pool_padding=[0, 0], - ceil_mode=False, - name='x2paddle_48', - exclusive=False) - x2paddle_49 = fluid.layers.pad2d( - x2paddle_48, pad_value=0.0, mode='reflect', paddings=[1, 1, 1, 1], name='x2paddle_49') - x2paddle_50 = fluid.layers.conv2d( - x2paddle_49, - num_filters=512, - filter_size=[3, 3], - stride=[1, 1], - padding=[0, 
0], - dilation=[1, 1], - groups=1, - param_attr='x2paddle_19', - name='x2paddle_50', - bias_attr='x2paddle_20') - x2paddle_51 = fluid.layers.relu(x2paddle_50, name='x2paddle_51') - return x2paddle_0, x2paddle_51 diff --git a/modules/image/text_recognition/README.md b/modules/image/text_recognition/README.md new file mode 100644 index 0000000000000000000000000000000000000000..6758d009054d3e5effcc9d2b4345e7dbbc5de256 --- /dev/null +++ b/modules/image/text_recognition/README.md @@ -0,0 +1,15 @@ +## **更好用户体验,建议参考WEB端官方文档 -> [【文字识别】](https://www.paddlepaddle.org.cn/hublist)** + +### 文字识别 +文字识别(OCR)是计算机视觉重要任务之一,主要用于图像中文本信息的提取,具有重要的产业实践意义。 + +- 推荐模型 + +| 模型名称 | 模型简介 | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| [超轻量-中英文OCR文字识别](https://www.paddlepaddle.org.cn/hubdetail?name=chinese_ocr_db_crnn_mobile&en_category=TextRecognition) | 业界开源最小,8.1M超轻量中英文识别模型。支持中英文识别;支持倾斜、竖排等多种方向文字识别,**强力推荐** | +| [高精度-中英文OCR文字识别](https://www.paddlepaddle.org.cn/hubdetail?name=chinese_ocr_db_crnn_server&en_category=TextRecognition) | 业界开源效果最好,155M高精度中英文识别模型。支持中英文识别;支持倾斜、竖排等多种方向文字识别,**强力推荐** | +| [德语-超轻量OCR文字识别](https://www.paddlepaddle.org.cn/hubdetail?name=german_ocr_db_crnn_mobile&en_category=TextRecognition) | 德语OCR识别,超轻量| +| [法语-超轻量OCR文字识别](https://www.paddlepaddle.org.cn/hubdetail?name=french_ocr_db_crnn_mobile&en_category=TextRecognition) | 法语OCR识别,超轻量| +| [日语-超轻量OCR文字识别](https://www.paddlepaddle.org.cn/hubdetail?name=japan_ocr_db_crnn_mobile&en_category=TextRecognition) | 日语OCR识别,超轻量| +| [韩语-超轻量OCR文字识别](https://www.paddlepaddle.org.cn/hubdetail?name=korean_ocr_db_crnn_mobile&en_category=TextRecognition) | 韩语OCR识别,超轻量| diff --git a/modules/text/language_model/README.md b/modules/text/language_model/README.md index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..836a9d45e51f4fa460d29755fd062366b07ce774 100644 --- a/modules/text/language_model/README.md +++ b/modules/text/language_model/README.md @@ -0,0 +1,13 @@ +## **更好用户体验,建议参考WEB端官方文档 -> [【语言模型】](https://www.paddlepaddle.org.cn/hublist)** + +### 语言模型 + + +- 推荐模型 + +| 模型名称 | 模型简介 | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| [词嵌入模型](https://www.paddlepaddle.org.cn/hubdetail?name=word2vec_skipgram&en_category=SemanticModel) |在海量百度搜索数据集下预训练得到中文单词预训练词嵌入。其支持Fine-tune。Word2vec的预训练数据集的词汇表大小为1700249,word embedding维度为128。 | +| [文本相似度](https://www.paddlepaddle.org.cn/hubdetail?name=simnet_bow&en_category=SemanticModel) |根据用户输入的两个文本,计算出文本相似度得分。 | +| [ERNIE](https://www.paddlepaddle.org.cn/hubdetail?name=ERNIE&en_category=SemanticModel) |基于百科类、资讯类、论坛对话类数据等中文语料自研模型,其可用于文本分类、序列标注、阅读理解等任务。 | +
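上面新增的文字识别 README 中的模型同样支持 Python API 调用。下面是一个最小示意(假设已执行 `hub install chinese_ocr_db_crnn_mobile`,示例图片路径为占位,`recognize_text` 的参数与返回格式以该模型文档为准):

```python
# 用法示意:调用超轻量中英文 OCR 模型识别本地图片中的文字
import cv2
import paddlehub as hub

ocr = hub.Module(name="chinese_ocr_db_crnn_mobile")
# images 接收 BGR ndarray 列表;doc_demo.jpg 为占位路径,请替换为本地图片
results = ocr.recognize_text(images=[cv2.imread("doc_demo.jpg")])
print(results)  # 每张图片返回文本框坐标与识别出的文本
```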
diff --git a/modules/text/semantic_model/bert_cased_L_12_H_768_A_12/README.md b/modules/text/language_model/bert_cased_L_12_H_768_A_12/README.md similarity index 100% rename from modules/text/semantic_model/bert_cased_L_12_H_768_A_12/README.md rename to modules/text/language_model/bert_cased_L_12_H_768_A_12/README.md diff --git a/modules/text/semantic_model/bert_cased_L_12_H_768_A_12/__init__.py b/modules/text/language_model/bert_cased_L_12_H_768_A_12/__init__.py similarity index 100% rename from modules/text/semantic_model/bert_cased_L_12_H_768_A_12/__init__.py rename to modules/text/language_model/bert_cased_L_12_H_768_A_12/__init__.py diff --git a/modules/text/semantic_model/bert_cased_L_12_H_768_A_12/model/__init__.py b/modules/text/language_model/bert_cased_L_12_H_768_A_12/model/__init__.py similarity index 100% rename from modules/text/semantic_model/bert_cased_L_12_H_768_A_12/model/__init__.py rename to modules/text/language_model/bert_cased_L_12_H_768_A_12/model/__init__.py diff --git a/modules/text/semantic_model/bert_cased_L_12_H_768_A_12/model/bert.py b/modules/text/language_model/bert_cased_L_12_H_768_A_12/model/bert.py similarity index 50% rename from modules/text/semantic_model/bert_cased_L_12_H_768_A_12/model/bert.py rename to modules/text/language_model/bert_cased_L_12_H_768_A_12/model/bert.py index c42acad03751e2c1adc71faede6db393a2daca8a..8d8719e2b5eb1ed669286db15aed87d1567fd67f 100644 --- a/modules/text/semantic_model/bert_cased_L_12_H_768_A_12/model/bert.py +++ b/modules/text/language_model/bert_cased_L_12_H_768_A_12/model/bert.py @@ -74,23 +74,23 @@ class BertModel(object): def _build_model(self, src_ids, position_ids, sentence_ids, input_mask): # padding id in vocabulary must be set to 0 - emb_out = fluid.layers.embedding( - input=src_ids, - size=[self._voc_size, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), - is_sparse=False) - position_emb_out = fluid.layers.embedding( - input=position_ids, - size=[self._max_position_seq_len, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) - - sent_emb_out = fluid.layers.embedding( - sentence_ids, - size=[self._sent_types, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) + emb_out = fluid.layers.embedding(input=src_ids, + size=[self._voc_size, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._word_emb_name, + initializer=self._param_initializer), + is_sparse=False) + position_emb_out = fluid.layers.embedding(input=position_ids, + size=[self._max_position_seq_len, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._pos_emb_name, + initializer=self._param_initializer)) + + sent_emb_out = fluid.layers.embedding(sentence_ids, + size=[self._sent_types, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._sent_emb_name, + initializer=self._param_initializer)) emb_out = emb_out + position_emb_out emb_out = emb_out + sent_emb_out @@ -105,23 +105,22 @@ class BertModel(object): n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask.stop_gradient = True - self._enc_out = encoder( - enc_input=emb_out, - attn_bias=n_head_self_attn_mask, - n_layer=self._n_layer, - n_head=self._n_head, - d_key=self._emb_size // self._n_head, - d_value=self._emb_size // self._n_head, - 
d_model=self._emb_size, - d_inner_hid=self._emb_size * 4, - prepostprocess_dropout=self._prepostprocess_dropout, - attention_dropout=self._attention_dropout, - relu_dropout=0, - hidden_act=self._hidden_act, - preprocess_cmd="", - postprocess_cmd="dan", - param_initializer=self._param_initializer, - name='encoder') + self._enc_out = encoder(enc_input=emb_out, + attn_bias=n_head_self_attn_mask, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._emb_size // self._n_head, + d_value=self._emb_size // self._n_head, + d_model=self._emb_size, + d_inner_hid=self._emb_size * 4, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + hidden_act=self._hidden_act, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer, + name='encoder') def get_sequence_output(self): return self._enc_out @@ -130,12 +129,12 @@ class BertModel(object): """Get the first feature of each sequence for classification""" next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) - next_sent_feat = fluid.layers.fc( - input=next_sent_feat, - size=self._emb_size, - act="tanh", - param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), - bias_attr="pooled_fc.b_0") + next_sent_feat = fluid.layers.fc(input=next_sent_feat, + size=self._emb_size, + act="tanh", + param_attr=fluid.ParamAttr(name="pooled_fc.w_0", + initializer=self._param_initializer), + bias_attr="pooled_fc.b_0") return next_sent_feat def get_pretraining_output(self, mask_label, mask_pos, labels): @@ -150,43 +149,45 @@ class BertModel(object): mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) # transform: fc - mask_trans_feat = fluid.layers.fc( - input=mask_feat, - size=self._emb_size, - act=self._hidden_act, - param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer), - bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) + mask_trans_feat = fluid.layers.fc(input=mask_feat, + size=self._emb_size, + act=self._hidden_act, + param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', + initializer=self._param_initializer), + bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) # transform: layer norm mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans') - mask_lm_out_bias_attr = fluid.ParamAttr( - name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0)) + mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0", + initializer=fluid.initializer.Constant(value=0.0)) if self._weight_sharing: - fc_out = fluid.layers.matmul( - x=mask_trans_feat, - y=fluid.default_main_program().global_block().var(self._word_emb_name), - transpose_y=True) - fc_out += fluid.layers.create_parameter( - shape=[self._voc_size], dtype=self._dtype, attr=mask_lm_out_bias_attr, is_bias=True) + fc_out = fluid.layers.matmul(x=mask_trans_feat, + y=fluid.default_main_program().global_block().var(self._word_emb_name), + transpose_y=True) + fc_out += fluid.layers.create_parameter(shape=[self._voc_size], + dtype=self._dtype, + attr=mask_lm_out_bias_attr, + is_bias=True) else: - fc_out = fluid.layers.fc( - input=mask_trans_feat, - size=self._voc_size, - param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), - bias_attr=mask_lm_out_bias_attr) + fc_out = fluid.layers.fc(input=mask_trans_feat, + size=self._voc_size, + param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", + 
initializer=self._param_initializer), + bias_attr=mask_lm_out_bias_attr) mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) - next_sent_fc_out = fluid.layers.fc( - input=next_sent_feat, - size=2, - param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", initializer=self._param_initializer), - bias_attr="next_sent_fc.b_0") + next_sent_fc_out = fluid.layers.fc(input=next_sent_feat, + size=2, + param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", + initializer=self._param_initializer), + bias_attr="next_sent_fc.b_0") - next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy( - logits=next_sent_fc_out, label=labels, return_softmax=True) + next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out, + label=labels, + return_softmax=True) next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels) diff --git a/modules/text/semantic_model/bert_chinese_L_12_H_768_A_12/model/transformer_encoder.py b/modules/text/language_model/bert_cased_L_12_H_768_A_12/model/transformer_encoder.py similarity index 58% rename from modules/text/semantic_model/bert_chinese_L_12_H_768_A_12/model/transformer_encoder.py rename to modules/text/language_model/bert_cased_L_12_H_768_A_12/model/transformer_encoder.py index 53051cde80308a17a30d9b92de11c712b63da406..b15d838883fdad1e3432e4ea8715f2320d67929f 100644 --- a/modules/text/semantic_model/bert_chinese_L_12_H_768_A_12/model/transformer_encoder.py +++ b/modules/text/language_model/bert_cased_L_12_H_768_A_12/model/transformer_encoder.py @@ -50,24 +50,21 @@ def multi_head_attention(queries, """ Add linear projection to queries, keys, and values. """ - q = layers.fc( - input=queries, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), - bias_attr=name + '_query_fc.b_0') - k = layers.fc( - input=keys, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), - bias_attr=name + '_key_fc.b_0') - v = layers.fc( - input=values, - size=d_value * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), - bias_attr=name + '_value_fc.b_0') + q = layers.fc(input=queries, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), + bias_attr=name + '_query_fc.b_0') + k = layers.fc(input=keys, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), + bias_attr=name + '_key_fc.b_0') + v = layers.fc(input=values, + size=d_value * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), + bias_attr=name + '_value_fc.b_0') return q, k, v def __split_heads(x, n_head): @@ -110,8 +107,10 @@ def multi_head_attention(queries, product += attn_bias weights = layers.softmax(product) if dropout_rate: - weights = layers.dropout( - weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) + weights = layers.dropout(weights, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) out = layers.matmul(weights, v) return out @@ -133,12 +132,11 @@ def multi_head_attention(queries, out = __combine_heads(ctx_multiheads) # Project back to the model 
size. - proj_out = layers.fc( - input=out, - size=d_model, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), - bias_attr=name + '_output_fc.b_0') + proj_out = layers.fc(input=out, + size=d_model, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), + bias_attr=name + '_output_fc.b_0') return proj_out @@ -148,22 +146,22 @@ def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, p This module consists of two linear transformations with a ReLU activation in between, which is applied to each position separately and identically. """ - hidden = layers.fc( - input=x, - size=d_inner_hid, - num_flatten_dims=2, - act=hidden_act, - param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), - bias_attr=name + '_fc_0.b_0') + hidden = layers.fc(input=x, + size=d_inner_hid, + num_flatten_dims=2, + act=hidden_act, + param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), + bias_attr=name + '_fc_0.b_0') if dropout_rate: - hidden = layers.dropout( - hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.fc( - input=hidden, - size=d_hid, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), - bias_attr=name + '_fc_1.b_0') + hidden = layers.dropout(hidden, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.fc(input=hidden, + size=d_hid, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), + bias_attr=name + '_fc_1.b_0') return out @@ -181,17 +179,20 @@ def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name='') out_dtype = out.dtype if out_dtype == fluid.core.VarDesc.VarType.FP16: out = layers.cast(x=out, dtype="float32") - out = layers.layer_norm( - out, - begin_norm_axis=len(out.shape) - 1, - param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), - bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.))) + out = layers.layer_norm(out, + begin_norm_axis=len(out.shape) - 1, + param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', + initializer=fluid.initializer.Constant(0.))) if out_dtype == fluid.core.VarDesc.VarType.FP16: out = layers.cast(x=out, dtype="float16") elif cmd == "d": # add dropout if dropout_rate: - out = layers.dropout( - out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) + out = layers.dropout(out, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) return out @@ -220,28 +221,35 @@ def encoder_layer(enc_input, with the post_process_layer to add residual connection, layer normalization and droput. 
""" - attn_output = multi_head_attention( - pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'), - None, - None, - attn_bias, - d_key, - d_value, - d_model, - n_head, - attention_dropout, - param_initializer=param_initializer, - name=name + '_multi_head_att') - attn_output = post_process_layer( - enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att') - ffd_output = positionwise_feed_forward( - pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'), - d_inner_hid, - d_model, - relu_dropout, - hidden_act, - param_initializer=param_initializer, - name=name + '_ffn') + attn_output = multi_head_attention(pre_process_layer(enc_input, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_att'), + None, + None, + attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + param_initializer=param_initializer, + name=name + '_multi_head_att') + attn_output = post_process_layer(enc_input, + attn_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_att') + ffd_output = positionwise_feed_forward(pre_process_layer(attn_output, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_ffn'), + d_inner_hid, + d_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + '_ffn') return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') @@ -266,22 +274,21 @@ def encoder(enc_input, encoder_layer. """ for i in range(n_layer): - enc_output = encoder_layer( - enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd, - postprocess_cmd, - param_initializer=param_initializer, - name=name + '_layer_' + str(i)) + enc_output = encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_layer_' + str(i)) enc_input = enc_output enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") diff --git a/modules/text/semantic_model/bert_cased_L_12_H_768_A_12/module.py b/modules/text/language_model/bert_cased_L_12_H_768_A_12/module.py similarity index 89% rename from modules/text/semantic_model/bert_cased_L_12_H_768_A_12/module.py rename to modules/text/language_model/bert_cased_L_12_H_768_A_12/module.py index d5ec9e197980778d0165904f98a7b139d0015ccc..639bae52f754150b7b6097ed90e4fb0b1c946b7a 100644 --- a/modules/text/semantic_model/bert_cased_L_12_H_768_A_12/module.py +++ b/modules/text/language_model/bert_cased_L_12_H_768_A_12/module.py @@ -58,13 +58,12 @@ class Bert(TransformerModule): pooled_output (tensor): sentence-level output for classification task. sequence_output (tensor): token-level output for sequence task. 
""" - bert = BertModel( - src_ids=input_ids, - position_ids=position_ids, - sentence_ids=segment_ids, - input_mask=input_mask, - config=self.bert_config, - use_fp16=False) + bert = BertModel(src_ids=input_ids, + position_ids=position_ids, + sentence_ids=segment_ids, + input_mask=input_mask, + config=self.bert_config, + use_fp16=False) pooled_output = bert.get_pooled_output() sequence_output = bert.get_sequence_output() return pooled_output, sequence_output diff --git a/modules/text/semantic_model/bert_cased_L_24_H_1024_A_16/README.md b/modules/text/language_model/bert_cased_L_24_H_1024_A_16/README.md similarity index 100% rename from modules/text/semantic_model/bert_cased_L_24_H_1024_A_16/README.md rename to modules/text/language_model/bert_cased_L_24_H_1024_A_16/README.md diff --git a/modules/text/semantic_model/bert_cased_L_24_H_1024_A_16/__init__.py b/modules/text/language_model/bert_cased_L_24_H_1024_A_16/__init__.py similarity index 100% rename from modules/text/semantic_model/bert_cased_L_24_H_1024_A_16/__init__.py rename to modules/text/language_model/bert_cased_L_24_H_1024_A_16/__init__.py diff --git a/modules/text/semantic_model/bert_cased_L_24_H_1024_A_16/model/__init__.py b/modules/text/language_model/bert_cased_L_24_H_1024_A_16/model/__init__.py similarity index 100% rename from modules/text/semantic_model/bert_cased_L_24_H_1024_A_16/model/__init__.py rename to modules/text/language_model/bert_cased_L_24_H_1024_A_16/model/__init__.py diff --git a/modules/text/semantic_model/bert_cased_L_24_H_1024_A_16/model/bert.py b/modules/text/language_model/bert_cased_L_24_H_1024_A_16/model/bert.py similarity index 50% rename from modules/text/semantic_model/bert_cased_L_24_H_1024_A_16/model/bert.py rename to modules/text/language_model/bert_cased_L_24_H_1024_A_16/model/bert.py index d62917a76488580560bf3f5e23e5e8b64b6c652c..fc465051e8be42c07f1d484ec499ccfe064016bc 100644 --- a/modules/text/semantic_model/bert_cased_L_24_H_1024_A_16/model/bert.py +++ b/modules/text/language_model/bert_cased_L_24_H_1024_A_16/model/bert.py @@ -74,23 +74,23 @@ class BertModel(object): def _build_model(self, src_ids, position_ids, sentence_ids, input_mask): # padding id in vocabulary must be set to 0 - emb_out = fluid.layers.embedding( - input=src_ids, - size=[self._voc_size, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), - is_sparse=False) - position_emb_out = fluid.layers.embedding( - input=position_ids, - size=[self._max_position_seq_len, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) - - sent_emb_out = fluid.layers.embedding( - sentence_ids, - size=[self._sent_types, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) + emb_out = fluid.layers.embedding(input=src_ids, + size=[self._voc_size, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._word_emb_name, + initializer=self._param_initializer), + is_sparse=False) + position_emb_out = fluid.layers.embedding(input=position_ids, + size=[self._max_position_seq_len, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._pos_emb_name, + initializer=self._param_initializer)) + + sent_emb_out = fluid.layers.embedding(sentence_ids, + size=[self._sent_types, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._sent_emb_name, + 
initializer=self._param_initializer)) emb_out = emb_out + position_emb_out emb_out = emb_out + sent_emb_out @@ -105,23 +105,22 @@ class BertModel(object): n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask.stop_gradient = True - self._enc_out = encoder( - enc_input=emb_out, - attn_bias=n_head_self_attn_mask, - n_layer=self._n_layer, - n_head=self._n_head, - d_key=self._emb_size // self._n_head, - d_value=self._emb_size // self._n_head, - d_model=self._emb_size, - d_inner_hid=self._emb_size * 4, - prepostprocess_dropout=self._prepostprocess_dropout, - attention_dropout=self._attention_dropout, - relu_dropout=0, - hidden_act=self._hidden_act, - preprocess_cmd="", - postprocess_cmd="dan", - param_initializer=self._param_initializer, - name='encoder') + self._enc_out = encoder(enc_input=emb_out, + attn_bias=n_head_self_attn_mask, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._emb_size // self._n_head, + d_value=self._emb_size // self._n_head, + d_model=self._emb_size, + d_inner_hid=self._emb_size * 4, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + hidden_act=self._hidden_act, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer, + name='encoder') def get_sequence_output(self): return self._enc_out @@ -130,12 +129,12 @@ class BertModel(object): """Get the first feature of each sequence for classification""" next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) - next_sent_feat = fluid.layers.fc( - input=next_sent_feat, - size=self._emb_size, - act="tanh", - param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), - bias_attr="pooled_fc.b_0") + next_sent_feat = fluid.layers.fc(input=next_sent_feat, + size=self._emb_size, + act="tanh", + param_attr=fluid.ParamAttr(name="pooled_fc.w_0", + initializer=self._param_initializer), + bias_attr="pooled_fc.b_0") return next_sent_feat def get_pretraining_output(self, mask_label, mask_pos, labels): @@ -150,43 +149,45 @@ class BertModel(object): mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) # transform: fc - mask_trans_feat = fluid.layers.fc( - input=mask_feat, - size=self._emb_size, - act=self._hidden_act, - param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer), - bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) + mask_trans_feat = fluid.layers.fc(input=mask_feat, + size=self._emb_size, + act=self._hidden_act, + param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', + initializer=self._param_initializer), + bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) # transform: layer norm mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans') - mask_lm_out_bias_attr = fluid.ParamAttr( - name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0)) + mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0", + initializer=fluid.initializer.Constant(value=0.0)) if self._weight_sharing: - fc_out = fluid.layers.matmul( - x=mask_trans_feat, - y=fluid.default_main_program().global_block().var(self._word_emb_name), - transpose_y=True) - fc_out += fluid.layers.create_parameter( - shape=[self._voc_size], dtype=self._dtype, attr=mask_lm_out_bias_attr, is_bias=True) + fc_out = fluid.layers.matmul(x=mask_trans_feat, + y=fluid.default_main_program().global_block().var(self._word_emb_name), + transpose_y=True) + 
fc_out += fluid.layers.create_parameter(shape=[self._voc_size], + dtype=self._dtype, + attr=mask_lm_out_bias_attr, + is_bias=True) else: - fc_out = fluid.layers.fc( - input=mask_trans_feat, - size=self._voc_size, - param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), - bias_attr=mask_lm_out_bias_attr) + fc_out = fluid.layers.fc(input=mask_trans_feat, + size=self._voc_size, + param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", + initializer=self._param_initializer), + bias_attr=mask_lm_out_bias_attr) mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) - next_sent_fc_out = fluid.layers.fc( - input=next_sent_feat, - size=2, - param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", initializer=self._param_initializer), - bias_attr="next_sent_fc.b_0") + next_sent_fc_out = fluid.layers.fc(input=next_sent_feat, + size=2, + param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", + initializer=self._param_initializer), + bias_attr="next_sent_fc.b_0") - next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy( - logits=next_sent_fc_out, label=labels, return_softmax=True) + next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out, + label=labels, + return_softmax=True) next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels) diff --git a/modules/text/semantic_model/bert_cased_L_24_H_1024_A_16/model/transformer_encoder.py b/modules/text/language_model/bert_cased_L_24_H_1024_A_16/model/transformer_encoder.py similarity index 58% rename from modules/text/semantic_model/bert_cased_L_24_H_1024_A_16/model/transformer_encoder.py rename to modules/text/language_model/bert_cased_L_24_H_1024_A_16/model/transformer_encoder.py index 53051cde80308a17a30d9b92de11c712b63da406..b15d838883fdad1e3432e4ea8715f2320d67929f 100644 --- a/modules/text/semantic_model/bert_cased_L_24_H_1024_A_16/model/transformer_encoder.py +++ b/modules/text/language_model/bert_cased_L_24_H_1024_A_16/model/transformer_encoder.py @@ -50,24 +50,21 @@ def multi_head_attention(queries, """ Add linear projection to queries, keys, and values. 
""" - q = layers.fc( - input=queries, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), - bias_attr=name + '_query_fc.b_0') - k = layers.fc( - input=keys, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), - bias_attr=name + '_key_fc.b_0') - v = layers.fc( - input=values, - size=d_value * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), - bias_attr=name + '_value_fc.b_0') + q = layers.fc(input=queries, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), + bias_attr=name + '_query_fc.b_0') + k = layers.fc(input=keys, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), + bias_attr=name + '_key_fc.b_0') + v = layers.fc(input=values, + size=d_value * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), + bias_attr=name + '_value_fc.b_0') return q, k, v def __split_heads(x, n_head): @@ -110,8 +107,10 @@ def multi_head_attention(queries, product += attn_bias weights = layers.softmax(product) if dropout_rate: - weights = layers.dropout( - weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) + weights = layers.dropout(weights, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) out = layers.matmul(weights, v) return out @@ -133,12 +132,11 @@ def multi_head_attention(queries, out = __combine_heads(ctx_multiheads) # Project back to the model size. - proj_out = layers.fc( - input=out, - size=d_model, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), - bias_attr=name + '_output_fc.b_0') + proj_out = layers.fc(input=out, + size=d_model, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), + bias_attr=name + '_output_fc.b_0') return proj_out @@ -148,22 +146,22 @@ def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, p This module consists of two linear transformations with a ReLU activation in between, which is applied to each position separately and identically. 
""" - hidden = layers.fc( - input=x, - size=d_inner_hid, - num_flatten_dims=2, - act=hidden_act, - param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), - bias_attr=name + '_fc_0.b_0') + hidden = layers.fc(input=x, + size=d_inner_hid, + num_flatten_dims=2, + act=hidden_act, + param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), + bias_attr=name + '_fc_0.b_0') if dropout_rate: - hidden = layers.dropout( - hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.fc( - input=hidden, - size=d_hid, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), - bias_attr=name + '_fc_1.b_0') + hidden = layers.dropout(hidden, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.fc(input=hidden, + size=d_hid, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), + bias_attr=name + '_fc_1.b_0') return out @@ -181,17 +179,20 @@ def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name='') out_dtype = out.dtype if out_dtype == fluid.core.VarDesc.VarType.FP16: out = layers.cast(x=out, dtype="float32") - out = layers.layer_norm( - out, - begin_norm_axis=len(out.shape) - 1, - param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), - bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.))) + out = layers.layer_norm(out, + begin_norm_axis=len(out.shape) - 1, + param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', + initializer=fluid.initializer.Constant(0.))) if out_dtype == fluid.core.VarDesc.VarType.FP16: out = layers.cast(x=out, dtype="float16") elif cmd == "d": # add dropout if dropout_rate: - out = layers.dropout( - out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) + out = layers.dropout(out, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) return out @@ -220,28 +221,35 @@ def encoder_layer(enc_input, with the post_process_layer to add residual connection, layer normalization and droput. 
""" - attn_output = multi_head_attention( - pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'), - None, - None, - attn_bias, - d_key, - d_value, - d_model, - n_head, - attention_dropout, - param_initializer=param_initializer, - name=name + '_multi_head_att') - attn_output = post_process_layer( - enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att') - ffd_output = positionwise_feed_forward( - pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'), - d_inner_hid, - d_model, - relu_dropout, - hidden_act, - param_initializer=param_initializer, - name=name + '_ffn') + attn_output = multi_head_attention(pre_process_layer(enc_input, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_att'), + None, + None, + attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + param_initializer=param_initializer, + name=name + '_multi_head_att') + attn_output = post_process_layer(enc_input, + attn_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_att') + ffd_output = positionwise_feed_forward(pre_process_layer(attn_output, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_ffn'), + d_inner_hid, + d_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + '_ffn') return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') @@ -266,22 +274,21 @@ def encoder(enc_input, encoder_layer. """ for i in range(n_layer): - enc_output = encoder_layer( - enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd, - postprocess_cmd, - param_initializer=param_initializer, - name=name + '_layer_' + str(i)) + enc_output = encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_layer_' + str(i)) enc_input = enc_output enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") diff --git a/modules/text/semantic_model/bert_cased_L_24_H_1024_A_16/module.py b/modules/text/language_model/bert_cased_L_24_H_1024_A_16/module.py similarity index 89% rename from modules/text/semantic_model/bert_cased_L_24_H_1024_A_16/module.py rename to modules/text/language_model/bert_cased_L_24_H_1024_A_16/module.py index 861a2c0a0a5ffcc007060bfb0cb04e98173c7091..b8ccbb87531f9fcf55d45f0464314adf0e461faa 100644 --- a/modules/text/semantic_model/bert_cased_L_24_H_1024_A_16/module.py +++ b/modules/text/language_model/bert_cased_L_24_H_1024_A_16/module.py @@ -58,13 +58,12 @@ class Bert(TransformerModule): pooled_output (tensor): sentence-level output for classification task. sequence_output (tensor): token-level output for sequence task. 
""" - bert = BertModel( - src_ids=input_ids, - position_ids=position_ids, - sentence_ids=segment_ids, - input_mask=input_mask, - config=self.bert_config, - use_fp16=False) + bert = BertModel(src_ids=input_ids, + position_ids=position_ids, + sentence_ids=segment_ids, + input_mask=input_mask, + config=self.bert_config, + use_fp16=False) pooled_output = bert.get_pooled_output() sequence_output = bert.get_sequence_output() return pooled_output, sequence_output diff --git a/modules/text/semantic_model/bert_chinese_L_12_H_768_A_12/README.md b/modules/text/language_model/bert_chinese_L_12_H_768_A_12/README.md similarity index 100% rename from modules/text/semantic_model/bert_chinese_L_12_H_768_A_12/README.md rename to modules/text/language_model/bert_chinese_L_12_H_768_A_12/README.md diff --git a/modules/text/semantic_model/bert_chinese_L_12_H_768_A_12/__init__.py b/modules/text/language_model/bert_chinese_L_12_H_768_A_12/__init__.py similarity index 100% rename from modules/text/semantic_model/bert_chinese_L_12_H_768_A_12/__init__.py rename to modules/text/language_model/bert_chinese_L_12_H_768_A_12/__init__.py diff --git a/modules/text/semantic_model/bert_chinese_L_12_H_768_A_12/model/__init__.py b/modules/text/language_model/bert_chinese_L_12_H_768_A_12/model/__init__.py similarity index 100% rename from modules/text/semantic_model/bert_chinese_L_12_H_768_A_12/model/__init__.py rename to modules/text/language_model/bert_chinese_L_12_H_768_A_12/model/__init__.py diff --git a/modules/text/semantic_model/bert_chinese_L_12_H_768_A_12/model/bert.py b/modules/text/language_model/bert_chinese_L_12_H_768_A_12/model/bert.py similarity index 50% rename from modules/text/semantic_model/bert_chinese_L_12_H_768_A_12/model/bert.py rename to modules/text/language_model/bert_chinese_L_12_H_768_A_12/model/bert.py index 023d67717c5b56367eb7bed0046b864f35b5a1da..88ca1a79dc7d41fab3814cd36543d444db62c77f 100644 --- a/modules/text/semantic_model/bert_chinese_L_12_H_768_A_12/model/bert.py +++ b/modules/text/language_model/bert_chinese_L_12_H_768_A_12/model/bert.py @@ -74,23 +74,23 @@ class BertModel(object): def _build_model(self, src_ids, position_ids, sentence_ids, input_mask): # padding id in vocabulary must be set to 0 - emb_out = fluid.layers.embedding( - input=src_ids, - size=[self._voc_size, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), - is_sparse=False) - position_emb_out = fluid.layers.embedding( - input=position_ids, - size=[self._max_position_seq_len, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) - - sent_emb_out = fluid.layers.embedding( - sentence_ids, - size=[self._sent_types, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) + emb_out = fluid.layers.embedding(input=src_ids, + size=[self._voc_size, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._word_emb_name, + initializer=self._param_initializer), + is_sparse=False) + position_emb_out = fluid.layers.embedding(input=position_ids, + size=[self._max_position_seq_len, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._pos_emb_name, + initializer=self._param_initializer)) + + sent_emb_out = fluid.layers.embedding(sentence_ids, + size=[self._sent_types, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._sent_emb_name, + 
initializer=self._param_initializer)) emb_out = emb_out + position_emb_out emb_out = emb_out + sent_emb_out @@ -105,23 +105,22 @@ class BertModel(object): n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask.stop_gradient = True - self._enc_out = encoder( - enc_input=emb_out, - attn_bias=n_head_self_attn_mask, - n_layer=self._n_layer, - n_head=self._n_head, - d_key=self._emb_size // self._n_head, - d_value=self._emb_size // self._n_head, - d_model=self._emb_size, - d_inner_hid=self._emb_size * 4, - prepostprocess_dropout=self._prepostprocess_dropout, - attention_dropout=self._attention_dropout, - relu_dropout=0, - hidden_act=self._hidden_act, - preprocess_cmd="", - postprocess_cmd="dan", - param_initializer=self._param_initializer, - name='encoder') + self._enc_out = encoder(enc_input=emb_out, + attn_bias=n_head_self_attn_mask, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._emb_size // self._n_head, + d_value=self._emb_size // self._n_head, + d_model=self._emb_size, + d_inner_hid=self._emb_size * 4, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + hidden_act=self._hidden_act, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer, + name='encoder') def get_sequence_output(self): return self._enc_out @@ -130,12 +129,12 @@ class BertModel(object): """Get the first feature of each sequence for classification""" next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) - next_sent_feat = fluid.layers.fc( - input=next_sent_feat, - size=self._emb_size, - act="tanh", - param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), - bias_attr="pooled_fc.b_0") + next_sent_feat = fluid.layers.fc(input=next_sent_feat, + size=self._emb_size, + act="tanh", + param_attr=fluid.ParamAttr(name="pooled_fc.w_0", + initializer=self._param_initializer), + bias_attr="pooled_fc.b_0") return next_sent_feat def get_pretraining_output(self, mask_label, mask_pos, labels): @@ -150,43 +149,45 @@ class BertModel(object): mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) # transform: fc - mask_trans_feat = fluid.layers.fc( - input=mask_feat, - size=self._emb_size, - act=self._hidden_act, - param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer), - bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) + mask_trans_feat = fluid.layers.fc(input=mask_feat, + size=self._emb_size, + act=self._hidden_act, + param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', + initializer=self._param_initializer), + bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) # transform: layer norm mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans') - mask_lm_out_bias_attr = fluid.ParamAttr( - name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0)) + mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0", + initializer=fluid.initializer.Constant(value=0.0)) if self._weight_sharing: - fc_out = fluid.layers.matmul( - x=mask_trans_feat, - y=fluid.default_main_program().global_block().var(self._word_emb_name), - transpose_y=True) - fc_out += fluid.layers.create_parameter( - shape=[self._voc_size], dtype=self._dtype, attr=mask_lm_out_bias_attr, is_bias=True) + fc_out = fluid.layers.matmul(x=mask_trans_feat, + y=fluid.default_main_program().global_block().var(self._word_emb_name), + transpose_y=True) + 
fc_out += fluid.layers.create_parameter(shape=[self._voc_size], + dtype=self._dtype, + attr=mask_lm_out_bias_attr, + is_bias=True) else: - fc_out = fluid.layers.fc( - input=mask_trans_feat, - size=self._voc_size, - param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), - bias_attr=mask_lm_out_bias_attr) + fc_out = fluid.layers.fc(input=mask_trans_feat, + size=self._voc_size, + param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", + initializer=self._param_initializer), + bias_attr=mask_lm_out_bias_attr) mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) - next_sent_fc_out = fluid.layers.fc( - input=next_sent_feat, - size=2, - param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", initializer=self._param_initializer), - bias_attr="next_sent_fc.b_0") + next_sent_fc_out = fluid.layers.fc(input=next_sent_feat, + size=2, + param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", + initializer=self._param_initializer), + bias_attr="next_sent_fc.b_0") - next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy( - logits=next_sent_fc_out, label=labels, return_softmax=True) + next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out, + label=labels, + return_softmax=True) next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels) diff --git a/modules/text/semantic_model/bert_multi_cased_L_12_H_768_A_12/model/transformer_encoder.py b/modules/text/language_model/bert_chinese_L_12_H_768_A_12/model/transformer_encoder.py similarity index 58% rename from modules/text/semantic_model/bert_multi_cased_L_12_H_768_A_12/model/transformer_encoder.py rename to modules/text/language_model/bert_chinese_L_12_H_768_A_12/model/transformer_encoder.py index 53051cde80308a17a30d9b92de11c712b63da406..b15d838883fdad1e3432e4ea8715f2320d67929f 100644 --- a/modules/text/semantic_model/bert_multi_cased_L_12_H_768_A_12/model/transformer_encoder.py +++ b/modules/text/language_model/bert_chinese_L_12_H_768_A_12/model/transformer_encoder.py @@ -50,24 +50,21 @@ def multi_head_attention(queries, """ Add linear projection to queries, keys, and values. 
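> Editorial note: the projection hunks that follow only re-indent the fluid calls, so it may help to see the shape bookkeeping that `__compute_qkv` and `__split_heads` perform, stripped of the graph plumbing. This is a minimal NumPy sketch; the random weight stands in for `_query_fc.w_0` and the sizes are toy values, not taken from any module config.

```python
import numpy as np

batch, seq_len, n_head, d_key = 2, 4, 12, 64
d_model = n_head * d_key
x = np.random.rand(batch, seq_len, d_model)
w_q = np.random.rand(d_model, n_head * d_key) * 0.02  # stand-in for _query_fc.w_0

# __compute_qkv: one fc per projection, producing [batch, seq_len, n_head * d_key].
q = x @ w_q

# __split_heads: reshape the last dimension into (n_head, d_key) and move the
# head axis forward, giving [batch, n_head, seq_len, d_key].
q = q.reshape(batch, seq_len, n_head, d_key).transpose(0, 2, 1, 3)
print(q.shape)  # (2, 12, 4, 64)
```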
""" - q = layers.fc( - input=queries, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), - bias_attr=name + '_query_fc.b_0') - k = layers.fc( - input=keys, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), - bias_attr=name + '_key_fc.b_0') - v = layers.fc( - input=values, - size=d_value * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), - bias_attr=name + '_value_fc.b_0') + q = layers.fc(input=queries, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), + bias_attr=name + '_query_fc.b_0') + k = layers.fc(input=keys, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), + bias_attr=name + '_key_fc.b_0') + v = layers.fc(input=values, + size=d_value * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), + bias_attr=name + '_value_fc.b_0') return q, k, v def __split_heads(x, n_head): @@ -110,8 +107,10 @@ def multi_head_attention(queries, product += attn_bias weights = layers.softmax(product) if dropout_rate: - weights = layers.dropout( - weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) + weights = layers.dropout(weights, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) out = layers.matmul(weights, v) return out @@ -133,12 +132,11 @@ def multi_head_attention(queries, out = __combine_heads(ctx_multiheads) # Project back to the model size. - proj_out = layers.fc( - input=out, - size=d_model, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), - bias_attr=name + '_output_fc.b_0') + proj_out = layers.fc(input=out, + size=d_model, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), + bias_attr=name + '_output_fc.b_0') return proj_out @@ -148,22 +146,22 @@ def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, p This module consists of two linear transformations with a ReLU activation in between, which is applied to each position separately and identically. 
""" - hidden = layers.fc( - input=x, - size=d_inner_hid, - num_flatten_dims=2, - act=hidden_act, - param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), - bias_attr=name + '_fc_0.b_0') + hidden = layers.fc(input=x, + size=d_inner_hid, + num_flatten_dims=2, + act=hidden_act, + param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), + bias_attr=name + '_fc_0.b_0') if dropout_rate: - hidden = layers.dropout( - hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.fc( - input=hidden, - size=d_hid, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), - bias_attr=name + '_fc_1.b_0') + hidden = layers.dropout(hidden, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.fc(input=hidden, + size=d_hid, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), + bias_attr=name + '_fc_1.b_0') return out @@ -181,17 +179,20 @@ def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name='') out_dtype = out.dtype if out_dtype == fluid.core.VarDesc.VarType.FP16: out = layers.cast(x=out, dtype="float32") - out = layers.layer_norm( - out, - begin_norm_axis=len(out.shape) - 1, - param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), - bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.))) + out = layers.layer_norm(out, + begin_norm_axis=len(out.shape) - 1, + param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', + initializer=fluid.initializer.Constant(0.))) if out_dtype == fluid.core.VarDesc.VarType.FP16: out = layers.cast(x=out, dtype="float16") elif cmd == "d": # add dropout if dropout_rate: - out = layers.dropout( - out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) + out = layers.dropout(out, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) return out @@ -220,28 +221,35 @@ def encoder_layer(enc_input, with the post_process_layer to add residual connection, layer normalization and droput. 
""" - attn_output = multi_head_attention( - pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'), - None, - None, - attn_bias, - d_key, - d_value, - d_model, - n_head, - attention_dropout, - param_initializer=param_initializer, - name=name + '_multi_head_att') - attn_output = post_process_layer( - enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att') - ffd_output = positionwise_feed_forward( - pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'), - d_inner_hid, - d_model, - relu_dropout, - hidden_act, - param_initializer=param_initializer, - name=name + '_ffn') + attn_output = multi_head_attention(pre_process_layer(enc_input, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_att'), + None, + None, + attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + param_initializer=param_initializer, + name=name + '_multi_head_att') + attn_output = post_process_layer(enc_input, + attn_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_att') + ffd_output = positionwise_feed_forward(pre_process_layer(attn_output, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_ffn'), + d_inner_hid, + d_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + '_ffn') return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') @@ -266,22 +274,21 @@ def encoder(enc_input, encoder_layer. """ for i in range(n_layer): - enc_output = encoder_layer( - enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd, - postprocess_cmd, - param_initializer=param_initializer, - name=name + '_layer_' + str(i)) + enc_output = encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_layer_' + str(i)) enc_input = enc_output enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") diff --git a/modules/text/semantic_model/bert_chinese_L_12_H_768_A_12/module.py b/modules/text/language_model/bert_chinese_L_12_H_768_A_12/module.py similarity index 89% rename from modules/text/semantic_model/bert_chinese_L_12_H_768_A_12/module.py rename to modules/text/language_model/bert_chinese_L_12_H_768_A_12/module.py index 2d76cfb5d0276aa47d62f16078db1985447aab7e..05ff82b183d983dead65e4d0804621f87e3bb102 100644 --- a/modules/text/semantic_model/bert_chinese_L_12_H_768_A_12/module.py +++ b/modules/text/language_model/bert_chinese_L_12_H_768_A_12/module.py @@ -58,13 +58,12 @@ class BertChinese(TransformerModule): pooled_output (tensor): sentence-level output for classification task. sequence_output (tensor): token-level output for sequence task. 
""" - bert = BertModel( - src_ids=input_ids, - position_ids=position_ids, - sentence_ids=segment_ids, - input_mask=input_mask, - config=self.bert_config, - use_fp16=False) + bert = BertModel(src_ids=input_ids, + position_ids=position_ids, + sentence_ids=segment_ids, + input_mask=input_mask, + config=self.bert_config, + use_fp16=False) pooled_output = bert.get_pooled_output() sequence_output = bert.get_sequence_output() return pooled_output, sequence_output diff --git a/modules/text/semantic_model/bert_multi_cased_L_12_H_768_A_12/README.md b/modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/README.md similarity index 100% rename from modules/text/semantic_model/bert_multi_cased_L_12_H_768_A_12/README.md rename to modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/README.md diff --git a/modules/text/semantic_model/bert_multi_cased_L_12_H_768_A_12/__init__.py b/modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/__init__.py similarity index 100% rename from modules/text/semantic_model/bert_multi_cased_L_12_H_768_A_12/__init__.py rename to modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/__init__.py diff --git a/modules/text/semantic_model/bert_multi_cased_L_12_H_768_A_12/model/__init__.py b/modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/model/__init__.py similarity index 100% rename from modules/text/semantic_model/bert_multi_cased_L_12_H_768_A_12/model/__init__.py rename to modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/model/__init__.py diff --git a/modules/text/semantic_model/bert_multi_cased_L_12_H_768_A_12/model/bert.py b/modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/model/bert.py similarity index 50% rename from modules/text/semantic_model/bert_multi_cased_L_12_H_768_A_12/model/bert.py rename to modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/model/bert.py index 029047a29fb7b57881fce909b38d7b7158b6f336..b0c528f695dc03e4007fb803de077ab2496d1c4f 100644 --- a/modules/text/semantic_model/bert_multi_cased_L_12_H_768_A_12/model/bert.py +++ b/modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/model/bert.py @@ -74,23 +74,23 @@ class BertModel(object): def _build_model(self, src_ids, position_ids, sentence_ids, input_mask): # padding id in vocabulary must be set to 0 - emb_out = fluid.layers.embedding( - input=src_ids, - size=[self._voc_size, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), - is_sparse=False) - position_emb_out = fluid.layers.embedding( - input=position_ids, - size=[self._max_position_seq_len, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) - - sent_emb_out = fluid.layers.embedding( - sentence_ids, - size=[self._sent_types, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) + emb_out = fluid.layers.embedding(input=src_ids, + size=[self._voc_size, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._word_emb_name, + initializer=self._param_initializer), + is_sparse=False) + position_emb_out = fluid.layers.embedding(input=position_ids, + size=[self._max_position_seq_len, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._pos_emb_name, + initializer=self._param_initializer)) + + sent_emb_out = fluid.layers.embedding(sentence_ids, + size=[self._sent_types, self._emb_size], + 
dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._sent_emb_name, + initializer=self._param_initializer)) emb_out = emb_out + position_emb_out emb_out = emb_out + sent_emb_out @@ -105,23 +105,22 @@ class BertModel(object): n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask.stop_gradient = True - self._enc_out = encoder( - enc_input=emb_out, - attn_bias=n_head_self_attn_mask, - n_layer=self._n_layer, - n_head=self._n_head, - d_key=self._emb_size // self._n_head, - d_value=self._emb_size // self._n_head, - d_model=self._emb_size, - d_inner_hid=self._emb_size * 4, - prepostprocess_dropout=self._prepostprocess_dropout, - attention_dropout=self._attention_dropout, - relu_dropout=0, - hidden_act=self._hidden_act, - preprocess_cmd="", - postprocess_cmd="dan", - param_initializer=self._param_initializer, - name='encoder') + self._enc_out = encoder(enc_input=emb_out, + attn_bias=n_head_self_attn_mask, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._emb_size // self._n_head, + d_value=self._emb_size // self._n_head, + d_model=self._emb_size, + d_inner_hid=self._emb_size * 4, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + hidden_act=self._hidden_act, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer, + name='encoder') def get_sequence_output(self): return self._enc_out @@ -130,12 +129,12 @@ class BertModel(object): """Get the first feature of each sequence for classification""" next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) - next_sent_feat = fluid.layers.fc( - input=next_sent_feat, - size=self._emb_size, - act="tanh", - param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), - bias_attr="pooled_fc.b_0") + next_sent_feat = fluid.layers.fc(input=next_sent_feat, + size=self._emb_size, + act="tanh", + param_attr=fluid.ParamAttr(name="pooled_fc.w_0", + initializer=self._param_initializer), + bias_attr="pooled_fc.b_0") return next_sent_feat def get_pretraining_output(self, mask_label, mask_pos, labels): @@ -150,43 +149,45 @@ class BertModel(object): mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) # transform: fc - mask_trans_feat = fluid.layers.fc( - input=mask_feat, - size=self._emb_size, - act=self._hidden_act, - param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer), - bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) + mask_trans_feat = fluid.layers.fc(input=mask_feat, + size=self._emb_size, + act=self._hidden_act, + param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', + initializer=self._param_initializer), + bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) # transform: layer norm mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans') - mask_lm_out_bias_attr = fluid.ParamAttr( - name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0)) + mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0", + initializer=fluid.initializer.Constant(value=0.0)) if self._weight_sharing: - fc_out = fluid.layers.matmul( - x=mask_trans_feat, - y=fluid.default_main_program().global_block().var(self._word_emb_name), - transpose_y=True) - fc_out += fluid.layers.create_parameter( - shape=[self._voc_size], dtype=self._dtype, attr=mask_lm_out_bias_attr, is_bias=True) + fc_out = fluid.layers.matmul(x=mask_trans_feat, + 
y=fluid.default_main_program().global_block().var(self._word_emb_name), + transpose_y=True) + fc_out += fluid.layers.create_parameter(shape=[self._voc_size], + dtype=self._dtype, + attr=mask_lm_out_bias_attr, + is_bias=True) else: - fc_out = fluid.layers.fc( - input=mask_trans_feat, - size=self._voc_size, - param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), - bias_attr=mask_lm_out_bias_attr) + fc_out = fluid.layers.fc(input=mask_trans_feat, + size=self._voc_size, + param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", + initializer=self._param_initializer), + bias_attr=mask_lm_out_bias_attr) mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) - next_sent_fc_out = fluid.layers.fc( - input=next_sent_feat, - size=2, - param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", initializer=self._param_initializer), - bias_attr="next_sent_fc.b_0") + next_sent_fc_out = fluid.layers.fc(input=next_sent_feat, + size=2, + param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", + initializer=self._param_initializer), + bias_attr="next_sent_fc.b_0") - next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy( - logits=next_sent_fc_out, label=labels, return_softmax=True) + next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out, + label=labels, + return_softmax=True) next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels) diff --git a/modules/text/semantic_model/bert_cased_L_12_H_768_A_12/model/transformer_encoder.py b/modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/model/transformer_encoder.py similarity index 58% rename from modules/text/semantic_model/bert_cased_L_12_H_768_A_12/model/transformer_encoder.py rename to modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/model/transformer_encoder.py index 53051cde80308a17a30d9b92de11c712b63da406..b15d838883fdad1e3432e4ea8715f2320d67929f 100644 --- a/modules/text/semantic_model/bert_cased_L_12_H_768_A_12/model/transformer_encoder.py +++ b/modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/model/transformer_encoder.py @@ -50,24 +50,21 @@ def multi_head_attention(queries, """ Add linear projection to queries, keys, and values. 
""" - q = layers.fc( - input=queries, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), - bias_attr=name + '_query_fc.b_0') - k = layers.fc( - input=keys, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), - bias_attr=name + '_key_fc.b_0') - v = layers.fc( - input=values, - size=d_value * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), - bias_attr=name + '_value_fc.b_0') + q = layers.fc(input=queries, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), + bias_attr=name + '_query_fc.b_0') + k = layers.fc(input=keys, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), + bias_attr=name + '_key_fc.b_0') + v = layers.fc(input=values, + size=d_value * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), + bias_attr=name + '_value_fc.b_0') return q, k, v def __split_heads(x, n_head): @@ -110,8 +107,10 @@ def multi_head_attention(queries, product += attn_bias weights = layers.softmax(product) if dropout_rate: - weights = layers.dropout( - weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) + weights = layers.dropout(weights, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) out = layers.matmul(weights, v) return out @@ -133,12 +132,11 @@ def multi_head_attention(queries, out = __combine_heads(ctx_multiheads) # Project back to the model size. - proj_out = layers.fc( - input=out, - size=d_model, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), - bias_attr=name + '_output_fc.b_0') + proj_out = layers.fc(input=out, + size=d_model, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), + bias_attr=name + '_output_fc.b_0') return proj_out @@ -148,22 +146,22 @@ def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, p This module consists of two linear transformations with a ReLU activation in between, which is applied to each position separately and identically. 
""" - hidden = layers.fc( - input=x, - size=d_inner_hid, - num_flatten_dims=2, - act=hidden_act, - param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), - bias_attr=name + '_fc_0.b_0') + hidden = layers.fc(input=x, + size=d_inner_hid, + num_flatten_dims=2, + act=hidden_act, + param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), + bias_attr=name + '_fc_0.b_0') if dropout_rate: - hidden = layers.dropout( - hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.fc( - input=hidden, - size=d_hid, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), - bias_attr=name + '_fc_1.b_0') + hidden = layers.dropout(hidden, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.fc(input=hidden, + size=d_hid, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), + bias_attr=name + '_fc_1.b_0') return out @@ -181,17 +179,20 @@ def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name='') out_dtype = out.dtype if out_dtype == fluid.core.VarDesc.VarType.FP16: out = layers.cast(x=out, dtype="float32") - out = layers.layer_norm( - out, - begin_norm_axis=len(out.shape) - 1, - param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), - bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.))) + out = layers.layer_norm(out, + begin_norm_axis=len(out.shape) - 1, + param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', + initializer=fluid.initializer.Constant(0.))) if out_dtype == fluid.core.VarDesc.VarType.FP16: out = layers.cast(x=out, dtype="float16") elif cmd == "d": # add dropout if dropout_rate: - out = layers.dropout( - out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) + out = layers.dropout(out, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) return out @@ -220,28 +221,35 @@ def encoder_layer(enc_input, with the post_process_layer to add residual connection, layer normalization and droput. 
""" - attn_output = multi_head_attention( - pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'), - None, - None, - attn_bias, - d_key, - d_value, - d_model, - n_head, - attention_dropout, - param_initializer=param_initializer, - name=name + '_multi_head_att') - attn_output = post_process_layer( - enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att') - ffd_output = positionwise_feed_forward( - pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'), - d_inner_hid, - d_model, - relu_dropout, - hidden_act, - param_initializer=param_initializer, - name=name + '_ffn') + attn_output = multi_head_attention(pre_process_layer(enc_input, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_att'), + None, + None, + attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + param_initializer=param_initializer, + name=name + '_multi_head_att') + attn_output = post_process_layer(enc_input, + attn_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_att') + ffd_output = positionwise_feed_forward(pre_process_layer(attn_output, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_ffn'), + d_inner_hid, + d_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + '_ffn') return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') @@ -266,22 +274,21 @@ def encoder(enc_input, encoder_layer. """ for i in range(n_layer): - enc_output = encoder_layer( - enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd, - postprocess_cmd, - param_initializer=param_initializer, - name=name + '_layer_' + str(i)) + enc_output = encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_layer_' + str(i)) enc_input = enc_output enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") diff --git a/modules/text/semantic_model/bert_multi_cased_L_12_H_768_A_12/module.py b/modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/module.py similarity index 89% rename from modules/text/semantic_model/bert_multi_cased_L_12_H_768_A_12/module.py rename to modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/module.py index abf59f7a69190286fe0501a0b65d6c8563ef451b..64ef5ef24958aaf15310569d0563f381d56b3d2f 100644 --- a/modules/text/semantic_model/bert_multi_cased_L_12_H_768_A_12/module.py +++ b/modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/module.py @@ -58,13 +58,12 @@ class Bert(TransformerModule): pooled_output (tensor): sentence-level output for classification task. sequence_output (tensor): token-level output for sequence task. 
""" - bert = BertModel( - src_ids=input_ids, - position_ids=position_ids, - sentence_ids=segment_ids, - input_mask=input_mask, - config=self.bert_config, - use_fp16=False) + bert = BertModel(src_ids=input_ids, + position_ids=position_ids, + sentence_ids=segment_ids, + input_mask=input_mask, + config=self.bert_config, + use_fp16=False) pooled_output = bert.get_pooled_output() sequence_output = bert.get_sequence_output() return pooled_output, sequence_output diff --git a/modules/text/semantic_model/bert_multi_uncased_L_12_H_768_A_12/README.md b/modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/README.md similarity index 100% rename from modules/text/semantic_model/bert_multi_uncased_L_12_H_768_A_12/README.md rename to modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/README.md diff --git a/modules/text/semantic_model/bert_multi_uncased_L_12_H_768_A_12/__init__.py b/modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/__init__.py similarity index 100% rename from modules/text/semantic_model/bert_multi_uncased_L_12_H_768_A_12/__init__.py rename to modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/__init__.py diff --git a/modules/text/semantic_model/bert_multi_uncased_L_12_H_768_A_12/model/__init__.py b/modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/model/__init__.py similarity index 100% rename from modules/text/semantic_model/bert_multi_uncased_L_12_H_768_A_12/model/__init__.py rename to modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/model/__init__.py diff --git a/modules/text/semantic_model/bert_multi_uncased_L_12_H_768_A_12/model/bert.py b/modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/model/bert.py similarity index 51% rename from modules/text/semantic_model/bert_multi_uncased_L_12_H_768_A_12/model/bert.py rename to modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/model/bert.py index dc02d7e84de1bbc4b16c9c556ea0a8e6790ffa36..538a79c831483d9c1c84d407d1ad5f163737ad92 100644 --- a/modules/text/semantic_model/bert_multi_uncased_L_12_H_768_A_12/model/bert.py +++ b/modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/model/bert.py @@ -74,23 +74,23 @@ class BertModel(object): def _build_model(self, src_ids, position_ids, sentence_ids, input_mask): # padding id in vocabulary must be set to 0 - emb_out = fluid.layers.embedding( - input=src_ids, - size=[self._voc_size, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), - is_sparse=False) - position_emb_out = fluid.layers.embedding( - input=position_ids, - size=[self._max_position_seq_len, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) - - sent_emb_out = fluid.layers.embedding( - sentence_ids, - size=[self._sent_types, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) + emb_out = fluid.layers.embedding(input=src_ids, + size=[self._voc_size, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._word_emb_name, + initializer=self._param_initializer), + is_sparse=False) + position_emb_out = fluid.layers.embedding(input=position_ids, + size=[self._max_position_seq_len, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._pos_emb_name, + initializer=self._param_initializer)) + + sent_emb_out = fluid.layers.embedding(sentence_ids, + 
size=[self._sent_types, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._sent_emb_name, + initializer=self._param_initializer)) emb_out = emb_out + position_emb_out emb_out = emb_out + sent_emb_out @@ -105,23 +105,22 @@ class BertModel(object): n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask.stop_gradient = True - self._enc_out = encoder( - enc_input=emb_out, - attn_bias=n_head_self_attn_mask, - n_layer=self._n_layer, - n_head=self._n_head, - d_key=self._emb_size // self._n_head, - d_value=self._emb_size // self._n_head, - d_model=self._emb_size, - d_inner_hid=self._emb_size * 4, - prepostprocess_dropout=self._prepostprocess_dropout, - attention_dropout=self._attention_dropout, - relu_dropout=0, - hidden_act=self._hidden_act, - preprocess_cmd="", - postprocess_cmd="dan", - param_initializer=self._param_initializer, - name='encoder') + self._enc_out = encoder(enc_input=emb_out, + attn_bias=n_head_self_attn_mask, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._emb_size // self._n_head, + d_value=self._emb_size // self._n_head, + d_model=self._emb_size, + d_inner_hid=self._emb_size * 4, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + hidden_act=self._hidden_act, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer, + name='encoder') def get_sequence_output(self): return self._enc_out @@ -130,12 +129,12 @@ class BertModel(object): """Get the first feature of each sequence for classification""" next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) - next_sent_feat = fluid.layers.fc( - input=next_sent_feat, - size=self._emb_size, - act="tanh", - param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), - bias_attr="pooled_fc.b_0") + next_sent_feat = fluid.layers.fc(input=next_sent_feat, + size=self._emb_size, + act="tanh", + param_attr=fluid.ParamAttr(name="pooled_fc.w_0", + initializer=self._param_initializer), + bias_attr="pooled_fc.b_0") return next_sent_feat def get_pretraining_output(self, mask_label, mask_pos, labels): @@ -150,43 +149,45 @@ class BertModel(object): mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) # transform: fc - mask_trans_feat = fluid.layers.fc( - input=mask_feat, - size=self._emb_size, - act=self._hidden_act, - param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer), - bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) + mask_trans_feat = fluid.layers.fc(input=mask_feat, + size=self._emb_size, + act=self._hidden_act, + param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', + initializer=self._param_initializer), + bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) # transform: layer norm mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans') - mask_lm_out_bias_attr = fluid.ParamAttr( - name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0)) + mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0", + initializer=fluid.initializer.Constant(value=0.0)) if self._weight_sharing: - fc_out = fluid.layers.matmul( - x=mask_trans_feat, - y=fluid.default_main_program().global_block().var(self._word_emb_name), - transpose_y=True) - fc_out += fluid.layers.create_parameter( - shape=[self._voc_size], dtype=self._dtype, attr=mask_lm_out_bias_attr, is_bias=True) + fc_out = 
fluid.layers.matmul(x=mask_trans_feat, + y=fluid.default_main_program().global_block().var(self._word_emb_name), + transpose_y=True) + fc_out += fluid.layers.create_parameter(shape=[self._voc_size], + dtype=self._dtype, + attr=mask_lm_out_bias_attr, + is_bias=True) else: - fc_out = fluid.layers.fc( - input=mask_trans_feat, - size=self._voc_size, - param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), - bias_attr=mask_lm_out_bias_attr) + fc_out = fluid.layers.fc(input=mask_trans_feat, + size=self._voc_size, + param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", + initializer=self._param_initializer), + bias_attr=mask_lm_out_bias_attr) mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) - next_sent_fc_out = fluid.layers.fc( - input=next_sent_feat, - size=2, - param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", initializer=self._param_initializer), - bias_attr="next_sent_fc.b_0") + next_sent_fc_out = fluid.layers.fc(input=next_sent_feat, + size=2, + param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", + initializer=self._param_initializer), + bias_attr="next_sent_fc.b_0") - next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy( - logits=next_sent_fc_out, label=labels, return_softmax=True) + next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out, + label=labels, + return_softmax=True) next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels) diff --git a/modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/model/transformer_encoder.py b/modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/model/transformer_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..b15d838883fdad1e3432e4ea8715f2320d67929f --- /dev/null +++ b/modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/model/transformer_encoder.py @@ -0,0 +1,295 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Transformer encoder.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from functools import partial + +import paddle.fluid as fluid +import paddle.fluid.layers as layers + + +def multi_head_attention(queries, + keys, + values, + attn_bias, + d_key, + d_value, + d_model, + n_head=1, + dropout_rate=0., + cache=None, + param_initializer=None, + name='multi_head_att'): + """ + Multi-Head Attention. Note that attn_bias is added to the logit before + computing softmax activiation to mask certain selected positions so that + they will not considered in attention weights. 
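> Editorial note: the `multi_head_attention` docstring above points out that `attn_bias` is added to the logits before the softmax so that masked positions receive no attention. The hunks shown here do not include how the bias is built from `input_mask`, so the NumPy sketch below assumes a common additive-mask scheme (a large negative bias on padded keys) purely for illustration; the scale/matmul/softmax path mirrors `scaled_dot_product_attention`.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy single-head example: a length-4 sequence whose last position is padding.
d_key = 64
q = np.random.rand(4, d_key)
k = np.random.rand(4, d_key)
v = np.random.rand(4, d_key)
pad_mask = np.array([1.0, 1.0, 1.0, 0.0])      # 1 = real token, 0 = padding
attn_bias = (pad_mask - 1.0)[None, :] * 1e4    # 0 for real keys, -1e4 for padding

# Same path as scaled_dot_product_attention: scale q, q.k^T, add bias, softmax.
product = (q * d_key ** -0.5) @ k.T + attn_bias
weights = softmax(product)
out = weights @ v

print(weights[:, -1])  # ~0: the padded key gets no attention weight
print(out.shape)       # (4, 64)
```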
+ """ + keys = queries if keys is None else keys + values = keys if values is None else values + + if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): + raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") + + def __compute_qkv(queries, keys, values, n_head, d_key, d_value): + """ + Add linear projection to queries, keys, and values. + """ + q = layers.fc(input=queries, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), + bias_attr=name + '_query_fc.b_0') + k = layers.fc(input=keys, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), + bias_attr=name + '_key_fc.b_0') + v = layers.fc(input=values, + size=d_value * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), + bias_attr=name + '_value_fc.b_0') + return q, k, v + + def __split_heads(x, n_head): + """ + Reshape the last dimension of inpunt tensor x so that it becomes two + dimensions and then transpose. Specifically, input a tensor with shape + [bs, max_sequence_length, n_head * hidden_dim] then output a tensor + with shape [bs, n_head, max_sequence_length, hidden_dim]. + """ + hidden_size = x.shape[-1] + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) + + # permuate the dimensions into: + # [batch_size, n_head, max_sequence_len, hidden_size_per_head] + return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) + + def __combine_heads(x): + """ + Transpose and then reshape the last two dimensions of inpunt tensor x + so that it becomes one dimension, which is reverse to __split_heads. + """ + if len(x.shape) == 3: return x + if len(x.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + + trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) + + def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): + """ + Scaled Dot-Product Attention + """ + scaled_q = layers.scale(x=q, scale=d_key**-0.5) + product = layers.matmul(x=scaled_q, y=k, transpose_y=True) + if attn_bias: + product += attn_bias + weights = layers.softmax(product) + if dropout_rate: + weights = layers.dropout(weights, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.matmul(weights, v) + return out + + q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) + + if cache is not None: # use cache and concat time steps + # Since the inplace reshape in __split_heads changes the shape of k and + # v, which is the cache input for next time step, reshape the cache + # input from the previous time step first. 
+ k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) + v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) + + q = __split_heads(q, n_head) + k = __split_heads(k, n_head) + v = __split_heads(v, n_head) + + ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) + + out = __combine_heads(ctx_multiheads) + + # Project back to the model size. + proj_out = layers.fc(input=out, + size=d_model, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), + bias_attr=name + '_output_fc.b_0') + return proj_out + + +def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): + """ + Position-wise Feed-Forward Networks. + This module consists of two linear transformations with a ReLU activation + in between, which is applied to each position separately and identically. + """ + hidden = layers.fc(input=x, + size=d_inner_hid, + num_flatten_dims=2, + act=hidden_act, + param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), + bias_attr=name + '_fc_0.b_0') + if dropout_rate: + hidden = layers.dropout(hidden, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.fc(input=hidden, + size=d_hid, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), + bias_attr=name + '_fc_1.b_0') + return out + + +def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): + """ + Add residual connection, layer normalization and droput to the out tensor + optionally according to the value of process_cmd. + This will be used before or after multi-head attention and position-wise + feed-forward networks. + """ + for cmd in process_cmd: + if cmd == "a": # add residual connection + out = out + prev_out if prev_out else out + elif cmd == "n": # add layer normalization + out_dtype = out.dtype + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float32") + out = layers.layer_norm(out, + begin_norm_axis=len(out.shape) - 1, + param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', + initializer=fluid.initializer.Constant(0.))) + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float16") + elif cmd == "d": # add dropout + if dropout_rate: + out = layers.dropout(out, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + return out + + +pre_process_layer = partial(pre_post_process_layer, None) +post_process_layer = pre_post_process_layer + + +def encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """The encoder layers that can be stacked to form a deep encoder. + This module consits of a multi-head (self) attention followed by + position-wise feed-forward networks and both the two components companied + with the post_process_layer to add residual connection, layer normalization + and droput. 
+ """ + attn_output = multi_head_attention(pre_process_layer(enc_input, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_att'), + None, + None, + attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + param_initializer=param_initializer, + name=name + '_multi_head_att') + attn_output = post_process_layer(enc_input, + attn_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_att') + ffd_output = positionwise_feed_forward(pre_process_layer(attn_output, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_ffn'), + d_inner_hid, + d_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + '_ffn') + return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') + + +def encoder(enc_input, + attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """ + The encoder is composed of a stack of identical layers returned by calling + encoder_layer. + """ + for i in range(n_layer): + enc_output = encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_layer_' + str(i)) + enc_input = enc_output + enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") + + return enc_output diff --git a/modules/text/semantic_model/bert_multi_uncased_L_12_H_768_A_12/module.py b/modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/module.py similarity index 89% rename from modules/text/semantic_model/bert_multi_uncased_L_12_H_768_A_12/module.py rename to modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/module.py index 5978a30c5fe613993065d40ed83fa7efb77a26a8..f5c17cc3ecbb834c5908d49a28183a7db1d929a0 100644 --- a/modules/text/semantic_model/bert_multi_uncased_L_12_H_768_A_12/module.py +++ b/modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/module.py @@ -58,13 +58,12 @@ class Bert(TransformerModule): pooled_output (tensor): sentence-level output for classification task. sequence_output (tensor): token-level output for sequence task. 
""" - bert = BertModel( - src_ids=input_ids, - position_ids=position_ids, - sentence_ids=segment_ids, - input_mask=input_mask, - config=self.bert_config, - use_fp16=False) + bert = BertModel(src_ids=input_ids, + position_ids=position_ids, + sentence_ids=segment_ids, + input_mask=input_mask, + config=self.bert_config, + use_fp16=False) pooled_output = bert.get_pooled_output() sequence_output = bert.get_sequence_output() return pooled_output, sequence_output diff --git a/modules/text/semantic_model/bert_uncased_L_12_H_768_A_12/README.md b/modules/text/language_model/bert_uncased_L_12_H_768_A_12/README.md similarity index 100% rename from modules/text/semantic_model/bert_uncased_L_12_H_768_A_12/README.md rename to modules/text/language_model/bert_uncased_L_12_H_768_A_12/README.md diff --git a/modules/text/semantic_model/bert_uncased_L_12_H_768_A_12/__init__.py b/modules/text/language_model/bert_uncased_L_12_H_768_A_12/__init__.py similarity index 100% rename from modules/text/semantic_model/bert_uncased_L_12_H_768_A_12/__init__.py rename to modules/text/language_model/bert_uncased_L_12_H_768_A_12/__init__.py diff --git a/modules/text/semantic_model/bert_uncased_L_12_H_768_A_12/model/__init__.py b/modules/text/language_model/bert_uncased_L_12_H_768_A_12/model/__init__.py similarity index 100% rename from modules/text/semantic_model/bert_uncased_L_12_H_768_A_12/model/__init__.py rename to modules/text/language_model/bert_uncased_L_12_H_768_A_12/model/__init__.py diff --git a/modules/text/semantic_model/bert_uncased_L_12_H_768_A_12/model/bert.py b/modules/text/language_model/bert_uncased_L_12_H_768_A_12/model/bert.py similarity index 50% rename from modules/text/semantic_model/bert_uncased_L_12_H_768_A_12/model/bert.py rename to modules/text/language_model/bert_uncased_L_12_H_768_A_12/model/bert.py index cbfa4b24ec652ea511feaf1c330650d191fd7ce4..32d800661ca580e9105fdc88ffd4a3b3d913522a 100644 --- a/modules/text/semantic_model/bert_uncased_L_12_H_768_A_12/model/bert.py +++ b/modules/text/language_model/bert_uncased_L_12_H_768_A_12/model/bert.py @@ -74,23 +74,23 @@ class BertModel(object): def _build_model(self, src_ids, position_ids, sentence_ids, input_mask): # padding id in vocabulary must be set to 0 - emb_out = fluid.layers.embedding( - input=src_ids, - size=[self._voc_size, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), - is_sparse=False) - position_emb_out = fluid.layers.embedding( - input=position_ids, - size=[self._max_position_seq_len, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) - - sent_emb_out = fluid.layers.embedding( - sentence_ids, - size=[self._sent_types, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) + emb_out = fluid.layers.embedding(input=src_ids, + size=[self._voc_size, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._word_emb_name, + initializer=self._param_initializer), + is_sparse=False) + position_emb_out = fluid.layers.embedding(input=position_ids, + size=[self._max_position_seq_len, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._pos_emb_name, + initializer=self._param_initializer)) + + sent_emb_out = fluid.layers.embedding(sentence_ids, + size=[self._sent_types, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._sent_emb_name, + 
initializer=self._param_initializer)) emb_out = emb_out + position_emb_out emb_out = emb_out + sent_emb_out @@ -105,23 +105,22 @@ class BertModel(object): n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask.stop_gradient = True - self._enc_out = encoder( - enc_input=emb_out, - attn_bias=n_head_self_attn_mask, - n_layer=self._n_layer, - n_head=self._n_head, - d_key=self._emb_size // self._n_head, - d_value=self._emb_size // self._n_head, - d_model=self._emb_size, - d_inner_hid=self._emb_size * 4, - prepostprocess_dropout=self._prepostprocess_dropout, - attention_dropout=self._attention_dropout, - relu_dropout=0, - hidden_act=self._hidden_act, - preprocess_cmd="", - postprocess_cmd="dan", - param_initializer=self._param_initializer, - name='encoder') + self._enc_out = encoder(enc_input=emb_out, + attn_bias=n_head_self_attn_mask, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._emb_size // self._n_head, + d_value=self._emb_size // self._n_head, + d_model=self._emb_size, + d_inner_hid=self._emb_size * 4, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + hidden_act=self._hidden_act, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer, + name='encoder') def get_sequence_output(self): return self._enc_out @@ -130,12 +129,12 @@ class BertModel(object): """Get the first feature of each sequence for classification""" next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) - next_sent_feat = fluid.layers.fc( - input=next_sent_feat, - size=self._emb_size, - act="tanh", - param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), - bias_attr="pooled_fc.b_0") + next_sent_feat = fluid.layers.fc(input=next_sent_feat, + size=self._emb_size, + act="tanh", + param_attr=fluid.ParamAttr(name="pooled_fc.w_0", + initializer=self._param_initializer), + bias_attr="pooled_fc.b_0") return next_sent_feat def get_pretraining_output(self, mask_label, mask_pos, labels): @@ -150,43 +149,45 @@ class BertModel(object): mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) # transform: fc - mask_trans_feat = fluid.layers.fc( - input=mask_feat, - size=self._emb_size, - act=self._hidden_act, - param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer), - bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) + mask_trans_feat = fluid.layers.fc(input=mask_feat, + size=self._emb_size, + act=self._hidden_act, + param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', + initializer=self._param_initializer), + bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) # transform: layer norm mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans') - mask_lm_out_bias_attr = fluid.ParamAttr( - name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0)) + mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0", + initializer=fluid.initializer.Constant(value=0.0)) if self._weight_sharing: - fc_out = fluid.layers.matmul( - x=mask_trans_feat, - y=fluid.default_main_program().global_block().var(self._word_emb_name), - transpose_y=True) - fc_out += fluid.layers.create_parameter( - shape=[self._voc_size], dtype=self._dtype, attr=mask_lm_out_bias_attr, is_bias=True) + fc_out = fluid.layers.matmul(x=mask_trans_feat, + y=fluid.default_main_program().global_block().var(self._word_emb_name), + transpose_y=True) + 
fc_out += fluid.layers.create_parameter(shape=[self._voc_size], + dtype=self._dtype, + attr=mask_lm_out_bias_attr, + is_bias=True) else: - fc_out = fluid.layers.fc( - input=mask_trans_feat, - size=self._voc_size, - param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), - bias_attr=mask_lm_out_bias_attr) + fc_out = fluid.layers.fc(input=mask_trans_feat, + size=self._voc_size, + param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", + initializer=self._param_initializer), + bias_attr=mask_lm_out_bias_attr) mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) - next_sent_fc_out = fluid.layers.fc( - input=next_sent_feat, - size=2, - param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", initializer=self._param_initializer), - bias_attr="next_sent_fc.b_0") + next_sent_fc_out = fluid.layers.fc(input=next_sent_feat, + size=2, + param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", + initializer=self._param_initializer), + bias_attr="next_sent_fc.b_0") - next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy( - logits=next_sent_fc_out, label=labels, return_softmax=True) + next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out, + label=labels, + return_softmax=True) next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels) diff --git a/modules/text/language_model/bert_uncased_L_12_H_768_A_12/model/transformer_encoder.py b/modules/text/language_model/bert_uncased_L_12_H_768_A_12/model/transformer_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..b15d838883fdad1e3432e4ea8715f2320d67929f --- /dev/null +++ b/modules/text/language_model/bert_uncased_L_12_H_768_A_12/model/transformer_encoder.py @@ -0,0 +1,295 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Transformer encoder.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from functools import partial + +import paddle.fluid as fluid +import paddle.fluid.layers as layers + + +def multi_head_attention(queries, + keys, + values, + attn_bias, + d_key, + d_value, + d_model, + n_head=1, + dropout_rate=0., + cache=None, + param_initializer=None, + name='multi_head_att'): + """ + Multi-Head Attention. Note that attn_bias is added to the logit before + computing softmax activiation to mask certain selected positions so that + they will not considered in attention weights. + """ + keys = queries if keys is None else keys + values = keys if values is None else values + + if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): + raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") + + def __compute_qkv(queries, keys, values, n_head, d_key, d_value): + """ + Add linear projection to queries, keys, and values. 
+ """ + q = layers.fc(input=queries, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), + bias_attr=name + '_query_fc.b_0') + k = layers.fc(input=keys, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), + bias_attr=name + '_key_fc.b_0') + v = layers.fc(input=values, + size=d_value * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), + bias_attr=name + '_value_fc.b_0') + return q, k, v + + def __split_heads(x, n_head): + """ + Reshape the last dimension of inpunt tensor x so that it becomes two + dimensions and then transpose. Specifically, input a tensor with shape + [bs, max_sequence_length, n_head * hidden_dim] then output a tensor + with shape [bs, n_head, max_sequence_length, hidden_dim]. + """ + hidden_size = x.shape[-1] + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) + + # permuate the dimensions into: + # [batch_size, n_head, max_sequence_len, hidden_size_per_head] + return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) + + def __combine_heads(x): + """ + Transpose and then reshape the last two dimensions of inpunt tensor x + so that it becomes one dimension, which is reverse to __split_heads. + """ + if len(x.shape) == 3: return x + if len(x.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + + trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) + + def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): + """ + Scaled Dot-Product Attention + """ + scaled_q = layers.scale(x=q, scale=d_key**-0.5) + product = layers.matmul(x=scaled_q, y=k, transpose_y=True) + if attn_bias: + product += attn_bias + weights = layers.softmax(product) + if dropout_rate: + weights = layers.dropout(weights, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.matmul(weights, v) + return out + + q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) + + if cache is not None: # use cache and concat time steps + # Since the inplace reshape in __split_heads changes the shape of k and + # v, which is the cache input for next time step, reshape the cache + # input from the previous time step first. + k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) + v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) + + q = __split_heads(q, n_head) + k = __split_heads(k, n_head) + v = __split_heads(v, n_head) + + ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) + + out = __combine_heads(ctx_multiheads) + + # Project back to the model size. 
+ proj_out = layers.fc(input=out, + size=d_model, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), + bias_attr=name + '_output_fc.b_0') + return proj_out + + +def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): + """ + Position-wise Feed-Forward Networks. + This module consists of two linear transformations with a ReLU activation + in between, which is applied to each position separately and identically. + """ + hidden = layers.fc(input=x, + size=d_inner_hid, + num_flatten_dims=2, + act=hidden_act, + param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), + bias_attr=name + '_fc_0.b_0') + if dropout_rate: + hidden = layers.dropout(hidden, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.fc(input=hidden, + size=d_hid, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), + bias_attr=name + '_fc_1.b_0') + return out + + +def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): + """ + Add residual connection, layer normalization and droput to the out tensor + optionally according to the value of process_cmd. + This will be used before or after multi-head attention and position-wise + feed-forward networks. + """ + for cmd in process_cmd: + if cmd == "a": # add residual connection + out = out + prev_out if prev_out else out + elif cmd == "n": # add layer normalization + out_dtype = out.dtype + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float32") + out = layers.layer_norm(out, + begin_norm_axis=len(out.shape) - 1, + param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', + initializer=fluid.initializer.Constant(0.))) + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float16") + elif cmd == "d": # add dropout + if dropout_rate: + out = layers.dropout(out, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + return out + + +pre_process_layer = partial(pre_post_process_layer, None) +post_process_layer = pre_post_process_layer + + +def encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """The encoder layers that can be stacked to form a deep encoder. + This module consits of a multi-head (self) attention followed by + position-wise feed-forward networks and both the two components companied + with the post_process_layer to add residual connection, layer normalization + and droput. 
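`positionwise_feed_forward` above is the usual two-layer, per-position MLP: expand to `d_inner_hid`, apply the configured activation, optionally drop out, then project back to `d_hid`. A self-contained sketch with invented sizes (plain ReLU standing in for whatever `hidden_act` is configured, dropout skipped):

```python
import numpy as np

batch, seq_len, d_model, d_inner = 2, 4, 768, 3072

x = np.random.randn(batch, seq_len, d_model).astype("float32")
w0 = 0.02 * np.random.randn(d_model, d_inner).astype("float32")
b0 = np.zeros(d_inner, dtype="float32")
w1 = 0.02 * np.random.randn(d_inner, d_model).astype("float32")
b1 = np.zeros(d_model, dtype="float32")

# Applied identically at every position: expand, activate, project back.
hidden = np.maximum(x @ w0 + b0, 0.0)   # first fc + ReLU (dropout omitted)
out = hidden @ w1 + b1                  # second fc back to d_model
print(out.shape)                        # (2, 4, 768)
```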
+ """ + attn_output = multi_head_attention(pre_process_layer(enc_input, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_att'), + None, + None, + attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + param_initializer=param_initializer, + name=name + '_multi_head_att') + attn_output = post_process_layer(enc_input, + attn_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_att') + ffd_output = positionwise_feed_forward(pre_process_layer(attn_output, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_ffn'), + d_inner_hid, + d_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + '_ffn') + return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') + + +def encoder(enc_input, + attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """ + The encoder is composed of a stack of identical layers returned by calling + encoder_layer. + """ + for i in range(n_layer): + enc_output = encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_layer_' + str(i)) + enc_input = enc_output + enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") + + return enc_output diff --git a/modules/text/semantic_model/bert_uncased_L_12_H_768_A_12/module.py b/modules/text/language_model/bert_uncased_L_12_H_768_A_12/module.py similarity index 89% rename from modules/text/semantic_model/bert_uncased_L_12_H_768_A_12/module.py rename to modules/text/language_model/bert_uncased_L_12_H_768_A_12/module.py index 6038a2e22f824dd6e04914dc8f0caf34f1e64ffe..2ce0e1e6b5ee2d8b2f00ce3d3c3f4482f295f4ad 100644 --- a/modules/text/semantic_model/bert_uncased_L_12_H_768_A_12/module.py +++ b/modules/text/language_model/bert_uncased_L_12_H_768_A_12/module.py @@ -58,13 +58,12 @@ class Bert(TransformerModule): pooled_output (tensor): sentence-level output for classification task. sequence_output (tensor): token-level output for sequence task. 
""" - bert = BertModel( - src_ids=input_ids, - position_ids=position_ids, - sentence_ids=segment_ids, - input_mask=input_mask, - config=self.bert_config, - use_fp16=False) + bert = BertModel(src_ids=input_ids, + position_ids=position_ids, + sentence_ids=segment_ids, + input_mask=input_mask, + config=self.bert_config, + use_fp16=False) pooled_output = bert.get_pooled_output() sequence_output = bert.get_sequence_output() return pooled_output, sequence_output diff --git a/modules/text/semantic_model/bert_uncased_L_24_H_1024_A_16/README.md b/modules/text/language_model/bert_uncased_L_24_H_1024_A_16/README.md similarity index 100% rename from modules/text/semantic_model/bert_uncased_L_24_H_1024_A_16/README.md rename to modules/text/language_model/bert_uncased_L_24_H_1024_A_16/README.md diff --git a/modules/text/semantic_model/bert_uncased_L_24_H_1024_A_16/__init__.py b/modules/text/language_model/bert_uncased_L_24_H_1024_A_16/__init__.py similarity index 100% rename from modules/text/semantic_model/bert_uncased_L_24_H_1024_A_16/__init__.py rename to modules/text/language_model/bert_uncased_L_24_H_1024_A_16/__init__.py diff --git a/modules/text/semantic_model/bert_uncased_L_24_H_1024_A_16/model/__init__.py b/modules/text/language_model/bert_uncased_L_24_H_1024_A_16/model/__init__.py similarity index 100% rename from modules/text/semantic_model/bert_uncased_L_24_H_1024_A_16/model/__init__.py rename to modules/text/language_model/bert_uncased_L_24_H_1024_A_16/model/__init__.py diff --git a/modules/text/semantic_model/bert_uncased_L_24_H_1024_A_16/model/bert.py b/modules/text/language_model/bert_uncased_L_24_H_1024_A_16/model/bert.py similarity index 50% rename from modules/text/semantic_model/bert_uncased_L_24_H_1024_A_16/model/bert.py rename to modules/text/language_model/bert_uncased_L_24_H_1024_A_16/model/bert.py index 9f4b1d9a428b229a3d6490b1678b4c083733e4db..f3f536f67c9c00912b6871575c131823adb57202 100644 --- a/modules/text/semantic_model/bert_uncased_L_24_H_1024_A_16/model/bert.py +++ b/modules/text/language_model/bert_uncased_L_24_H_1024_A_16/model/bert.py @@ -74,23 +74,23 @@ class BertModel(object): def _build_model(self, src_ids, position_ids, sentence_ids, input_mask): # padding id in vocabulary must be set to 0 - emb_out = fluid.layers.embedding( - input=src_ids, - size=[self._voc_size, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), - is_sparse=False) - position_emb_out = fluid.layers.embedding( - input=position_ids, - size=[self._max_position_seq_len, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) - - sent_emb_out = fluid.layers.embedding( - sentence_ids, - size=[self._sent_types, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) + emb_out = fluid.layers.embedding(input=src_ids, + size=[self._voc_size, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._word_emb_name, + initializer=self._param_initializer), + is_sparse=False) + position_emb_out = fluid.layers.embedding(input=position_ids, + size=[self._max_position_seq_len, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._pos_emb_name, + initializer=self._param_initializer)) + + sent_emb_out = fluid.layers.embedding(sentence_ids, + size=[self._sent_types, self._emb_size], + dtype=self._dtype, + 
param_attr=fluid.ParamAttr(name=self._sent_emb_name, + initializer=self._param_initializer)) emb_out = emb_out + position_emb_out emb_out = emb_out + sent_emb_out @@ -105,23 +105,22 @@ class BertModel(object): n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask.stop_gradient = True - self._enc_out = encoder( - enc_input=emb_out, - attn_bias=n_head_self_attn_mask, - n_layer=self._n_layer, - n_head=self._n_head, - d_key=self._emb_size // self._n_head, - d_value=self._emb_size // self._n_head, - d_model=self._emb_size, - d_inner_hid=self._emb_size * 4, - prepostprocess_dropout=self._prepostprocess_dropout, - attention_dropout=self._attention_dropout, - relu_dropout=0, - hidden_act=self._hidden_act, - preprocess_cmd="", - postprocess_cmd="dan", - param_initializer=self._param_initializer, - name='encoder') + self._enc_out = encoder(enc_input=emb_out, + attn_bias=n_head_self_attn_mask, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._emb_size // self._n_head, + d_value=self._emb_size // self._n_head, + d_model=self._emb_size, + d_inner_hid=self._emb_size * 4, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + hidden_act=self._hidden_act, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer, + name='encoder') def get_sequence_output(self): return self._enc_out @@ -130,12 +129,12 @@ class BertModel(object): """Get the first feature of each sequence for classification""" next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) - next_sent_feat = fluid.layers.fc( - input=next_sent_feat, - size=self._emb_size, - act="tanh", - param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), - bias_attr="pooled_fc.b_0") + next_sent_feat = fluid.layers.fc(input=next_sent_feat, + size=self._emb_size, + act="tanh", + param_attr=fluid.ParamAttr(name="pooled_fc.w_0", + initializer=self._param_initializer), + bias_attr="pooled_fc.b_0") return next_sent_feat def get_pretraining_output(self, mask_label, mask_pos, labels): @@ -150,43 +149,45 @@ class BertModel(object): mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) # transform: fc - mask_trans_feat = fluid.layers.fc( - input=mask_feat, - size=self._emb_size, - act=self._hidden_act, - param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer), - bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) + mask_trans_feat = fluid.layers.fc(input=mask_feat, + size=self._emb_size, + act=self._hidden_act, + param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', + initializer=self._param_initializer), + bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) # transform: layer norm mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans') - mask_lm_out_bias_attr = fluid.ParamAttr( - name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0)) + mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0", + initializer=fluid.initializer.Constant(value=0.0)) if self._weight_sharing: - fc_out = fluid.layers.matmul( - x=mask_trans_feat, - y=fluid.default_main_program().global_block().var(self._word_emb_name), - transpose_y=True) - fc_out += fluid.layers.create_parameter( - shape=[self._voc_size], dtype=self._dtype, attr=mask_lm_out_bias_attr, is_bias=True) + fc_out = fluid.layers.matmul(x=mask_trans_feat, + 
y=fluid.default_main_program().global_block().var(self._word_emb_name), + transpose_y=True) + fc_out += fluid.layers.create_parameter(shape=[self._voc_size], + dtype=self._dtype, + attr=mask_lm_out_bias_attr, + is_bias=True) else: - fc_out = fluid.layers.fc( - input=mask_trans_feat, - size=self._voc_size, - param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), - bias_attr=mask_lm_out_bias_attr) + fc_out = fluid.layers.fc(input=mask_trans_feat, + size=self._voc_size, + param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", + initializer=self._param_initializer), + bias_attr=mask_lm_out_bias_attr) mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) - next_sent_fc_out = fluid.layers.fc( - input=next_sent_feat, - size=2, - param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", initializer=self._param_initializer), - bias_attr="next_sent_fc.b_0") + next_sent_fc_out = fluid.layers.fc(input=next_sent_feat, + size=2, + param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", + initializer=self._param_initializer), + bias_attr="next_sent_fc.b_0") - next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy( - logits=next_sent_fc_out, label=labels, return_softmax=True) + next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out, + label=labels, + return_softmax=True) next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels) diff --git a/modules/text/language_model/bert_uncased_L_24_H_1024_A_16/model/transformer_encoder.py b/modules/text/language_model/bert_uncased_L_24_H_1024_A_16/model/transformer_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..b15d838883fdad1e3432e4ea8715f2320d67929f --- /dev/null +++ b/modules/text/language_model/bert_uncased_L_24_H_1024_A_16/model/transformer_encoder.py @@ -0,0 +1,295 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Transformer encoder.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from functools import partial + +import paddle.fluid as fluid +import paddle.fluid.layers as layers + + +def multi_head_attention(queries, + keys, + values, + attn_bias, + d_key, + d_value, + d_model, + n_head=1, + dropout_rate=0., + cache=None, + param_initializer=None, + name='multi_head_att'): + """ + Multi-Head Attention. Note that attn_bias is added to the logit before + computing softmax activiation to mask certain selected positions so that + they will not considered in attention weights. 
+ """ + keys = queries if keys is None else keys + values = keys if values is None else values + + if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): + raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") + + def __compute_qkv(queries, keys, values, n_head, d_key, d_value): + """ + Add linear projection to queries, keys, and values. + """ + q = layers.fc(input=queries, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), + bias_attr=name + '_query_fc.b_0') + k = layers.fc(input=keys, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), + bias_attr=name + '_key_fc.b_0') + v = layers.fc(input=values, + size=d_value * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), + bias_attr=name + '_value_fc.b_0') + return q, k, v + + def __split_heads(x, n_head): + """ + Reshape the last dimension of inpunt tensor x so that it becomes two + dimensions and then transpose. Specifically, input a tensor with shape + [bs, max_sequence_length, n_head * hidden_dim] then output a tensor + with shape [bs, n_head, max_sequence_length, hidden_dim]. + """ + hidden_size = x.shape[-1] + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) + + # permuate the dimensions into: + # [batch_size, n_head, max_sequence_len, hidden_size_per_head] + return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) + + def __combine_heads(x): + """ + Transpose and then reshape the last two dimensions of inpunt tensor x + so that it becomes one dimension, which is reverse to __split_heads. + """ + if len(x.shape) == 3: return x + if len(x.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + + trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) + + def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): + """ + Scaled Dot-Product Attention + """ + scaled_q = layers.scale(x=q, scale=d_key**-0.5) + product = layers.matmul(x=scaled_q, y=k, transpose_y=True) + if attn_bias: + product += attn_bias + weights = layers.softmax(product) + if dropout_rate: + weights = layers.dropout(weights, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.matmul(weights, v) + return out + + q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) + + if cache is not None: # use cache and concat time steps + # Since the inplace reshape in __split_heads changes the shape of k and + # v, which is the cache input for next time step, reshape the cache + # input from the previous time step first. 
+ k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) + v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) + + q = __split_heads(q, n_head) + k = __split_heads(k, n_head) + v = __split_heads(v, n_head) + + ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) + + out = __combine_heads(ctx_multiheads) + + # Project back to the model size. + proj_out = layers.fc(input=out, + size=d_model, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), + bias_attr=name + '_output_fc.b_0') + return proj_out + + +def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): + """ + Position-wise Feed-Forward Networks. + This module consists of two linear transformations with a ReLU activation + in between, which is applied to each position separately and identically. + """ + hidden = layers.fc(input=x, + size=d_inner_hid, + num_flatten_dims=2, + act=hidden_act, + param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), + bias_attr=name + '_fc_0.b_0') + if dropout_rate: + hidden = layers.dropout(hidden, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.fc(input=hidden, + size=d_hid, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), + bias_attr=name + '_fc_1.b_0') + return out + + +def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): + """ + Add residual connection, layer normalization and droput to the out tensor + optionally according to the value of process_cmd. + This will be used before or after multi-head attention and position-wise + feed-forward networks. + """ + for cmd in process_cmd: + if cmd == "a": # add residual connection + out = out + prev_out if prev_out else out + elif cmd == "n": # add layer normalization + out_dtype = out.dtype + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float32") + out = layers.layer_norm(out, + begin_norm_axis=len(out.shape) - 1, + param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', + initializer=fluid.initializer.Constant(0.))) + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float16") + elif cmd == "d": # add dropout + if dropout_rate: + out = layers.dropout(out, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + return out + + +pre_process_layer = partial(pre_post_process_layer, None) +post_process_layer = pre_post_process_layer + + +def encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """The encoder layers that can be stacked to form a deep encoder. + This module consits of a multi-head (self) attention followed by + position-wise feed-forward networks and both the two components companied + with the post_process_layer to add residual connection, layer normalization + and droput. 
+ """ + attn_output = multi_head_attention(pre_process_layer(enc_input, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_att'), + None, + None, + attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + param_initializer=param_initializer, + name=name + '_multi_head_att') + attn_output = post_process_layer(enc_input, + attn_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_att') + ffd_output = positionwise_feed_forward(pre_process_layer(attn_output, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_ffn'), + d_inner_hid, + d_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + '_ffn') + return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') + + +def encoder(enc_input, + attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """ + The encoder is composed of a stack of identical layers returned by calling + encoder_layer. + """ + for i in range(n_layer): + enc_output = encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_layer_' + str(i)) + enc_input = enc_output + enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") + + return enc_output diff --git a/modules/text/semantic_model/bert_uncased_L_24_H_1024_A_16/module.py b/modules/text/language_model/bert_uncased_L_24_H_1024_A_16/module.py similarity index 89% rename from modules/text/semantic_model/bert_uncased_L_24_H_1024_A_16/module.py rename to modules/text/language_model/bert_uncased_L_24_H_1024_A_16/module.py index efeeabd08fb0fac471c7ccf83c8b18ffa9080f69..d1df5b1ca52cb5aee1fe50b7b2b069b371c2d4a6 100644 --- a/modules/text/semantic_model/bert_uncased_L_24_H_1024_A_16/module.py +++ b/modules/text/language_model/bert_uncased_L_24_H_1024_A_16/module.py @@ -58,13 +58,12 @@ class Bert(TransformerModule): pooled_output (tensor): sentence-level output for classification task. sequence_output (tensor): token-level output for sequence task. 
""" - bert = BertModel( - src_ids=input_ids, - position_ids=position_ids, - sentence_ids=segment_ids, - input_mask=input_mask, - config=self.bert_config, - use_fp16=False) + bert = BertModel(src_ids=input_ids, + position_ids=position_ids, + sentence_ids=segment_ids, + input_mask=input_mask, + config=self.bert_config, + use_fp16=False) pooled_output = bert.get_pooled_output() sequence_output = bert.get_sequence_output() return pooled_output, sequence_output diff --git a/modules/text/semantic_model/chinese_bert_wwm/README.md b/modules/text/language_model/chinese_bert_wwm/README.md similarity index 100% rename from modules/text/semantic_model/chinese_bert_wwm/README.md rename to modules/text/language_model/chinese_bert_wwm/README.md diff --git a/modules/text/semantic_model/chinese_bert_wwm/__init__.py b/modules/text/language_model/chinese_bert_wwm/__init__.py similarity index 100% rename from modules/text/semantic_model/chinese_bert_wwm/__init__.py rename to modules/text/language_model/chinese_bert_wwm/__init__.py diff --git a/modules/text/semantic_model/chinese_bert_wwm/model/__init__.py b/modules/text/language_model/chinese_bert_wwm/model/__init__.py similarity index 100% rename from modules/text/semantic_model/chinese_bert_wwm/model/__init__.py rename to modules/text/language_model/chinese_bert_wwm/model/__init__.py diff --git a/modules/text/semantic_model/chinese_bert_wwm/model/bert.py b/modules/text/language_model/chinese_bert_wwm/model/bert.py similarity index 50% rename from modules/text/semantic_model/chinese_bert_wwm/model/bert.py rename to modules/text/language_model/chinese_bert_wwm/model/bert.py index 19d991f373788305050a5f807fd90fdb500efc0f..819bdbad4b2715d91672ad57dce43f8143461515 100644 --- a/modules/text/semantic_model/chinese_bert_wwm/model/bert.py +++ b/modules/text/language_model/chinese_bert_wwm/model/bert.py @@ -74,23 +74,23 @@ class BertModel(object): def _build_model(self, src_ids, position_ids, sentence_ids, input_mask): # padding id in vocabulary must be set to 0 - emb_out = fluid.layers.embedding( - input=src_ids, - size=[self._voc_size, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), - is_sparse=False) - position_emb_out = fluid.layers.embedding( - input=position_ids, - size=[self._max_position_seq_len, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) - - sent_emb_out = fluid.layers.embedding( - sentence_ids, - size=[self._sent_types, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) + emb_out = fluid.layers.embedding(input=src_ids, + size=[self._voc_size, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._word_emb_name, + initializer=self._param_initializer), + is_sparse=False) + position_emb_out = fluid.layers.embedding(input=position_ids, + size=[self._max_position_seq_len, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._pos_emb_name, + initializer=self._param_initializer)) + + sent_emb_out = fluid.layers.embedding(sentence_ids, + size=[self._sent_types, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._sent_emb_name, + initializer=self._param_initializer)) emb_out = emb_out + position_emb_out emb_out = emb_out + sent_emb_out @@ -105,23 +105,22 @@ class BertModel(object): n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * 
self._n_head, axis=1) n_head_self_attn_mask.stop_gradient = True - self._enc_out = encoder( - enc_input=emb_out, - attn_bias=n_head_self_attn_mask, - n_layer=self._n_layer, - n_head=self._n_head, - d_key=self._emb_size // self._n_head, - d_value=self._emb_size // self._n_head, - d_model=self._emb_size, - d_inner_hid=self._emb_size * 4, - prepostprocess_dropout=self._prepostprocess_dropout, - attention_dropout=self._attention_dropout, - relu_dropout=0, - hidden_act=self._hidden_act, - preprocess_cmd="", - postprocess_cmd="dan", - param_initializer=self._param_initializer, - name='encoder') + self._enc_out = encoder(enc_input=emb_out, + attn_bias=n_head_self_attn_mask, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._emb_size // self._n_head, + d_value=self._emb_size // self._n_head, + d_model=self._emb_size, + d_inner_hid=self._emb_size * 4, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + hidden_act=self._hidden_act, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer, + name='encoder') def get_sequence_output(self): return self._enc_out @@ -130,12 +129,12 @@ class BertModel(object): """Get the first feature of each sequence for classification""" next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) - next_sent_feat = fluid.layers.fc( - input=next_sent_feat, - size=self._emb_size, - act="tanh", - param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), - bias_attr="pooled_fc.b_0") + next_sent_feat = fluid.layers.fc(input=next_sent_feat, + size=self._emb_size, + act="tanh", + param_attr=fluid.ParamAttr(name="pooled_fc.w_0", + initializer=self._param_initializer), + bias_attr="pooled_fc.b_0") return next_sent_feat def get_pretraining_output(self, mask_label, mask_pos, labels): @@ -150,43 +149,45 @@ class BertModel(object): mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) # transform: fc - mask_trans_feat = fluid.layers.fc( - input=mask_feat, - size=self._emb_size, - act=self._hidden_act, - param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer), - bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) + mask_trans_feat = fluid.layers.fc(input=mask_feat, + size=self._emb_size, + act=self._hidden_act, + param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', + initializer=self._param_initializer), + bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) # transform: layer norm mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans') - mask_lm_out_bias_attr = fluid.ParamAttr( - name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0)) + mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0", + initializer=fluid.initializer.Constant(value=0.0)) if self._weight_sharing: - fc_out = fluid.layers.matmul( - x=mask_trans_feat, - y=fluid.default_main_program().global_block().var(self._word_emb_name), - transpose_y=True) - fc_out += fluid.layers.create_parameter( - shape=[self._voc_size], dtype=self._dtype, attr=mask_lm_out_bias_attr, is_bias=True) + fc_out = fluid.layers.matmul(x=mask_trans_feat, + y=fluid.default_main_program().global_block().var(self._word_emb_name), + transpose_y=True) + fc_out += fluid.layers.create_parameter(shape=[self._voc_size], + dtype=self._dtype, + attr=mask_lm_out_bias_attr, + is_bias=True) else: - fc_out = fluid.layers.fc( - input=mask_trans_feat, - size=self._voc_size, - 
param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), - bias_attr=mask_lm_out_bias_attr) + fc_out = fluid.layers.fc(input=mask_trans_feat, + size=self._voc_size, + param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", + initializer=self._param_initializer), + bias_attr=mask_lm_out_bias_attr) mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) - next_sent_fc_out = fluid.layers.fc( - input=next_sent_feat, - size=2, - param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", initializer=self._param_initializer), - bias_attr="next_sent_fc.b_0") + next_sent_fc_out = fluid.layers.fc(input=next_sent_feat, + size=2, + param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", + initializer=self._param_initializer), + bias_attr="next_sent_fc.b_0") - next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy( - logits=next_sent_fc_out, label=labels, return_softmax=True) + next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out, + label=labels, + return_softmax=True) next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels) diff --git a/modules/text/language_model/chinese_bert_wwm/model/transformer_encoder.py b/modules/text/language_model/chinese_bert_wwm/model/transformer_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..b15d838883fdad1e3432e4ea8715f2320d67929f --- /dev/null +++ b/modules/text/language_model/chinese_bert_wwm/model/transformer_encoder.py @@ -0,0 +1,295 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Transformer encoder.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from functools import partial + +import paddle.fluid as fluid +import paddle.fluid.layers as layers + + +def multi_head_attention(queries, + keys, + values, + attn_bias, + d_key, + d_value, + d_model, + n_head=1, + dropout_rate=0., + cache=None, + param_initializer=None, + name='multi_head_att'): + """ + Multi-Head Attention. Note that attn_bias is added to the logit before + computing softmax activiation to mask certain selected positions so that + they will not considered in attention weights. + """ + keys = queries if keys is None else keys + values = keys if values is None else values + + if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): + raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") + + def __compute_qkv(queries, keys, values, n_head, d_key, d_value): + """ + Add linear projection to queries, keys, and values. 
+ """ + q = layers.fc(input=queries, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), + bias_attr=name + '_query_fc.b_0') + k = layers.fc(input=keys, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), + bias_attr=name + '_key_fc.b_0') + v = layers.fc(input=values, + size=d_value * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), + bias_attr=name + '_value_fc.b_0') + return q, k, v + + def __split_heads(x, n_head): + """ + Reshape the last dimension of inpunt tensor x so that it becomes two + dimensions and then transpose. Specifically, input a tensor with shape + [bs, max_sequence_length, n_head * hidden_dim] then output a tensor + with shape [bs, n_head, max_sequence_length, hidden_dim]. + """ + hidden_size = x.shape[-1] + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) + + # permuate the dimensions into: + # [batch_size, n_head, max_sequence_len, hidden_size_per_head] + return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) + + def __combine_heads(x): + """ + Transpose and then reshape the last two dimensions of inpunt tensor x + so that it becomes one dimension, which is reverse to __split_heads. + """ + if len(x.shape) == 3: return x + if len(x.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + + trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) + + def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): + """ + Scaled Dot-Product Attention + """ + scaled_q = layers.scale(x=q, scale=d_key**-0.5) + product = layers.matmul(x=scaled_q, y=k, transpose_y=True) + if attn_bias: + product += attn_bias + weights = layers.softmax(product) + if dropout_rate: + weights = layers.dropout(weights, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.matmul(weights, v) + return out + + q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) + + if cache is not None: # use cache and concat time steps + # Since the inplace reshape in __split_heads changes the shape of k and + # v, which is the cache input for next time step, reshape the cache + # input from the previous time step first. + k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) + v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) + + q = __split_heads(q, n_head) + k = __split_heads(k, n_head) + v = __split_heads(v, n_head) + + ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) + + out = __combine_heads(ctx_multiheads) + + # Project back to the model size. 
+ proj_out = layers.fc(input=out, + size=d_model, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), + bias_attr=name + '_output_fc.b_0') + return proj_out + + +def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): + """ + Position-wise Feed-Forward Networks. + This module consists of two linear transformations with a ReLU activation + in between, which is applied to each position separately and identically. + """ + hidden = layers.fc(input=x, + size=d_inner_hid, + num_flatten_dims=2, + act=hidden_act, + param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), + bias_attr=name + '_fc_0.b_0') + if dropout_rate: + hidden = layers.dropout(hidden, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.fc(input=hidden, + size=d_hid, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), + bias_attr=name + '_fc_1.b_0') + return out + + +def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): + """ + Add residual connection, layer normalization and droput to the out tensor + optionally according to the value of process_cmd. + This will be used before or after multi-head attention and position-wise + feed-forward networks. + """ + for cmd in process_cmd: + if cmd == "a": # add residual connection + out = out + prev_out if prev_out else out + elif cmd == "n": # add layer normalization + out_dtype = out.dtype + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float32") + out = layers.layer_norm(out, + begin_norm_axis=len(out.shape) - 1, + param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', + initializer=fluid.initializer.Constant(0.))) + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float16") + elif cmd == "d": # add dropout + if dropout_rate: + out = layers.dropout(out, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + return out + + +pre_process_layer = partial(pre_post_process_layer, None) +post_process_layer = pre_post_process_layer + + +def encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """The encoder layers that can be stacked to form a deep encoder. + This module consits of a multi-head (self) attention followed by + position-wise feed-forward networks and both the two components companied + with the post_process_layer to add residual connection, layer normalization + and droput. 
+ """ + attn_output = multi_head_attention(pre_process_layer(enc_input, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_att'), + None, + None, + attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + param_initializer=param_initializer, + name=name + '_multi_head_att') + attn_output = post_process_layer(enc_input, + attn_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_att') + ffd_output = positionwise_feed_forward(pre_process_layer(attn_output, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_ffn'), + d_inner_hid, + d_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + '_ffn') + return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') + + +def encoder(enc_input, + attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """ + The encoder is composed of a stack of identical layers returned by calling + encoder_layer. + """ + for i in range(n_layer): + enc_output = encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_layer_' + str(i)) + enc_input = enc_output + enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") + + return enc_output diff --git a/modules/text/semantic_model/chinese_bert_wwm/module.py b/modules/text/language_model/chinese_bert_wwm/module.py similarity index 89% rename from modules/text/semantic_model/chinese_bert_wwm/module.py rename to modules/text/language_model/chinese_bert_wwm/module.py index ce2b8b3d369965dedba943bff3788fb29d36d7f4..70ea366d1afec7df4af90314cbd87bd991a38293 100644 --- a/modules/text/semantic_model/chinese_bert_wwm/module.py +++ b/modules/text/language_model/chinese_bert_wwm/module.py @@ -58,13 +58,12 @@ class BertWwm(TransformerModule): pooled_output (tensor): sentence-level output for classification task. sequence_output (tensor): token-level output for sequence task. 
""" - bert = BertModel( - src_ids=input_ids, - position_ids=position_ids, - sentence_ids=segment_ids, - input_mask=input_mask, - config=self.bert_config, - use_fp16=False) + bert = BertModel(src_ids=input_ids, + position_ids=position_ids, + sentence_ids=segment_ids, + input_mask=input_mask, + config=self.bert_config, + use_fp16=False) pooled_output = bert.get_pooled_output() sequence_output = bert.get_sequence_output() return pooled_output, sequence_output diff --git a/modules/text/semantic_model/chinese_bert_wwm_ext/README.md b/modules/text/language_model/chinese_bert_wwm_ext/README.md similarity index 100% rename from modules/text/semantic_model/chinese_bert_wwm_ext/README.md rename to modules/text/language_model/chinese_bert_wwm_ext/README.md diff --git a/modules/text/semantic_model/chinese_bert_wwm_ext/__init__.py b/modules/text/language_model/chinese_bert_wwm_ext/__init__.py similarity index 100% rename from modules/text/semantic_model/chinese_bert_wwm_ext/__init__.py rename to modules/text/language_model/chinese_bert_wwm_ext/__init__.py diff --git a/modules/text/semantic_model/chinese_bert_wwm_ext/model/__init__.py b/modules/text/language_model/chinese_bert_wwm_ext/model/__init__.py similarity index 100% rename from modules/text/semantic_model/chinese_bert_wwm_ext/model/__init__.py rename to modules/text/language_model/chinese_bert_wwm_ext/model/__init__.py diff --git a/modules/text/semantic_model/chinese_bert_wwm_ext/model/bert.py b/modules/text/language_model/chinese_bert_wwm_ext/model/bert.py similarity index 50% rename from modules/text/semantic_model/chinese_bert_wwm_ext/model/bert.py rename to modules/text/language_model/chinese_bert_wwm_ext/model/bert.py index 782289289387a59a75db41ed972fa8e5394c19a3..cf2a32c177cec3062d0ba7ce8e3371bec79687b6 100644 --- a/modules/text/semantic_model/chinese_bert_wwm_ext/model/bert.py +++ b/modules/text/language_model/chinese_bert_wwm_ext/model/bert.py @@ -74,23 +74,23 @@ class BertModel(object): def _build_model(self, src_ids, position_ids, sentence_ids, input_mask): # padding id in vocabulary must be set to 0 - emb_out = fluid.layers.embedding( - input=src_ids, - size=[self._voc_size, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), - is_sparse=False) - position_emb_out = fluid.layers.embedding( - input=position_ids, - size=[self._max_position_seq_len, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) - - sent_emb_out = fluid.layers.embedding( - sentence_ids, - size=[self._sent_types, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) + emb_out = fluid.layers.embedding(input=src_ids, + size=[self._voc_size, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._word_emb_name, + initializer=self._param_initializer), + is_sparse=False) + position_emb_out = fluid.layers.embedding(input=position_ids, + size=[self._max_position_seq_len, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._pos_emb_name, + initializer=self._param_initializer)) + + sent_emb_out = fluid.layers.embedding(sentence_ids, + size=[self._sent_types, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._sent_emb_name, + initializer=self._param_initializer)) emb_out = emb_out + position_emb_out emb_out = emb_out + sent_emb_out @@ -105,23 +105,22 @@ class 
BertModel(object): n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask.stop_gradient = True - self._enc_out = encoder( - enc_input=emb_out, - attn_bias=n_head_self_attn_mask, - n_layer=self._n_layer, - n_head=self._n_head, - d_key=self._emb_size // self._n_head, - d_value=self._emb_size // self._n_head, - d_model=self._emb_size, - d_inner_hid=self._emb_size * 4, - prepostprocess_dropout=self._prepostprocess_dropout, - attention_dropout=self._attention_dropout, - relu_dropout=0, - hidden_act=self._hidden_act, - preprocess_cmd="", - postprocess_cmd="dan", - param_initializer=self._param_initializer, - name='encoder') + self._enc_out = encoder(enc_input=emb_out, + attn_bias=n_head_self_attn_mask, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._emb_size // self._n_head, + d_value=self._emb_size // self._n_head, + d_model=self._emb_size, + d_inner_hid=self._emb_size * 4, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + hidden_act=self._hidden_act, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer, + name='encoder') def get_sequence_output(self): return self._enc_out @@ -130,12 +129,12 @@ class BertModel(object): """Get the first feature of each sequence for classification""" next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) - next_sent_feat = fluid.layers.fc( - input=next_sent_feat, - size=self._emb_size, - act="tanh", - param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), - bias_attr="pooled_fc.b_0") + next_sent_feat = fluid.layers.fc(input=next_sent_feat, + size=self._emb_size, + act="tanh", + param_attr=fluid.ParamAttr(name="pooled_fc.w_0", + initializer=self._param_initializer), + bias_attr="pooled_fc.b_0") return next_sent_feat def get_pretraining_output(self, mask_label, mask_pos, labels): @@ -150,43 +149,45 @@ class BertModel(object): mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) # transform: fc - mask_trans_feat = fluid.layers.fc( - input=mask_feat, - size=self._emb_size, - act=self._hidden_act, - param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer), - bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) + mask_trans_feat = fluid.layers.fc(input=mask_feat, + size=self._emb_size, + act=self._hidden_act, + param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', + initializer=self._param_initializer), + bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) # transform: layer norm mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans') - mask_lm_out_bias_attr = fluid.ParamAttr( - name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0)) + mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0", + initializer=fluid.initializer.Constant(value=0.0)) if self._weight_sharing: - fc_out = fluid.layers.matmul( - x=mask_trans_feat, - y=fluid.default_main_program().global_block().var(self._word_emb_name), - transpose_y=True) - fc_out += fluid.layers.create_parameter( - shape=[self._voc_size], dtype=self._dtype, attr=mask_lm_out_bias_attr, is_bias=True) + fc_out = fluid.layers.matmul(x=mask_trans_feat, + y=fluid.default_main_program().global_block().var(self._word_emb_name), + transpose_y=True) + fc_out += fluid.layers.create_parameter(shape=[self._voc_size], + dtype=self._dtype, + attr=mask_lm_out_bias_attr, + is_bias=True) 
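As above, the per-head mask is just the same `[batch, seq, seq]` attention bias replicated `n_head` times with `stop_gradient=True`; the bias itself is derived from the padding mask so that disallowed positions receive a large negative value before the softmax. An illustrative NumPy construction (the exact scaling used by the module lives outside this hunk, so treat the numbers as assumptions):

```python
import numpy as np

n_head = 12
# 1 for real tokens, 0 for padding (hypothetical input_mask of shape [batch, seq]).
input_mask = np.array([[1, 1, 1, 0],
                       [1, 1, 0, 0]], dtype="float32")

# Pairwise validity: position (i, j) counts only if both tokens are real.
pair_mask = input_mask[:, :, None] * input_mask[:, None, :]               # [batch, seq, seq]

# Additive bias: 0 where attention is allowed, -10000 where it is not,
# then replicated per head (cf. stacking self_attn_mask n_head times).
attn_bias = (pair_mask - 1.0) * 10000.0                                   # [batch, seq, seq]
n_head_attn_bias = np.repeat(attn_bias[:, None, :, :], n_head, axis=1)    # [batch, n_head, seq, seq]
print(n_head_attn_bias.shape)                                             # (2, 12, 4, 4)
```

Adding such a bias to the scaled `QK^T` logits drives the softmax weight of padded keys towards zero, and `stop_gradient=True` keeps the mask out of the backward pass.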
else: - fc_out = fluid.layers.fc( - input=mask_trans_feat, - size=self._voc_size, - param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), - bias_attr=mask_lm_out_bias_attr) + fc_out = fluid.layers.fc(input=mask_trans_feat, + size=self._voc_size, + param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", + initializer=self._param_initializer), + bias_attr=mask_lm_out_bias_attr) mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) - next_sent_fc_out = fluid.layers.fc( - input=next_sent_feat, - size=2, - param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", initializer=self._param_initializer), - bias_attr="next_sent_fc.b_0") + next_sent_fc_out = fluid.layers.fc(input=next_sent_feat, + size=2, + param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", + initializer=self._param_initializer), + bias_attr="next_sent_fc.b_0") - next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy( - logits=next_sent_fc_out, label=labels, return_softmax=True) + next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out, + label=labels, + return_softmax=True) next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels) diff --git a/modules/text/language_model/chinese_bert_wwm_ext/model/transformer_encoder.py b/modules/text/language_model/chinese_bert_wwm_ext/model/transformer_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..b15d838883fdad1e3432e4ea8715f2320d67929f --- /dev/null +++ b/modules/text/language_model/chinese_bert_wwm_ext/model/transformer_encoder.py @@ -0,0 +1,295 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Transformer encoder.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from functools import partial + +import paddle.fluid as fluid +import paddle.fluid.layers as layers + + +def multi_head_attention(queries, + keys, + values, + attn_bias, + d_key, + d_value, + d_model, + n_head=1, + dropout_rate=0., + cache=None, + param_initializer=None, + name='multi_head_att'): + """ + Multi-Head Attention. Note that attn_bias is added to the logit before + computing softmax activiation to mask certain selected positions so that + they will not considered in attention weights. + """ + keys = queries if keys is None else keys + values = keys if values is None else values + + if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): + raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") + + def __compute_qkv(queries, keys, values, n_head, d_key, d_value): + """ + Add linear projection to queries, keys, and values. 
+ """ + q = layers.fc(input=queries, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), + bias_attr=name + '_query_fc.b_0') + k = layers.fc(input=keys, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), + bias_attr=name + '_key_fc.b_0') + v = layers.fc(input=values, + size=d_value * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), + bias_attr=name + '_value_fc.b_0') + return q, k, v + + def __split_heads(x, n_head): + """ + Reshape the last dimension of inpunt tensor x so that it becomes two + dimensions and then transpose. Specifically, input a tensor with shape + [bs, max_sequence_length, n_head * hidden_dim] then output a tensor + with shape [bs, n_head, max_sequence_length, hidden_dim]. + """ + hidden_size = x.shape[-1] + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) + + # permuate the dimensions into: + # [batch_size, n_head, max_sequence_len, hidden_size_per_head] + return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) + + def __combine_heads(x): + """ + Transpose and then reshape the last two dimensions of inpunt tensor x + so that it becomes one dimension, which is reverse to __split_heads. + """ + if len(x.shape) == 3: return x + if len(x.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + + trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) + + def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): + """ + Scaled Dot-Product Attention + """ + scaled_q = layers.scale(x=q, scale=d_key**-0.5) + product = layers.matmul(x=scaled_q, y=k, transpose_y=True) + if attn_bias: + product += attn_bias + weights = layers.softmax(product) + if dropout_rate: + weights = layers.dropout(weights, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.matmul(weights, v) + return out + + q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) + + if cache is not None: # use cache and concat time steps + # Since the inplace reshape in __split_heads changes the shape of k and + # v, which is the cache input for next time step, reshape the cache + # input from the previous time step first. + k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) + v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) + + q = __split_heads(q, n_head) + k = __split_heads(k, n_head) + v = __split_heads(v, n_head) + + ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) + + out = __combine_heads(ctx_multiheads) + + # Project back to the model size. 
+ proj_out = layers.fc(input=out, + size=d_model, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), + bias_attr=name + '_output_fc.b_0') + return proj_out + + +def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): + """ + Position-wise Feed-Forward Networks. + This module consists of two linear transformations with a ReLU activation + in between, which is applied to each position separately and identically. + """ + hidden = layers.fc(input=x, + size=d_inner_hid, + num_flatten_dims=2, + act=hidden_act, + param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), + bias_attr=name + '_fc_0.b_0') + if dropout_rate: + hidden = layers.dropout(hidden, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.fc(input=hidden, + size=d_hid, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), + bias_attr=name + '_fc_1.b_0') + return out + + +def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): + """ + Add residual connection, layer normalization and droput to the out tensor + optionally according to the value of process_cmd. + This will be used before or after multi-head attention and position-wise + feed-forward networks. + """ + for cmd in process_cmd: + if cmd == "a": # add residual connection + out = out + prev_out if prev_out else out + elif cmd == "n": # add layer normalization + out_dtype = out.dtype + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float32") + out = layers.layer_norm(out, + begin_norm_axis=len(out.shape) - 1, + param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', + initializer=fluid.initializer.Constant(0.))) + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float16") + elif cmd == "d": # add dropout + if dropout_rate: + out = layers.dropout(out, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + return out + + +pre_process_layer = partial(pre_post_process_layer, None) +post_process_layer = pre_post_process_layer + + +def encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """The encoder layers that can be stacked to form a deep encoder. + This module consits of a multi-head (self) attention followed by + position-wise feed-forward networks and both the two components companied + with the post_process_layer to add residual connection, layer normalization + and droput. 
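pre_post_process_layer interprets a small command string: 'a' adds the residual, 'n' applies layer normalization (with an FP16 cast in the fluid code), and 'd' applies dropout, in the order the letters appear; the model hunks above call the encoder with preprocess_cmd="" and postprocess_cmd="dan". A minimal NumPy sketch of that dispatch, assuming a toy layer norm without learned scale and bias:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_post_process(prev_out, out, process_cmd, dropout_rate=0.0, training=False):
    # 'a': residual add, 'n': layer norm, 'd': dropout -- applied in the order given
    for cmd in process_cmd:
        if cmd == "a" and prev_out is not None:
            out = out + prev_out
        elif cmd == "n":
            out = layer_norm(out)
        elif cmd == "d" and dropout_rate and training:
            keep = np.random.rand(*out.shape) >= dropout_rate
            out = out * keep / (1.0 - dropout_rate)   # "upscale_in_train" behaviour
    return out

x = np.random.rand(2, 4, 8)
sub_out = np.random.rand(2, 4, 8)
# postprocess_cmd="dan": dropout, then add the residual, then layer norm
y = pre_post_process(x, sub_out, "dan", dropout_rate=0.1, training=True)
print(y.shape)  # (2, 4, 8)
```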
+ """ + attn_output = multi_head_attention(pre_process_layer(enc_input, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_att'), + None, + None, + attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + param_initializer=param_initializer, + name=name + '_multi_head_att') + attn_output = post_process_layer(enc_input, + attn_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_att') + ffd_output = positionwise_feed_forward(pre_process_layer(attn_output, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_ffn'), + d_inner_hid, + d_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + '_ffn') + return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') + + +def encoder(enc_input, + attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """ + The encoder is composed of a stack of identical layers returned by calling + encoder_layer. + """ + for i in range(n_layer): + enc_output = encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_layer_' + str(i)) + enc_input = enc_output + enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") + + return enc_output diff --git a/modules/text/semantic_model/chinese_bert_wwm_ext/module.py b/modules/text/language_model/chinese_bert_wwm_ext/module.py similarity index 89% rename from modules/text/semantic_model/chinese_bert_wwm_ext/module.py rename to modules/text/language_model/chinese_bert_wwm_ext/module.py index d11fea1c3eae159ca2821b9ad1750021a31048b4..273b2f02db1207e25fa4997378b5917f930a8aa9 100644 --- a/modules/text/semantic_model/chinese_bert_wwm_ext/module.py +++ b/modules/text/language_model/chinese_bert_wwm_ext/module.py @@ -58,13 +58,12 @@ class BertWwm(TransformerModule): pooled_output (tensor): sentence-level output for classification task. sequence_output (tensor): token-level output for sequence task. 
""" - bert = BertModel( - src_ids=input_ids, - position_ids=position_ids, - sentence_ids=segment_ids, - input_mask=input_mask, - config=self.bert_config, - use_fp16=False) + bert = BertModel(src_ids=input_ids, + position_ids=position_ids, + sentence_ids=segment_ids, + input_mask=input_mask, + config=self.bert_config, + use_fp16=False) pooled_output = bert.get_pooled_output() sequence_output = bert.get_sequence_output() return pooled_output, sequence_output diff --git a/modules/text/semantic_model/chinese_electra_base/README.md b/modules/text/language_model/chinese_electra_base/README.md similarity index 100% rename from modules/text/semantic_model/chinese_electra_base/README.md rename to modules/text/language_model/chinese_electra_base/README.md diff --git a/modules/text/semantic_model/chinese_electra_base/__init__.py b/modules/text/language_model/chinese_electra_base/__init__.py similarity index 100% rename from modules/text/semantic_model/chinese_electra_base/__init__.py rename to modules/text/language_model/chinese_electra_base/__init__.py diff --git a/modules/text/semantic_model/chinese_electra_base/model/__init__.py b/modules/text/language_model/chinese_electra_base/model/__init__.py similarity index 100% rename from modules/text/semantic_model/chinese_electra_base/model/__init__.py rename to modules/text/language_model/chinese_electra_base/model/__init__.py diff --git a/modules/text/semantic_model/chinese_electra_base/model/electra.py b/modules/text/language_model/chinese_electra_base/model/electra.py similarity index 53% rename from modules/text/semantic_model/chinese_electra_base/model/electra.py rename to modules/text/language_model/chinese_electra_base/model/electra.py index 0b81e89717de211a52f8c6217fe582b4708c66a3..a2b647d14b4470ef7cb878a7f6b1c693adfe3a1e 100644 --- a/modules/text/semantic_model/chinese_electra_base/model/electra.py +++ b/modules/text/language_model/chinese_electra_base/model/electra.py @@ -74,23 +74,23 @@ class ElectraModel(object): def _build_model(self, src_ids, position_ids, sentence_ids, input_mask): # padding id in vocabulary must be set to 0 - emb_out = fluid.layers.embedding( - input=src_ids, - size=[self._voc_size, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), - is_sparse=False) - position_emb_out = fluid.layers.embedding( - input=position_ids, - size=[self._max_position_seq_len, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) - - sent_emb_out = fluid.layers.embedding( - sentence_ids, - size=[self._sent_types, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) + emb_out = fluid.layers.embedding(input=src_ids, + size=[self._voc_size, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._word_emb_name, + initializer=self._param_initializer), + is_sparse=False) + position_emb_out = fluid.layers.embedding(input=position_ids, + size=[self._max_position_seq_len, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._pos_emb_name, + initializer=self._param_initializer)) + + sent_emb_out = fluid.layers.embedding(sentence_ids, + size=[self._sent_types, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._sent_emb_name, + initializer=self._param_initializer)) emb_out = emb_out + position_emb_out emb_out = emb_out + sent_emb_out @@ -105,23 +105,22 @@ 
class ElectraModel(object): n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask.stop_gradient = True - self._enc_out = encoder( - enc_input=emb_out, - attn_bias=n_head_self_attn_mask, - n_layer=self._n_layer, - n_head=self._n_head, - d_key=self._emb_size // self._n_head, - d_value=self._emb_size // self._n_head, - d_model=self._emb_size, - d_inner_hid=self._emb_size * 4, - prepostprocess_dropout=self._prepostprocess_dropout, - attention_dropout=self._attention_dropout, - relu_dropout=0, - hidden_act=self._hidden_act, - preprocess_cmd="", - postprocess_cmd="dan", - param_initializer=self._param_initializer, - name='encoder') + self._enc_out = encoder(enc_input=emb_out, + attn_bias=n_head_self_attn_mask, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._emb_size // self._n_head, + d_value=self._emb_size // self._n_head, + d_model=self._emb_size, + d_inner_hid=self._emb_size * 4, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + hidden_act=self._hidden_act, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer, + name='encoder') def get_sequence_output(self): return self._enc_out @@ -143,43 +142,45 @@ class ElectraModel(object): mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) # transform: fc - mask_trans_feat = fluid.layers.fc( - input=mask_feat, - size=self._emb_size, - act=self._hidden_act, - param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer), - bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) + mask_trans_feat = fluid.layers.fc(input=mask_feat, + size=self._emb_size, + act=self._hidden_act, + param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', + initializer=self._param_initializer), + bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) # transform: layer norm mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans') - mask_lm_out_bias_attr = fluid.ParamAttr( - name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0)) + mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0", + initializer=fluid.initializer.Constant(value=0.0)) if self._weight_sharing: - fc_out = fluid.layers.matmul( - x=mask_trans_feat, - y=fluid.default_main_program().global_block().var(self._word_emb_name), - transpose_y=True) - fc_out += fluid.layers.create_parameter( - shape=[self._voc_size], dtype=self._dtype, attr=mask_lm_out_bias_attr, is_bias=True) + fc_out = fluid.layers.matmul(x=mask_trans_feat, + y=fluid.default_main_program().global_block().var(self._word_emb_name), + transpose_y=True) + fc_out += fluid.layers.create_parameter(shape=[self._voc_size], + dtype=self._dtype, + attr=mask_lm_out_bias_attr, + is_bias=True) else: - fc_out = fluid.layers.fc( - input=mask_trans_feat, - size=self._voc_size, - param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), - bias_attr=mask_lm_out_bias_attr) + fc_out = fluid.layers.fc(input=mask_trans_feat, + size=self._voc_size, + param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", + initializer=self._param_initializer), + bias_attr=mask_lm_out_bias_attr) mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) - next_sent_fc_out = fluid.layers.fc( - input=next_sent_feat, - size=2, - param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", 
initializer=self._param_initializer), - bias_attr="next_sent_fc.b_0") + next_sent_fc_out = fluid.layers.fc(input=next_sent_feat, + size=2, + param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", + initializer=self._param_initializer), + bias_attr="next_sent_fc.b_0") - next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy( - logits=next_sent_fc_out, label=labels, return_softmax=True) + next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out, + label=labels, + return_softmax=True) next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels) diff --git a/modules/text/language_model/chinese_electra_base/model/transformer_encoder.py b/modules/text/language_model/chinese_electra_base/model/transformer_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..b15d838883fdad1e3432e4ea8715f2320d67929f --- /dev/null +++ b/modules/text/language_model/chinese_electra_base/model/transformer_encoder.py @@ -0,0 +1,295 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Transformer encoder.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from functools import partial + +import paddle.fluid as fluid +import paddle.fluid.layers as layers + + +def multi_head_attention(queries, + keys, + values, + attn_bias, + d_key, + d_value, + d_model, + n_head=1, + dropout_rate=0., + cache=None, + param_initializer=None, + name='multi_head_att'): + """ + Multi-Head Attention. Note that attn_bias is added to the logit before + computing softmax activiation to mask certain selected positions so that + they will not considered in attention weights. + """ + keys = queries if keys is None else keys + values = keys if values is None else values + + if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): + raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") + + def __compute_qkv(queries, keys, values, n_head, d_key, d_value): + """ + Add linear projection to queries, keys, and values. + """ + q = layers.fc(input=queries, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), + bias_attr=name + '_query_fc.b_0') + k = layers.fc(input=keys, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), + bias_attr=name + '_key_fc.b_0') + v = layers.fc(input=values, + size=d_value * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), + bias_attr=name + '_value_fc.b_0') + return q, k, v + + def __split_heads(x, n_head): + """ + Reshape the last dimension of inpunt tensor x so that it becomes two + dimensions and then transpose. 
Specifically, input a tensor with shape + [bs, max_sequence_length, n_head * hidden_dim] then output a tensor + with shape [bs, n_head, max_sequence_length, hidden_dim]. + """ + hidden_size = x.shape[-1] + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) + + # permuate the dimensions into: + # [batch_size, n_head, max_sequence_len, hidden_size_per_head] + return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) + + def __combine_heads(x): + """ + Transpose and then reshape the last two dimensions of inpunt tensor x + so that it becomes one dimension, which is reverse to __split_heads. + """ + if len(x.shape) == 3: return x + if len(x.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + + trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) + + def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): + """ + Scaled Dot-Product Attention + """ + scaled_q = layers.scale(x=q, scale=d_key**-0.5) + product = layers.matmul(x=scaled_q, y=k, transpose_y=True) + if attn_bias: + product += attn_bias + weights = layers.softmax(product) + if dropout_rate: + weights = layers.dropout(weights, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.matmul(weights, v) + return out + + q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) + + if cache is not None: # use cache and concat time steps + # Since the inplace reshape in __split_heads changes the shape of k and + # v, which is the cache input for next time step, reshape the cache + # input from the previous time step first. + k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) + v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) + + q = __split_heads(q, n_head) + k = __split_heads(k, n_head) + v = __split_heads(v, n_head) + + ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) + + out = __combine_heads(ctx_multiheads) + + # Project back to the model size. + proj_out = layers.fc(input=out, + size=d_model, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), + bias_attr=name + '_output_fc.b_0') + return proj_out + + +def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): + """ + Position-wise Feed-Forward Networks. + This module consists of two linear transformations with a ReLU activation + in between, which is applied to each position separately and identically. 
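positionwise_feed_forward is two dense layers applied independently at every position: an expansion to d_inner_hid with the hidden activation, then a projection back to d_hid; the encoder above sets d_inner_hid to four times d_model. A NumPy sketch, using ReLU and random weights purely for illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def positionwise_ffn(x, w0, b0, w1, b1, act=relu):
    # x: [bs, seq, d_hid]; w0: [d_hid, d_inner_hid]; w1: [d_inner_hid, d_hid]
    hidden = act(x @ w0 + b0)    # expand, per position
    return hidden @ w1 + b1      # project back, per position

bs, seq, d_hid, d_inner = 2, 4, 8, 32   # d_inner_hid = 4 * d_model in the encoder above
x = np.random.rand(bs, seq, d_hid)
w0, b0 = np.random.rand(d_hid, d_inner), np.zeros(d_inner)
w1, b1 = np.random.rand(d_inner, d_hid), np.zeros(d_hid)
print(positionwise_ffn(x, w0, b0, w1, b1).shape)  # (2, 4, 8)
```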
+ """ + hidden = layers.fc(input=x, + size=d_inner_hid, + num_flatten_dims=2, + act=hidden_act, + param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), + bias_attr=name + '_fc_0.b_0') + if dropout_rate: + hidden = layers.dropout(hidden, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.fc(input=hidden, + size=d_hid, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), + bias_attr=name + '_fc_1.b_0') + return out + + +def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): + """ + Add residual connection, layer normalization and droput to the out tensor + optionally according to the value of process_cmd. + This will be used before or after multi-head attention and position-wise + feed-forward networks. + """ + for cmd in process_cmd: + if cmd == "a": # add residual connection + out = out + prev_out if prev_out else out + elif cmd == "n": # add layer normalization + out_dtype = out.dtype + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float32") + out = layers.layer_norm(out, + begin_norm_axis=len(out.shape) - 1, + param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', + initializer=fluid.initializer.Constant(0.))) + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float16") + elif cmd == "d": # add dropout + if dropout_rate: + out = layers.dropout(out, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + return out + + +pre_process_layer = partial(pre_post_process_layer, None) +post_process_layer = pre_post_process_layer + + +def encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """The encoder layers that can be stacked to form a deep encoder. + This module consits of a multi-head (self) attention followed by + position-wise feed-forward networks and both the two components companied + with the post_process_layer to add residual connection, layer normalization + and droput. 
+ """ + attn_output = multi_head_attention(pre_process_layer(enc_input, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_att'), + None, + None, + attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + param_initializer=param_initializer, + name=name + '_multi_head_att') + attn_output = post_process_layer(enc_input, + attn_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_att') + ffd_output = positionwise_feed_forward(pre_process_layer(attn_output, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_ffn'), + d_inner_hid, + d_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + '_ffn') + return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') + + +def encoder(enc_input, + attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """ + The encoder is composed of a stack of identical layers returned by calling + encoder_layer. + """ + for i in range(n_layer): + enc_output = encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_layer_' + str(i)) + enc_input = enc_output + enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") + + return enc_output diff --git a/modules/text/semantic_model/chinese_electra_base/module.py b/modules/text/language_model/chinese_electra_base/module.py similarity index 88% rename from modules/text/semantic_model/chinese_electra_base/module.py rename to modules/text/language_model/chinese_electra_base/module.py index b96b87d879752e4eb62c5d5c2db8fae857e58b52..8a24ffd3dd1bfec72e76065a24f35d5f150ba534 100644 --- a/modules/text/semantic_model/chinese_electra_base/module.py +++ b/modules/text/language_model/chinese_electra_base/module.py @@ -58,13 +58,12 @@ class Electra(TransformerModule): pooled_output (tensor): sentence-level output for classification task. sequence_output (tensor): token-level output for sequence task. 
""" - electra = ElectraModel( - src_ids=input_ids, - position_ids=position_ids, - sentence_ids=segment_ids, - input_mask=input_mask, - config=self.electra_config, - use_fp16=False) + electra = ElectraModel(src_ids=input_ids, + position_ids=position_ids, + sentence_ids=segment_ids, + input_mask=input_mask, + config=self.electra_config, + use_fp16=False) pooled_output = electra.get_pooled_output() sequence_output = electra.get_sequence_output() return pooled_output, sequence_output diff --git a/modules/text/semantic_model/chinese_electra_small/README.md b/modules/text/language_model/chinese_electra_small/README.md similarity index 100% rename from modules/text/semantic_model/chinese_electra_small/README.md rename to modules/text/language_model/chinese_electra_small/README.md diff --git a/modules/text/semantic_model/chinese_electra_small/__init__.py b/modules/text/language_model/chinese_electra_small/__init__.py similarity index 100% rename from modules/text/semantic_model/chinese_electra_small/__init__.py rename to modules/text/language_model/chinese_electra_small/__init__.py diff --git a/modules/text/semantic_model/chinese_electra_small/model/__init__.py b/modules/text/language_model/chinese_electra_small/model/__init__.py similarity index 100% rename from modules/text/semantic_model/chinese_electra_small/model/__init__.py rename to modules/text/language_model/chinese_electra_small/model/__init__.py diff --git a/modules/text/semantic_model/chinese_electra_small/model/electra.py b/modules/text/language_model/chinese_electra_small/model/electra.py similarity index 51% rename from modules/text/semantic_model/chinese_electra_small/model/electra.py rename to modules/text/language_model/chinese_electra_small/model/electra.py index 083da60e6f842200f02a88dbc5c68b8b278c3d77..e1ec68f8ae76a7e62d059ef4f3266b2e2b89f513 100644 --- a/modules/text/semantic_model/chinese_electra_small/model/electra.py +++ b/modules/text/language_model/chinese_electra_small/model/electra.py @@ -75,23 +75,23 @@ class ElectraModel(object): def _build_model(self, src_ids, position_ids, sentence_ids, input_mask): # padding id in vocabulary must be set to 0 - emb_out = fluid.layers.embedding( - input=src_ids, - size=[self._voc_size, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), - is_sparse=False) - position_emb_out = fluid.layers.embedding( - input=position_ids, - size=[self._max_position_seq_len, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) - - sent_emb_out = fluid.layers.embedding( - sentence_ids, - size=[self._sent_types, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) + emb_out = fluid.layers.embedding(input=src_ids, + size=[self._voc_size, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._word_emb_name, + initializer=self._param_initializer), + is_sparse=False) + position_emb_out = fluid.layers.embedding(input=position_ids, + size=[self._max_position_seq_len, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._pos_emb_name, + initializer=self._param_initializer)) + + sent_emb_out = fluid.layers.embedding(sentence_ids, + size=[self._sent_types, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._sent_emb_name, + initializer=self._param_initializer)) emb_out = emb_out + position_emb_out emb_out = 
emb_out + sent_emb_out @@ -99,13 +99,13 @@ class ElectraModel(object): emb_out = pre_process_layer(emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder') if self._emb_size != self._hidden_size: - emb_out = fluid.layers.fc( - input=emb_out, - size=self._hidden_size, - act=None, - param_attr=fluid.ParamAttr(name="embeddings_project.w_0", initializer=self._param_initializer), - num_flatten_dims=2, - bias_attr="embeddings_project.b_0") + emb_out = fluid.layers.fc(input=emb_out, + size=self._hidden_size, + act=None, + param_attr=fluid.ParamAttr(name="embeddings_project.w_0", + initializer=self._param_initializer), + num_flatten_dims=2, + bias_attr="embeddings_project.b_0") if self._dtype == "float16": input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype) @@ -115,23 +115,22 @@ class ElectraModel(object): n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask.stop_gradient = True - self._enc_out = encoder( - enc_input=emb_out, - attn_bias=n_head_self_attn_mask, - n_layer=self._n_layer, - n_head=self._n_head, - d_key=self._hidden_size // self._n_head, - d_value=self._hidden_size // self._n_head, - d_model=self._hidden_size, - d_inner_hid=self._hidden_size * 4, - prepostprocess_dropout=self._prepostprocess_dropout, - attention_dropout=self._attention_dropout, - relu_dropout=0, - hidden_act=self._hidden_act, - preprocess_cmd="", - postprocess_cmd="dan", - param_initializer=self._param_initializer, - name='encoder') + self._enc_out = encoder(enc_input=emb_out, + attn_bias=n_head_self_attn_mask, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._hidden_size // self._n_head, + d_value=self._hidden_size // self._n_head, + d_model=self._hidden_size, + d_inner_hid=self._hidden_size * 4, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + hidden_act=self._hidden_act, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer, + name='encoder') def get_sequence_output(self): return self._enc_out @@ -153,43 +152,45 @@ class ElectraModel(object): mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) # transform: fc - mask_trans_feat = fluid.layers.fc( - input=mask_feat, - size=self._hidden_size, - act=self._hidden_act, - param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer), - bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) + mask_trans_feat = fluid.layers.fc(input=mask_feat, + size=self._hidden_size, + act=self._hidden_act, + param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', + initializer=self._param_initializer), + bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) # transform: layer norm mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans') - mask_lm_out_bias_attr = fluid.ParamAttr( - name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0)) + mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0", + initializer=fluid.initializer.Constant(value=0.0)) if self._weight_sharing: - fc_out = fluid.layers.matmul( - x=mask_trans_feat, - y=fluid.default_main_program().global_block().var(self._word_emb_name), - transpose_y=True) - fc_out += fluid.layers.create_parameter( - shape=[self._voc_size], dtype=self._dtype, attr=mask_lm_out_bias_attr, is_bias=True) + fc_out = fluid.layers.matmul(x=mask_trans_feat, + y=fluid.default_main_program().global_block().var(self._word_emb_name), + transpose_y=True) + fc_out 
+= fluid.layers.create_parameter(shape=[self._voc_size], + dtype=self._dtype, + attr=mask_lm_out_bias_attr, + is_bias=True) else: - fc_out = fluid.layers.fc( - input=mask_trans_feat, - size=self._voc_size, - param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), - bias_attr=mask_lm_out_bias_attr) + fc_out = fluid.layers.fc(input=mask_trans_feat, + size=self._voc_size, + param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", + initializer=self._param_initializer), + bias_attr=mask_lm_out_bias_attr) mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) - next_sent_fc_out = fluid.layers.fc( - input=next_sent_feat, - size=2, - param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", initializer=self._param_initializer), - bias_attr="next_sent_fc.b_0") + next_sent_fc_out = fluid.layers.fc(input=next_sent_feat, + size=2, + param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", + initializer=self._param_initializer), + bias_attr="next_sent_fc.b_0") - next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy( - logits=next_sent_fc_out, label=labels, return_softmax=True) + next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out, + label=labels, + return_softmax=True) next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels) diff --git a/modules/text/language_model/chinese_electra_small/model/transformer_encoder.py b/modules/text/language_model/chinese_electra_small/model/transformer_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..b15d838883fdad1e3432e4ea8715f2320d67929f --- /dev/null +++ b/modules/text/language_model/chinese_electra_small/model/transformer_encoder.py @@ -0,0 +1,295 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Transformer encoder.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from functools import partial + +import paddle.fluid as fluid +import paddle.fluid.layers as layers + + +def multi_head_attention(queries, + keys, + values, + attn_bias, + d_key, + d_value, + d_model, + n_head=1, + dropout_rate=0., + cache=None, + param_initializer=None, + name='multi_head_att'): + """ + Multi-Head Attention. Note that attn_bias is added to the logit before + computing softmax activiation to mask certain selected positions so that + they will not considered in attention weights. + """ + keys = queries if keys is None else keys + values = keys if values is None else values + + if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): + raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") + + def __compute_qkv(queries, keys, values, n_head, d_key, d_value): + """ + Add linear projection to queries, keys, and values. 
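In the get_pretraining_output hunks above, the masked-LM logits are produced either by reusing the word-embedding table (weight sharing: a matmul against the embedding matrix transposed plus a learned bias) or by a separate "mask_lm_out_fc" layer. A NumPy sketch of both branches, with illustrative shapes:

```python
import numpy as np

def mask_lm_logits(mask_feat, word_emb, out_bias, out_w=None, weight_sharing=True):
    # mask_feat: [n_masked, emb]; word_emb: [voc_size, emb]; out_bias: [voc_size]
    if weight_sharing:
        # reuse the input embedding table as the output projection
        return mask_feat @ word_emb.T + out_bias
    # otherwise use an independent output matrix out_w: [emb, voc_size]
    return mask_feat @ out_w + out_bias

emb, voc = 8, 50
mask_feat = np.random.rand(3, emb)   # transformed features at the masked positions
word_emb = np.random.rand(voc, emb)
logits = mask_lm_logits(mask_feat, word_emb, np.zeros(voc))
print(logits.shape)  # (3, 50) -- one score per vocabulary entry per masked token
```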
+ """ + q = layers.fc(input=queries, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), + bias_attr=name + '_query_fc.b_0') + k = layers.fc(input=keys, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), + bias_attr=name + '_key_fc.b_0') + v = layers.fc(input=values, + size=d_value * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), + bias_attr=name + '_value_fc.b_0') + return q, k, v + + def __split_heads(x, n_head): + """ + Reshape the last dimension of inpunt tensor x so that it becomes two + dimensions and then transpose. Specifically, input a tensor with shape + [bs, max_sequence_length, n_head * hidden_dim] then output a tensor + with shape [bs, n_head, max_sequence_length, hidden_dim]. + """ + hidden_size = x.shape[-1] + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) + + # permuate the dimensions into: + # [batch_size, n_head, max_sequence_len, hidden_size_per_head] + return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) + + def __combine_heads(x): + """ + Transpose and then reshape the last two dimensions of inpunt tensor x + so that it becomes one dimension, which is reverse to __split_heads. + """ + if len(x.shape) == 3: return x + if len(x.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + + trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) + + def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): + """ + Scaled Dot-Product Attention + """ + scaled_q = layers.scale(x=q, scale=d_key**-0.5) + product = layers.matmul(x=scaled_q, y=k, transpose_y=True) + if attn_bias: + product += attn_bias + weights = layers.softmax(product) + if dropout_rate: + weights = layers.dropout(weights, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.matmul(weights, v) + return out + + q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) + + if cache is not None: # use cache and concat time steps + # Since the inplace reshape in __split_heads changes the shape of k and + # v, which is the cache input for next time step, reshape the cache + # input from the previous time step first. + k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) + v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) + + q = __split_heads(q, n_head) + k = __split_heads(k, n_head) + v = __split_heads(v, n_head) + + ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) + + out = __combine_heads(ctx_multiheads) + + # Project back to the model size. 
+ proj_out = layers.fc(input=out, + size=d_model, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), + bias_attr=name + '_output_fc.b_0') + return proj_out + + +def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): + """ + Position-wise Feed-Forward Networks. + This module consists of two linear transformations with a ReLU activation + in between, which is applied to each position separately and identically. + """ + hidden = layers.fc(input=x, + size=d_inner_hid, + num_flatten_dims=2, + act=hidden_act, + param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), + bias_attr=name + '_fc_0.b_0') + if dropout_rate: + hidden = layers.dropout(hidden, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.fc(input=hidden, + size=d_hid, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), + bias_attr=name + '_fc_1.b_0') + return out + + +def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): + """ + Add residual connection, layer normalization and droput to the out tensor + optionally according to the value of process_cmd. + This will be used before or after multi-head attention and position-wise + feed-forward networks. + """ + for cmd in process_cmd: + if cmd == "a": # add residual connection + out = out + prev_out if prev_out else out + elif cmd == "n": # add layer normalization + out_dtype = out.dtype + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float32") + out = layers.layer_norm(out, + begin_norm_axis=len(out.shape) - 1, + param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', + initializer=fluid.initializer.Constant(0.))) + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float16") + elif cmd == "d": # add dropout + if dropout_rate: + out = layers.dropout(out, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + return out + + +pre_process_layer = partial(pre_post_process_layer, None) +post_process_layer = pre_post_process_layer + + +def encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """The encoder layers that can be stacked to form a deep encoder. + This module consits of a multi-head (self) attention followed by + position-wise feed-forward networks and both the two components companied + with the post_process_layer to add residual connection, layer normalization + and droput. 
+ """ + attn_output = multi_head_attention(pre_process_layer(enc_input, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_att'), + None, + None, + attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + param_initializer=param_initializer, + name=name + '_multi_head_att') + attn_output = post_process_layer(enc_input, + attn_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_att') + ffd_output = positionwise_feed_forward(pre_process_layer(attn_output, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_ffn'), + d_inner_hid, + d_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + '_ffn') + return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') + + +def encoder(enc_input, + attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """ + The encoder is composed of a stack of identical layers returned by calling + encoder_layer. + """ + for i in range(n_layer): + enc_output = encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_layer_' + str(i)) + enc_input = enc_output + enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") + + return enc_output diff --git a/modules/text/semantic_model/chinese_electra_small/module.py b/modules/text/language_model/chinese_electra_small/module.py similarity index 88% rename from modules/text/semantic_model/chinese_electra_small/module.py rename to modules/text/language_model/chinese_electra_small/module.py index 55850d5b2396979848485886f916a1da9c982d2f..ac55aed3e08e7e12775ba0f0e2ab94a900fe160c 100644 --- a/modules/text/semantic_model/chinese_electra_small/module.py +++ b/modules/text/language_model/chinese_electra_small/module.py @@ -58,13 +58,12 @@ class Electra(TransformerModule): pooled_output (tensor): sentence-level output for classification task. sequence_output (tensor): token-level output for sequence task. 
""" - electra = ElectraModel( - src_ids=input_ids, - position_ids=position_ids, - sentence_ids=segment_ids, - input_mask=input_mask, - config=self.electra_config, - use_fp16=False) + electra = ElectraModel(src_ids=input_ids, + position_ids=position_ids, + sentence_ids=segment_ids, + input_mask=input_mask, + config=self.electra_config, + use_fp16=False) pooled_output = electra.get_pooled_output() sequence_output = electra.get_sequence_output() return pooled_output, sequence_output diff --git a/modules/text/semantic_model/chinese_roberta_wwm_ext/README.md b/modules/text/language_model/chinese_roberta_wwm_ext/README.md similarity index 100% rename from modules/text/semantic_model/chinese_roberta_wwm_ext/README.md rename to modules/text/language_model/chinese_roberta_wwm_ext/README.md diff --git a/modules/text/semantic_model/chinese_roberta_wwm_ext/__init__.py b/modules/text/language_model/chinese_roberta_wwm_ext/__init__.py similarity index 100% rename from modules/text/semantic_model/chinese_roberta_wwm_ext/__init__.py rename to modules/text/language_model/chinese_roberta_wwm_ext/__init__.py diff --git a/modules/text/semantic_model/chinese_roberta_wwm_ext/model/__init__.py b/modules/text/language_model/chinese_roberta_wwm_ext/model/__init__.py similarity index 100% rename from modules/text/semantic_model/chinese_roberta_wwm_ext/model/__init__.py rename to modules/text/language_model/chinese_roberta_wwm_ext/model/__init__.py diff --git a/modules/text/semantic_model/chinese_roberta_wwm_ext/model/bert.py b/modules/text/language_model/chinese_roberta_wwm_ext/model/bert.py similarity index 50% rename from modules/text/semantic_model/chinese_roberta_wwm_ext/model/bert.py rename to modules/text/language_model/chinese_roberta_wwm_ext/model/bert.py index 29372adaf16a9717939cc07c4c135d27dba1dd61..f8cd18580ff61306a8b7547feafc0ff4a5440d47 100644 --- a/modules/text/semantic_model/chinese_roberta_wwm_ext/model/bert.py +++ b/modules/text/language_model/chinese_roberta_wwm_ext/model/bert.py @@ -74,23 +74,23 @@ class BertModel(object): def _build_model(self, src_ids, position_ids, sentence_ids, input_mask): # padding id in vocabulary must be set to 0 - emb_out = fluid.layers.embedding( - input=src_ids, - size=[self._voc_size, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), - is_sparse=False) - position_emb_out = fluid.layers.embedding( - input=position_ids, - size=[self._max_position_seq_len, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) - - sent_emb_out = fluid.layers.embedding( - sentence_ids, - size=[self._sent_types, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) + emb_out = fluid.layers.embedding(input=src_ids, + size=[self._voc_size, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._word_emb_name, + initializer=self._param_initializer), + is_sparse=False) + position_emb_out = fluid.layers.embedding(input=position_ids, + size=[self._max_position_seq_len, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._pos_emb_name, + initializer=self._param_initializer)) + + sent_emb_out = fluid.layers.embedding(sentence_ids, + size=[self._sent_types, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._sent_emb_name, + initializer=self._param_initializer)) emb_out = emb_out + 
position_emb_out emb_out = emb_out + sent_emb_out @@ -105,23 +105,22 @@ class BertModel(object): n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask.stop_gradient = True - self._enc_out = encoder( - enc_input=emb_out, - attn_bias=n_head_self_attn_mask, - n_layer=self._n_layer, - n_head=self._n_head, - d_key=self._emb_size // self._n_head, - d_value=self._emb_size // self._n_head, - d_model=self._emb_size, - d_inner_hid=self._emb_size * 4, - prepostprocess_dropout=self._prepostprocess_dropout, - attention_dropout=self._attention_dropout, - relu_dropout=0, - hidden_act=self._hidden_act, - preprocess_cmd="", - postprocess_cmd="dan", - param_initializer=self._param_initializer, - name='encoder') + self._enc_out = encoder(enc_input=emb_out, + attn_bias=n_head_self_attn_mask, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._emb_size // self._n_head, + d_value=self._emb_size // self._n_head, + d_model=self._emb_size, + d_inner_hid=self._emb_size * 4, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + hidden_act=self._hidden_act, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer, + name='encoder') def get_sequence_output(self): return self._enc_out @@ -130,12 +129,12 @@ class BertModel(object): """Get the first feature of each sequence for classification""" next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) - next_sent_feat = fluid.layers.fc( - input=next_sent_feat, - size=self._emb_size, - act="tanh", - param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), - bias_attr="pooled_fc.b_0") + next_sent_feat = fluid.layers.fc(input=next_sent_feat, + size=self._emb_size, + act="tanh", + param_attr=fluid.ParamAttr(name="pooled_fc.w_0", + initializer=self._param_initializer), + bias_attr="pooled_fc.b_0") return next_sent_feat def get_pretraining_output(self, mask_label, mask_pos, labels): @@ -150,43 +149,45 @@ class BertModel(object): mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) # transform: fc - mask_trans_feat = fluid.layers.fc( - input=mask_feat, - size=self._emb_size, - act=self._hidden_act, - param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer), - bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) + mask_trans_feat = fluid.layers.fc(input=mask_feat, + size=self._emb_size, + act=self._hidden_act, + param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', + initializer=self._param_initializer), + bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) # transform: layer norm mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans') - mask_lm_out_bias_attr = fluid.ParamAttr( - name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0)) + mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0", + initializer=fluid.initializer.Constant(value=0.0)) if self._weight_sharing: - fc_out = fluid.layers.matmul( - x=mask_trans_feat, - y=fluid.default_main_program().global_block().var(self._word_emb_name), - transpose_y=True) - fc_out += fluid.layers.create_parameter( - shape=[self._voc_size], dtype=self._dtype, attr=mask_lm_out_bias_attr, is_bias=True) + fc_out = fluid.layers.matmul(x=mask_trans_feat, + y=fluid.default_main_program().global_block().var(self._word_emb_name), + transpose_y=True) + fc_out += 
fluid.layers.create_parameter(shape=[self._voc_size], + dtype=self._dtype, + attr=mask_lm_out_bias_attr, + is_bias=True) else: - fc_out = fluid.layers.fc( - input=mask_trans_feat, - size=self._voc_size, - param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), - bias_attr=mask_lm_out_bias_attr) + fc_out = fluid.layers.fc(input=mask_trans_feat, + size=self._voc_size, + param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", + initializer=self._param_initializer), + bias_attr=mask_lm_out_bias_attr) mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) - next_sent_fc_out = fluid.layers.fc( - input=next_sent_feat, - size=2, - param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", initializer=self._param_initializer), - bias_attr="next_sent_fc.b_0") + next_sent_fc_out = fluid.layers.fc(input=next_sent_feat, + size=2, + param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", + initializer=self._param_initializer), + bias_attr="next_sent_fc.b_0") - next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy( - logits=next_sent_fc_out, label=labels, return_softmax=True) + next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out, + label=labels, + return_softmax=True) next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels) diff --git a/modules/text/language_model/chinese_roberta_wwm_ext/model/transformer_encoder.py b/modules/text/language_model/chinese_roberta_wwm_ext/model/transformer_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..b15d838883fdad1e3432e4ea8715f2320d67929f --- /dev/null +++ b/modules/text/language_model/chinese_roberta_wwm_ext/model/transformer_encoder.py @@ -0,0 +1,295 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Transformer encoder.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from functools import partial + +import paddle.fluid as fluid +import paddle.fluid.layers as layers + + +def multi_head_attention(queries, + keys, + values, + attn_bias, + d_key, + d_value, + d_model, + n_head=1, + dropout_rate=0., + cache=None, + param_initializer=None, + name='multi_head_att'): + """ + Multi-Head Attention. Note that attn_bias is added to the logit before + computing softmax activiation to mask certain selected positions so that + they will not considered in attention weights. + """ + keys = queries if keys is None else keys + values = keys if values is None else values + + if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): + raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") + + def __compute_qkv(queries, keys, values, n_head, d_key, d_value): + """ + Add linear projection to queries, keys, and values. 
+ """ + q = layers.fc(input=queries, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), + bias_attr=name + '_query_fc.b_0') + k = layers.fc(input=keys, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), + bias_attr=name + '_key_fc.b_0') + v = layers.fc(input=values, + size=d_value * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), + bias_attr=name + '_value_fc.b_0') + return q, k, v + + def __split_heads(x, n_head): + """ + Reshape the last dimension of inpunt tensor x so that it becomes two + dimensions and then transpose. Specifically, input a tensor with shape + [bs, max_sequence_length, n_head * hidden_dim] then output a tensor + with shape [bs, n_head, max_sequence_length, hidden_dim]. + """ + hidden_size = x.shape[-1] + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) + + # permuate the dimensions into: + # [batch_size, n_head, max_sequence_len, hidden_size_per_head] + return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) + + def __combine_heads(x): + """ + Transpose and then reshape the last two dimensions of inpunt tensor x + so that it becomes one dimension, which is reverse to __split_heads. + """ + if len(x.shape) == 3: return x + if len(x.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + + trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) + + def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): + """ + Scaled Dot-Product Attention + """ + scaled_q = layers.scale(x=q, scale=d_key**-0.5) + product = layers.matmul(x=scaled_q, y=k, transpose_y=True) + if attn_bias: + product += attn_bias + weights = layers.softmax(product) + if dropout_rate: + weights = layers.dropout(weights, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.matmul(weights, v) + return out + + q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) + + if cache is not None: # use cache and concat time steps + # Since the inplace reshape in __split_heads changes the shape of k and + # v, which is the cache input for next time step, reshape the cache + # input from the previous time step first. + k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) + v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) + + q = __split_heads(q, n_head) + k = __split_heads(k, n_head) + v = __split_heads(v, n_head) + + ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) + + out = __combine_heads(ctx_multiheads) + + # Project back to the model size. 
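At this point the per-head context vectors have been merged back to [bs, seq_len, d_model] and only the output projection remains. As a shape-level reference, a minimal NumPy sketch of the math the helpers above implement (softmax(Q·K^T / sqrt(d_key))·V per head); this is illustrative only and omits the FC projections, attn_bias and dropout:

    import numpy as np

    def toy_multi_head_attention(q, k, v, n_head):
        """Shape-level sketch of split-heads -> scaled dot-product -> combine-heads."""
        bs, seq_len, d_model = q.shape
        d_head = d_model // n_head

        def split_heads(x):
            # [bs, seq, d_model] -> [bs, n_head, seq, d_head]
            return x.reshape(bs, seq_len, n_head, d_head).transpose(0, 2, 1, 3)

        def combine_heads(x):
            # [bs, n_head, seq, d_head] -> [bs, seq, d_model]
            return x.transpose(0, 2, 1, 3).reshape(bs, seq_len, d_model)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        scores = np.matmul(q * d_head ** -0.5, k.transpose(0, 1, 3, 2))
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return combine_heads(np.matmul(weights, v))

    x = np.random.rand(2, 5, 8).astype("float32")
    print(toy_multi_head_attention(x, x, x, n_head=2).shape)  # (2, 5, 8)
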
+ proj_out = layers.fc(input=out, + size=d_model, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), + bias_attr=name + '_output_fc.b_0') + return proj_out + + +def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): + """ + Position-wise Feed-Forward Networks. + This module consists of two linear transformations with a ReLU activation + in between, which is applied to each position separately and identically. + """ + hidden = layers.fc(input=x, + size=d_inner_hid, + num_flatten_dims=2, + act=hidden_act, + param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), + bias_attr=name + '_fc_0.b_0') + if dropout_rate: + hidden = layers.dropout(hidden, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.fc(input=hidden, + size=d_hid, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), + bias_attr=name + '_fc_1.b_0') + return out + + +def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): + """ + Add residual connection, layer normalization and droput to the out tensor + optionally according to the value of process_cmd. + This will be used before or after multi-head attention and position-wise + feed-forward networks. + """ + for cmd in process_cmd: + if cmd == "a": # add residual connection + out = out + prev_out if prev_out else out + elif cmd == "n": # add layer normalization + out_dtype = out.dtype + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float32") + out = layers.layer_norm(out, + begin_norm_axis=len(out.shape) - 1, + param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', + initializer=fluid.initializer.Constant(0.))) + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float16") + elif cmd == "d": # add dropout + if dropout_rate: + out = layers.dropout(out, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + return out + + +pre_process_layer = partial(pre_post_process_layer, None) +post_process_layer = pre_post_process_layer + + +def encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """The encoder layers that can be stacked to form a deep encoder. + This module consits of a multi-head (self) attention followed by + position-wise feed-forward networks and both the two components companied + with the post_process_layer to add residual connection, layer normalization + and droput. 
+ """ + attn_output = multi_head_attention(pre_process_layer(enc_input, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_att'), + None, + None, + attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + param_initializer=param_initializer, + name=name + '_multi_head_att') + attn_output = post_process_layer(enc_input, + attn_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_att') + ffd_output = positionwise_feed_forward(pre_process_layer(attn_output, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_ffn'), + d_inner_hid, + d_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + '_ffn') + return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') + + +def encoder(enc_input, + attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """ + The encoder is composed of a stack of identical layers returned by calling + encoder_layer. + """ + for i in range(n_layer): + enc_output = encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_layer_' + str(i)) + enc_input = enc_output + enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") + + return enc_output diff --git a/modules/text/semantic_model/chinese_roberta_wwm_ext/module.py b/modules/text/language_model/chinese_roberta_wwm_ext/module.py similarity index 89% rename from modules/text/semantic_model/chinese_roberta_wwm_ext/module.py rename to modules/text/language_model/chinese_roberta_wwm_ext/module.py index d45f232b829972274a5a3077bd9e7c22655c6dcb..e2eed07a5b9e672ac4ba1b4385a0ae313fc0eade 100644 --- a/modules/text/semantic_model/chinese_roberta_wwm_ext/module.py +++ b/modules/text/language_model/chinese_roberta_wwm_ext/module.py @@ -58,13 +58,12 @@ class BertWwm(TransformerModule): pooled_output (tensor): sentence-level output for classification task. sequence_output (tensor): token-level output for sequence task. 
""" - bert = BertModel( - src_ids=input_ids, - position_ids=position_ids, - sentence_ids=segment_ids, - input_mask=input_mask, - config=self.bert_config, - use_fp16=False) + bert = BertModel(src_ids=input_ids, + position_ids=position_ids, + sentence_ids=segment_ids, + input_mask=input_mask, + config=self.bert_config, + use_fp16=False) pooled_output = bert.get_pooled_output() sequence_output = bert.get_sequence_output() return pooled_output, sequence_output diff --git a/modules/text/semantic_model/chinese_roberta_wwm_ext_large/README.md b/modules/text/language_model/chinese_roberta_wwm_ext_large/README.md similarity index 100% rename from modules/text/semantic_model/chinese_roberta_wwm_ext_large/README.md rename to modules/text/language_model/chinese_roberta_wwm_ext_large/README.md diff --git a/modules/text/semantic_model/chinese_roberta_wwm_ext_large/__init__.py b/modules/text/language_model/chinese_roberta_wwm_ext_large/__init__.py similarity index 100% rename from modules/text/semantic_model/chinese_roberta_wwm_ext_large/__init__.py rename to modules/text/language_model/chinese_roberta_wwm_ext_large/__init__.py diff --git a/modules/text/semantic_model/chinese_roberta_wwm_ext_large/model/__init__.py b/modules/text/language_model/chinese_roberta_wwm_ext_large/model/__init__.py similarity index 100% rename from modules/text/semantic_model/chinese_roberta_wwm_ext_large/model/__init__.py rename to modules/text/language_model/chinese_roberta_wwm_ext_large/model/__init__.py diff --git a/modules/text/semantic_model/chinese_roberta_wwm_ext_large/model/bert.py b/modules/text/language_model/chinese_roberta_wwm_ext_large/model/bert.py similarity index 50% rename from modules/text/semantic_model/chinese_roberta_wwm_ext_large/model/bert.py rename to modules/text/language_model/chinese_roberta_wwm_ext_large/model/bert.py index 782cfacbe7ecf5a44b205d77a48fed6190c498bd..0543b7dcf7c19bc5527e486c39542ea066201614 100644 --- a/modules/text/semantic_model/chinese_roberta_wwm_ext_large/model/bert.py +++ b/modules/text/language_model/chinese_roberta_wwm_ext_large/model/bert.py @@ -74,23 +74,23 @@ class BertModel(object): def _build_model(self, src_ids, position_ids, sentence_ids, input_mask): # padding id in vocabulary must be set to 0 - emb_out = fluid.layers.embedding( - input=src_ids, - size=[self._voc_size, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), - is_sparse=False) - position_emb_out = fluid.layers.embedding( - input=position_ids, - size=[self._max_position_seq_len, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) - - sent_emb_out = fluid.layers.embedding( - sentence_ids, - size=[self._sent_types, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) + emb_out = fluid.layers.embedding(input=src_ids, + size=[self._voc_size, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._word_emb_name, + initializer=self._param_initializer), + is_sparse=False) + position_emb_out = fluid.layers.embedding(input=position_ids, + size=[self._max_position_seq_len, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._pos_emb_name, + initializer=self._param_initializer)) + + sent_emb_out = fluid.layers.embedding(sentence_ids, + size=[self._sent_types, self._emb_size], + dtype=self._dtype, + 
param_attr=fluid.ParamAttr(name=self._sent_emb_name, + initializer=self._param_initializer)) emb_out = emb_out + position_emb_out emb_out = emb_out + sent_emb_out @@ -105,23 +105,22 @@ class BertModel(object): n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask.stop_gradient = True - self._enc_out = encoder( - enc_input=emb_out, - attn_bias=n_head_self_attn_mask, - n_layer=self._n_layer, - n_head=self._n_head, - d_key=self._emb_size // self._n_head, - d_value=self._emb_size // self._n_head, - d_model=self._emb_size, - d_inner_hid=self._emb_size * 4, - prepostprocess_dropout=self._prepostprocess_dropout, - attention_dropout=self._attention_dropout, - relu_dropout=0, - hidden_act=self._hidden_act, - preprocess_cmd="", - postprocess_cmd="dan", - param_initializer=self._param_initializer, - name='encoder') + self._enc_out = encoder(enc_input=emb_out, + attn_bias=n_head_self_attn_mask, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._emb_size // self._n_head, + d_value=self._emb_size // self._n_head, + d_model=self._emb_size, + d_inner_hid=self._emb_size * 4, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + hidden_act=self._hidden_act, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer, + name='encoder') def get_sequence_output(self): return self._enc_out @@ -130,12 +129,12 @@ class BertModel(object): """Get the first feature of each sequence for classification""" next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) - next_sent_feat = fluid.layers.fc( - input=next_sent_feat, - size=self._emb_size, - act="tanh", - param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), - bias_attr="pooled_fc.b_0") + next_sent_feat = fluid.layers.fc(input=next_sent_feat, + size=self._emb_size, + act="tanh", + param_attr=fluid.ParamAttr(name="pooled_fc.w_0", + initializer=self._param_initializer), + bias_attr="pooled_fc.b_0") return next_sent_feat def get_pretraining_output(self, mask_label, mask_pos, labels): @@ -150,43 +149,45 @@ class BertModel(object): mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) # transform: fc - mask_trans_feat = fluid.layers.fc( - input=mask_feat, - size=self._emb_size, - act=self._hidden_act, - param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer), - bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) + mask_trans_feat = fluid.layers.fc(input=mask_feat, + size=self._emb_size, + act=self._hidden_act, + param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', + initializer=self._param_initializer), + bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) # transform: layer norm mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans') - mask_lm_out_bias_attr = fluid.ParamAttr( - name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0)) + mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0", + initializer=fluid.initializer.Constant(value=0.0)) if self._weight_sharing: - fc_out = fluid.layers.matmul( - x=mask_trans_feat, - y=fluid.default_main_program().global_block().var(self._word_emb_name), - transpose_y=True) - fc_out += fluid.layers.create_parameter( - shape=[self._voc_size], dtype=self._dtype, attr=mask_lm_out_bias_attr, is_bias=True) + fc_out = fluid.layers.matmul(x=mask_trans_feat, + 
y=fluid.default_main_program().global_block().var(self._word_emb_name), + transpose_y=True) + fc_out += fluid.layers.create_parameter(shape=[self._voc_size], + dtype=self._dtype, + attr=mask_lm_out_bias_attr, + is_bias=True) else: - fc_out = fluid.layers.fc( - input=mask_trans_feat, - size=self._voc_size, - param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), - bias_attr=mask_lm_out_bias_attr) + fc_out = fluid.layers.fc(input=mask_trans_feat, + size=self._voc_size, + param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", + initializer=self._param_initializer), + bias_attr=mask_lm_out_bias_attr) mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) - next_sent_fc_out = fluid.layers.fc( - input=next_sent_feat, - size=2, - param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", initializer=self._param_initializer), - bias_attr="next_sent_fc.b_0") + next_sent_fc_out = fluid.layers.fc(input=next_sent_feat, + size=2, + param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", + initializer=self._param_initializer), + bias_attr="next_sent_fc.b_0") - next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy( - logits=next_sent_fc_out, label=labels, return_softmax=True) + next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out, + label=labels, + return_softmax=True) next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels) diff --git a/modules/text/language_model/chinese_roberta_wwm_ext_large/model/transformer_encoder.py b/modules/text/language_model/chinese_roberta_wwm_ext_large/model/transformer_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..b15d838883fdad1e3432e4ea8715f2320d67929f --- /dev/null +++ b/modules/text/language_model/chinese_roberta_wwm_ext_large/model/transformer_encoder.py @@ -0,0 +1,295 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Transformer encoder.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from functools import partial + +import paddle.fluid as fluid +import paddle.fluid.layers as layers + + +def multi_head_attention(queries, + keys, + values, + attn_bias, + d_key, + d_value, + d_model, + n_head=1, + dropout_rate=0., + cache=None, + param_initializer=None, + name='multi_head_att'): + """ + Multi-Head Attention. Note that attn_bias is added to the logit before + computing softmax activiation to mask certain selected positions so that + they will not considered in attention weights. 
+ """ + keys = queries if keys is None else keys + values = keys if values is None else values + + if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): + raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") + + def __compute_qkv(queries, keys, values, n_head, d_key, d_value): + """ + Add linear projection to queries, keys, and values. + """ + q = layers.fc(input=queries, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), + bias_attr=name + '_query_fc.b_0') + k = layers.fc(input=keys, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), + bias_attr=name + '_key_fc.b_0') + v = layers.fc(input=values, + size=d_value * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), + bias_attr=name + '_value_fc.b_0') + return q, k, v + + def __split_heads(x, n_head): + """ + Reshape the last dimension of inpunt tensor x so that it becomes two + dimensions and then transpose. Specifically, input a tensor with shape + [bs, max_sequence_length, n_head * hidden_dim] then output a tensor + with shape [bs, n_head, max_sequence_length, hidden_dim]. + """ + hidden_size = x.shape[-1] + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) + + # permuate the dimensions into: + # [batch_size, n_head, max_sequence_len, hidden_size_per_head] + return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) + + def __combine_heads(x): + """ + Transpose and then reshape the last two dimensions of inpunt tensor x + so that it becomes one dimension, which is reverse to __split_heads. + """ + if len(x.shape) == 3: return x + if len(x.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + + trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) + + def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): + """ + Scaled Dot-Product Attention + """ + scaled_q = layers.scale(x=q, scale=d_key**-0.5) + product = layers.matmul(x=scaled_q, y=k, transpose_y=True) + if attn_bias: + product += attn_bias + weights = layers.softmax(product) + if dropout_rate: + weights = layers.dropout(weights, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.matmul(weights, v) + return out + + q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) + + if cache is not None: # use cache and concat time steps + # Since the inplace reshape in __split_heads changes the shape of k and + # v, which is the cache input for next time step, reshape the cache + # input from the previous time step first. 
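For reference, the cache mechanics in the layers.concat(...) calls below boil down to appending the current step's keys and values along the time axis, so that the next step attends over all previous positions. A small NumPy analogue with assumed toy shapes (batch 2, 4 cached steps, d_model 8); this is not code from the patch:

    import numpy as np

    # Toy cache: keys/values for 4 already-processed time steps.
    cache = {"k": np.zeros((2, 4, 8), dtype="float32"),
             "v": np.zeros((2, 4, 8), dtype="float32")}

    # Keys/values produced for the current single time step.
    k_step = np.ones((2, 1, 8), dtype="float32")
    v_step = np.ones((2, 1, 8), dtype="float32")

    # Grow the cache along the sequence axis, mirroring the concat below.
    cache["k"] = np.concatenate([cache["k"], k_step], axis=1)
    cache["v"] = np.concatenate([cache["v"], v_step], axis=1)
    print(cache["k"].shape)  # (2, 5, 8)
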
+ k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) + v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) + + q = __split_heads(q, n_head) + k = __split_heads(k, n_head) + v = __split_heads(v, n_head) + + ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) + + out = __combine_heads(ctx_multiheads) + + # Project back to the model size. + proj_out = layers.fc(input=out, + size=d_model, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), + bias_attr=name + '_output_fc.b_0') + return proj_out + + +def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): + """ + Position-wise Feed-Forward Networks. + This module consists of two linear transformations with a ReLU activation + in between, which is applied to each position separately and identically. + """ + hidden = layers.fc(input=x, + size=d_inner_hid, + num_flatten_dims=2, + act=hidden_act, + param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), + bias_attr=name + '_fc_0.b_0') + if dropout_rate: + hidden = layers.dropout(hidden, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.fc(input=hidden, + size=d_hid, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), + bias_attr=name + '_fc_1.b_0') + return out + + +def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): + """ + Add residual connection, layer normalization and droput to the out tensor + optionally according to the value of process_cmd. + This will be used before or after multi-head attention and position-wise + feed-forward networks. + """ + for cmd in process_cmd: + if cmd == "a": # add residual connection + out = out + prev_out if prev_out else out + elif cmd == "n": # add layer normalization + out_dtype = out.dtype + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float32") + out = layers.layer_norm(out, + begin_norm_axis=len(out.shape) - 1, + param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', + initializer=fluid.initializer.Constant(0.))) + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float16") + elif cmd == "d": # add dropout + if dropout_rate: + out = layers.dropout(out, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + return out + + +pre_process_layer = partial(pre_post_process_layer, None) +post_process_layer = pre_post_process_layer + + +def encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """The encoder layers that can be stacked to form a deep encoder. + This module consits of a multi-head (self) attention followed by + position-wise feed-forward networks and both the two components companied + with the post_process_layer to add residual connection, layer normalization + and droput. 
+ """ + attn_output = multi_head_attention(pre_process_layer(enc_input, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_att'), + None, + None, + attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + param_initializer=param_initializer, + name=name + '_multi_head_att') + attn_output = post_process_layer(enc_input, + attn_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_att') + ffd_output = positionwise_feed_forward(pre_process_layer(attn_output, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_ffn'), + d_inner_hid, + d_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + '_ffn') + return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') + + +def encoder(enc_input, + attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """ + The encoder is composed of a stack of identical layers returned by calling + encoder_layer. + """ + for i in range(n_layer): + enc_output = encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_layer_' + str(i)) + enc_input = enc_output + enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") + + return enc_output diff --git a/modules/text/semantic_model/chinese_roberta_wwm_ext_large/module.py b/modules/text/language_model/chinese_roberta_wwm_ext_large/module.py similarity index 89% rename from modules/text/semantic_model/chinese_roberta_wwm_ext_large/module.py rename to modules/text/language_model/chinese_roberta_wwm_ext_large/module.py index ec67f8739c5abe1c42b1e165d9b17c66a6ae9a50..19c25dbadda88f58eb4b6a23e88882d578c81b1f 100644 --- a/modules/text/semantic_model/chinese_roberta_wwm_ext_large/module.py +++ b/modules/text/language_model/chinese_roberta_wwm_ext_large/module.py @@ -58,13 +58,12 @@ class BertWwm(TransformerModule): pooled_output (tensor): sentence-level output for classification task. sequence_output (tensor): token-level output for sequence task. 
""" - bert = BertModel( - src_ids=input_ids, - position_ids=position_ids, - sentence_ids=segment_ids, - input_mask=input_mask, - config=self.bert_config, - use_fp16=False) + bert = BertModel(src_ids=input_ids, + position_ids=position_ids, + sentence_ids=segment_ids, + input_mask=input_mask, + config=self.bert_config, + use_fp16=False) pooled_output = bert.get_pooled_output() sequence_output = bert.get_sequence_output() return pooled_output, sequence_output diff --git a/modules/text/semantic_model/ernie/README.md b/modules/text/language_model/ernie/README.md similarity index 100% rename from modules/text/semantic_model/ernie/README.md rename to modules/text/language_model/ernie/README.md diff --git a/modules/text/semantic_model/ernie/__init__.py b/modules/text/language_model/ernie/__init__.py similarity index 100% rename from modules/text/semantic_model/ernie/__init__.py rename to modules/text/language_model/ernie/__init__.py diff --git a/modules/text/semantic_model/ernie/model/__init__.py b/modules/text/language_model/ernie/model/__init__.py similarity index 100% rename from modules/text/semantic_model/ernie/model/__init__.py rename to modules/text/language_model/ernie/model/__init__.py diff --git a/modules/text/semantic_model/ernie/model/ernie.py b/modules/text/language_model/ernie/model/ernie.py similarity index 51% rename from modules/text/semantic_model/ernie/model/ernie.py rename to modules/text/language_model/ernie/model/ernie.py index 88af7b9e64e4b92d796f293215c5481b5e359bb1..2f3e845b9a82b1a56f1e4dc5915ffb603c64e869 100644 --- a/modules/text/semantic_model/ernie/model/ernie.py +++ b/modules/text/language_model/ernie/model/ernie.py @@ -78,23 +78,23 @@ class ErnieModel(object): def _build_model(self, src_ids, position_ids, sentence_ids, input_mask): # padding id in vocabulary must be set to 0 - emb_out = fluid.layers.embedding( - input=src_ids, - size=[self._voc_size, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), - is_sparse=False) - position_emb_out = fluid.layers.embedding( - input=position_ids, - size=[self._max_position_seq_len, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) - - sent_emb_out = fluid.layers.embedding( - sentence_ids, - size=[self._sent_types, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) + emb_out = fluid.layers.embedding(input=src_ids, + size=[self._voc_size, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._word_emb_name, + initializer=self._param_initializer), + is_sparse=False) + position_emb_out = fluid.layers.embedding(input=position_ids, + size=[self._max_position_seq_len, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._pos_emb_name, + initializer=self._param_initializer)) + + sent_emb_out = fluid.layers.embedding(sentence_ids, + size=[self._sent_types, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._sent_emb_name, + initializer=self._param_initializer)) emb_out = emb_out + position_emb_out emb_out = emb_out + sent_emb_out @@ -109,23 +109,22 @@ class ErnieModel(object): n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask.stop_gradient = True - self._enc_out = encoder( - enc_input=emb_out, - attn_bias=n_head_self_attn_mask, - n_layer=self._n_layer, - 
n_head=self._n_head, - d_key=self._emb_size // self._n_head, - d_value=self._emb_size // self._n_head, - d_model=self._emb_size, - d_inner_hid=self._emb_size * 4, - prepostprocess_dropout=self._prepostprocess_dropout, - attention_dropout=self._attention_dropout, - relu_dropout=0, - hidden_act=self._hidden_act, - preprocess_cmd="", - postprocess_cmd="dan", - param_initializer=self._param_initializer, - name='encoder') + self._enc_out = encoder(enc_input=emb_out, + attn_bias=n_head_self_attn_mask, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._emb_size // self._n_head, + d_value=self._emb_size // self._n_head, + d_model=self._emb_size, + d_inner_hid=self._emb_size * 4, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + hidden_act=self._hidden_act, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer, + name='encoder') def get_sequence_output(self): return self._enc_out @@ -133,12 +132,12 @@ class ErnieModel(object): def get_pooled_output(self): """Get the first feature of each sequence for classification""" next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) - next_sent_feat = fluid.layers.fc( - input=next_sent_feat, - size=self._emb_size, - act="tanh", - param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), - bias_attr="pooled_fc.b_0") + next_sent_feat = fluid.layers.fc(input=next_sent_feat, + size=self._emb_size, + act="tanh", + param_attr=fluid.ParamAttr(name="pooled_fc.w_0", + initializer=self._param_initializer), + bias_attr="pooled_fc.b_0") return next_sent_feat def get_pretraining_output(self, mask_label, mask_pos, labels): @@ -153,43 +152,45 @@ class ErnieModel(object): mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) # transform: fc - mask_trans_feat = fluid.layers.fc( - input=mask_feat, - size=self._emb_size, - act=self._hidden_act, - param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer), - bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) + mask_trans_feat = fluid.layers.fc(input=mask_feat, + size=self._emb_size, + act=self._hidden_act, + param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', + initializer=self._param_initializer), + bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) # transform: layer norm mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans') - mask_lm_out_bias_attr = fluid.ParamAttr( - name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0)) + mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0", + initializer=fluid.initializer.Constant(value=0.0)) if self._weight_sharing: - fc_out = fluid.layers.matmul( - x=mask_trans_feat, - y=fluid.default_main_program().global_block().var(self._word_emb_name), - transpose_y=True) - fc_out += fluid.layers.create_parameter( - shape=[self._voc_size], dtype=self._dtype, attr=mask_lm_out_bias_attr, is_bias=True) + fc_out = fluid.layers.matmul(x=mask_trans_feat, + y=fluid.default_main_program().global_block().var(self._word_emb_name), + transpose_y=True) + fc_out += fluid.layers.create_parameter(shape=[self._voc_size], + dtype=self._dtype, + attr=mask_lm_out_bias_attr, + is_bias=True) else: - fc_out = fluid.layers.fc( - input=mask_trans_feat, - size=self._voc_size, - param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), - bias_attr=mask_lm_out_bias_attr) + fc_out = 
fluid.layers.fc(input=mask_trans_feat, + size=self._voc_size, + param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", + initializer=self._param_initializer), + bias_attr=mask_lm_out_bias_attr) mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) - next_sent_fc_out = fluid.layers.fc( - input=next_sent_feat, - size=2, - param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", initializer=self._param_initializer), - bias_attr="next_sent_fc.b_0") + next_sent_fc_out = fluid.layers.fc(input=next_sent_feat, + size=2, + param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", + initializer=self._param_initializer), + bias_attr="next_sent_fc.b_0") - next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy( - logits=next_sent_fc_out, label=labels, return_softmax=True) + next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out, + label=labels, + return_softmax=True) next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels) diff --git a/modules/text/language_model/ernie/model/transformer_encoder.py b/modules/text/language_model/ernie/model/transformer_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..b15d838883fdad1e3432e4ea8715f2320d67929f --- /dev/null +++ b/modules/text/language_model/ernie/model/transformer_encoder.py @@ -0,0 +1,295 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Transformer encoder.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from functools import partial + +import paddle.fluid as fluid +import paddle.fluid.layers as layers + + +def multi_head_attention(queries, + keys, + values, + attn_bias, + d_key, + d_value, + d_model, + n_head=1, + dropout_rate=0., + cache=None, + param_initializer=None, + name='multi_head_att'): + """ + Multi-Head Attention. Note that attn_bias is added to the logit before + computing softmax activiation to mask certain selected positions so that + they will not considered in attention weights. + """ + keys = queries if keys is None else keys + values = keys if values is None else values + + if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): + raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") + + def __compute_qkv(queries, keys, values, n_head, d_key, d_value): + """ + Add linear projection to queries, keys, and values. 
+ """ + q = layers.fc(input=queries, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), + bias_attr=name + '_query_fc.b_0') + k = layers.fc(input=keys, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), + bias_attr=name + '_key_fc.b_0') + v = layers.fc(input=values, + size=d_value * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), + bias_attr=name + '_value_fc.b_0') + return q, k, v + + def __split_heads(x, n_head): + """ + Reshape the last dimension of inpunt tensor x so that it becomes two + dimensions and then transpose. Specifically, input a tensor with shape + [bs, max_sequence_length, n_head * hidden_dim] then output a tensor + with shape [bs, n_head, max_sequence_length, hidden_dim]. + """ + hidden_size = x.shape[-1] + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) + + # permuate the dimensions into: + # [batch_size, n_head, max_sequence_len, hidden_size_per_head] + return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) + + def __combine_heads(x): + """ + Transpose and then reshape the last two dimensions of inpunt tensor x + so that it becomes one dimension, which is reverse to __split_heads. + """ + if len(x.shape) == 3: return x + if len(x.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + + trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) + + def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): + """ + Scaled Dot-Product Attention + """ + scaled_q = layers.scale(x=q, scale=d_key**-0.5) + product = layers.matmul(x=scaled_q, y=k, transpose_y=True) + if attn_bias: + product += attn_bias + weights = layers.softmax(product) + if dropout_rate: + weights = layers.dropout(weights, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.matmul(weights, v) + return out + + q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) + + if cache is not None: # use cache and concat time steps + # Since the inplace reshape in __split_heads changes the shape of k and + # v, which is the cache input for next time step, reshape the cache + # input from the previous time step first. + k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) + v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) + + q = __split_heads(q, n_head) + k = __split_heads(k, n_head) + v = __split_heads(v, n_head) + + ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) + + out = __combine_heads(ctx_multiheads) + + # Project back to the model size. 
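The attention-weight dropout above uses the "upscale_in_train" implementation, which rescales the surviving activations by 1/(1-p) at training time so that inference is a plain identity. A minimal sketch of that behaviour (illustrative only, not the fluid implementation):

    import numpy as np

    def dropout_upscale_in_train(x, p, is_test, rng=np.random):
        """Sketch of "upscale_in_train": scale kept units by 1/(1-p) in training,
        return the input unchanged at inference."""
        if is_test or p == 0.0:
            return x
        mask = (rng.rand(*x.shape) >= p).astype(x.dtype)
        return x * mask / (1.0 - p)

    w = np.full((4, 4), 0.25, dtype="float32")
    print(dropout_upscale_in_train(w, 0.1, is_test=True))   # unchanged at inference
    print(dropout_upscale_in_train(w, 0.1, is_test=False))  # zeros, 0.25/0.9 elsewhere
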
+ proj_out = layers.fc(input=out, + size=d_model, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), + bias_attr=name + '_output_fc.b_0') + return proj_out + + +def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): + """ + Position-wise Feed-Forward Networks. + This module consists of two linear transformations with a ReLU activation + in between, which is applied to each position separately and identically. + """ + hidden = layers.fc(input=x, + size=d_inner_hid, + num_flatten_dims=2, + act=hidden_act, + param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), + bias_attr=name + '_fc_0.b_0') + if dropout_rate: + hidden = layers.dropout(hidden, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.fc(input=hidden, + size=d_hid, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), + bias_attr=name + '_fc_1.b_0') + return out + + +def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): + """ + Add residual connection, layer normalization and droput to the out tensor + optionally according to the value of process_cmd. + This will be used before or after multi-head attention and position-wise + feed-forward networks. + """ + for cmd in process_cmd: + if cmd == "a": # add residual connection + out = out + prev_out if prev_out else out + elif cmd == "n": # add layer normalization + out_dtype = out.dtype + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float32") + out = layers.layer_norm(out, + begin_norm_axis=len(out.shape) - 1, + param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', + initializer=fluid.initializer.Constant(0.))) + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float16") + elif cmd == "d": # add dropout + if dropout_rate: + out = layers.dropout(out, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + return out + + +pre_process_layer = partial(pre_post_process_layer, None) +post_process_layer = pre_post_process_layer + + +def encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """The encoder layers that can be stacked to form a deep encoder. + This module consits of a multi-head (self) attention followed by + position-wise feed-forward networks and both the two components companied + with the post_process_layer to add residual connection, layer normalization + and droput. 
+ """ + attn_output = multi_head_attention(pre_process_layer(enc_input, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_att'), + None, + None, + attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + param_initializer=param_initializer, + name=name + '_multi_head_att') + attn_output = post_process_layer(enc_input, + attn_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_att') + ffd_output = positionwise_feed_forward(pre_process_layer(attn_output, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_ffn'), + d_inner_hid, + d_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + '_ffn') + return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') + + +def encoder(enc_input, + attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """ + The encoder is composed of a stack of identical layers returned by calling + encoder_layer. + """ + for i in range(n_layer): + enc_output = encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_layer_' + str(i)) + enc_input = enc_output + enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") + + return enc_output diff --git a/modules/text/semantic_model/ernie/module.py b/modules/text/language_model/ernie/module.py similarity index 89% rename from modules/text/semantic_model/ernie/module.py rename to modules/text/language_model/ernie/module.py index 03167a5db24419b912b782df3bf70a79d28316a2..deb074ee79614e55442d262cd4a603309a94369b 100644 --- a/modules/text/semantic_model/ernie/module.py +++ b/modules/text/language_model/ernie/module.py @@ -58,13 +58,12 @@ class Ernie(TransformerModule): sequence_output (tensor): token-level output for sequence task. 
""" self.ernie_config._config_dict['use_task_id'] = False - ernie = ErnieModel( - src_ids=input_ids, - position_ids=position_ids, - sentence_ids=segment_ids, - input_mask=input_mask, - config=self.ernie_config, - use_fp16=False) + ernie = ErnieModel(src_ids=input_ids, + position_ids=position_ids, + sentence_ids=segment_ids, + input_mask=input_mask, + config=self.ernie_config, + use_fp16=False) pooled_output = ernie.get_pooled_output() sequence_output = ernie.get_sequence_output() return pooled_output, sequence_output diff --git a/modules/text/semantic_model/ernie_tiny/README.md b/modules/text/language_model/ernie_tiny/README.md similarity index 100% rename from modules/text/semantic_model/ernie_tiny/README.md rename to modules/text/language_model/ernie_tiny/README.md diff --git a/modules/text/semantic_model/ernie_tiny/__init__.py b/modules/text/language_model/ernie_tiny/__init__.py similarity index 100% rename from modules/text/semantic_model/ernie_tiny/__init__.py rename to modules/text/language_model/ernie_tiny/__init__.py diff --git a/modules/text/semantic_model/ernie_tiny/model/__init__.py b/modules/text/language_model/ernie_tiny/model/__init__.py similarity index 100% rename from modules/text/semantic_model/ernie_tiny/model/__init__.py rename to modules/text/language_model/ernie_tiny/model/__init__.py diff --git a/modules/text/semantic_model/ernie_tiny/model/ernie.py b/modules/text/language_model/ernie_tiny/model/ernie.py similarity index 50% rename from modules/text/semantic_model/ernie_tiny/model/ernie.py rename to modules/text/language_model/ernie_tiny/model/ernie.py index 6fa0281f4bfdb99356017fcf5e7a29f1e1edd2a7..5b323f592e68f05bc8d650c9c2c5011e714a8d8e 100644 --- a/modules/text/semantic_model/ernie_tiny/model/ernie.py +++ b/modules/text/language_model/ernie_tiny/model/ernie.py @@ -95,34 +95,34 @@ class ErnieModel(object): def _build_model(self, src_ids, position_ids, sentence_ids, task_ids, input_mask): # padding id in vocabulary must be set to 0 - emb_out = fluid.layers.embedding( - input=src_ids, - size=[self._voc_size, self._emb_size], - dtype=self._emb_dtype, - param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), - is_sparse=False) - - position_emb_out = fluid.layers.embedding( - input=position_ids, - size=[self._max_position_seq_len, self._emb_size], - dtype=self._emb_dtype, - param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) - - sent_emb_out = fluid.layers.embedding( - sentence_ids, - size=[self._sent_types, self._emb_size], - dtype=self._emb_dtype, - param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) + emb_out = fluid.layers.embedding(input=src_ids, + size=[self._voc_size, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr(name=self._word_emb_name, + initializer=self._param_initializer), + is_sparse=False) + + position_emb_out = fluid.layers.embedding(input=position_ids, + size=[self._max_position_seq_len, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr(name=self._pos_emb_name, + initializer=self._param_initializer)) + + sent_emb_out = fluid.layers.embedding(sentence_ids, + size=[self._sent_types, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr(name=self._sent_emb_name, + initializer=self._param_initializer)) emb_out = emb_out + position_emb_out emb_out = emb_out + sent_emb_out if self._use_task_id: - task_emb_out = fluid.layers.embedding( - task_ids, - size=[self._task_types, 
self._emb_size], - dtype=self._emb_dtype, - param_attr=fluid.ParamAttr(name=self._task_emb_name, initializer=self._param_initializer)) + task_emb_out = fluid.layers.embedding(task_ids, + size=[self._task_types, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr(name=self._task_emb_name, + initializer=self._param_initializer)) emb_out = emb_out + task_emb_out @@ -137,23 +137,22 @@ class ErnieModel(object): n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask.stop_gradient = True - self._enc_out = encoder( - enc_input=emb_out, - attn_bias=n_head_self_attn_mask, - n_layer=self._n_layer, - n_head=self._n_head, - d_key=self._emb_size // self._n_head, - d_value=self._emb_size // self._n_head, - d_model=self._emb_size, - d_inner_hid=self._emb_size * 4, - prepostprocess_dropout=self._prepostprocess_dropout, - attention_dropout=self._attention_dropout, - relu_dropout=0, - hidden_act=self._hidden_act, - preprocess_cmd="", - postprocess_cmd="dan", - param_initializer=self._param_initializer, - name='encoder') + self._enc_out = encoder(enc_input=emb_out, + attn_bias=n_head_self_attn_mask, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._emb_size // self._n_head, + d_value=self._emb_size // self._n_head, + d_model=self._emb_size, + d_inner_hid=self._emb_size * 4, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + hidden_act=self._hidden_act, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer, + name='encoder') if self._dtype == "float16": self._enc_out = fluid.layers.cast(x=self._enc_out, dtype=self._emb_dtype) @@ -163,12 +162,12 @@ class ErnieModel(object): def get_pooled_output(self): """Get the first feature of each sequence for classification""" next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) - next_sent_feat = fluid.layers.fc( - input=next_sent_feat, - size=self._emb_size, - act="tanh", - param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), - bias_attr="pooled_fc.b_0") + next_sent_feat = fluid.layers.fc(input=next_sent_feat, + size=self._emb_size, + act="tanh", + param_attr=fluid.ParamAttr(name="pooled_fc.w_0", + initializer=self._param_initializer), + bias_attr="pooled_fc.b_0") return next_sent_feat def get_lm_output(self, mask_label, mask_pos): @@ -183,40 +182,42 @@ class ErnieModel(object): mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) # transform: fc - mask_trans_feat = fluid.layers.fc( - input=mask_feat, - size=self._emb_size, - act=self._hidden_act, - param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer), - bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) + mask_trans_feat = fluid.layers.fc(input=mask_feat, + size=self._emb_size, + act=self._hidden_act, + param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', + initializer=self._param_initializer), + bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) # transform: layer norm - mask_trans_feat = fluid.layers.layer_norm( - mask_trans_feat, - begin_norm_axis=len(mask_trans_feat.shape) - 1, - param_attr=fluid.ParamAttr( - name='mask_lm_trans_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), - bias_attr=fluid.ParamAttr(name='mask_lm_trans_layer_norm_bias', initializer=fluid.initializer.Constant(1.))) + mask_trans_feat = fluid.layers.layer_norm(mask_trans_feat, + 
begin_norm_axis=len(mask_trans_feat.shape) - 1, + param_attr=fluid.ParamAttr( + name='mask_lm_trans_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr(name='mask_lm_trans_layer_norm_bias', + initializer=fluid.initializer.Constant(1.))) # transform: layer norm #mask_trans_feat = pre_process_layer( # mask_trans_feat, 'n', name='mask_lm_trans') - mask_lm_out_bias_attr = fluid.ParamAttr( - name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0)) + mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0", + initializer=fluid.initializer.Constant(value=0.0)) if self._weight_sharing: - fc_out = fluid.layers.matmul( - x=mask_trans_feat, - y=fluid.default_main_program().global_block().var(self._word_emb_name), - transpose_y=True) - fc_out += fluid.layers.create_parameter( - shape=[self._voc_size], dtype=self._emb_dtype, attr=mask_lm_out_bias_attr, is_bias=True) + fc_out = fluid.layers.matmul(x=mask_trans_feat, + y=fluid.default_main_program().global_block().var(self._word_emb_name), + transpose_y=True) + fc_out += fluid.layers.create_parameter(shape=[self._voc_size], + dtype=self._emb_dtype, + attr=mask_lm_out_bias_attr, + is_bias=True) else: - fc_out = fluid.layers.fc( - input=mask_trans_feat, - size=self._voc_size, - param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), - bias_attr=mask_lm_out_bias_attr) + fc_out = fluid.layers.fc(input=mask_trans_feat, + size=self._voc_size, + param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", + initializer=self._param_initializer), + bias_attr=mask_lm_out_bias_attr) mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) @@ -224,13 +225,14 @@ class ErnieModel(object): return mean_mask_lm_loss def get_task_output(self, task, task_labels): - task_fc_out = fluid.layers.fc( - input=self.next_sent_feat, - size=task["num_labels"], - param_attr=fluid.ParamAttr(name=task["task_name"] + "_fc.w_0", initializer=self._param_initializer), - bias_attr=task["task_name"] + "_fc.b_0") - task_loss, task_softmax = fluid.layers.softmax_with_cross_entropy( - logits=task_fc_out, label=task_labels, return_softmax=True) + task_fc_out = fluid.layers.fc(input=self.next_sent_feat, + size=task["num_labels"], + param_attr=fluid.ParamAttr(name=task["task_name"] + "_fc.w_0", + initializer=self._param_initializer), + bias_attr=task["task_name"] + "_fc.b_0") + task_loss, task_softmax = fluid.layers.softmax_with_cross_entropy(logits=task_fc_out, + label=task_labels, + return_softmax=True) task_acc = fluid.layers.accuracy(input=task_softmax, label=task_labels) mean_task_loss = fluid.layers.mean(task_loss) return mean_task_loss, task_acc diff --git a/modules/text/language_model/ernie_tiny/model/transformer_encoder.py b/modules/text/language_model/ernie_tiny/model/transformer_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..b15d838883fdad1e3432e4ea8715f2320d67929f --- /dev/null +++ b/modules/text/language_model/ernie_tiny/model/transformer_encoder.py @@ -0,0 +1,295 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Transformer encoder.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from functools import partial + +import paddle.fluid as fluid +import paddle.fluid.layers as layers + + +def multi_head_attention(queries, + keys, + values, + attn_bias, + d_key, + d_value, + d_model, + n_head=1, + dropout_rate=0., + cache=None, + param_initializer=None, + name='multi_head_att'): + """ + Multi-Head Attention. Note that attn_bias is added to the logit before + computing softmax activiation to mask certain selected positions so that + they will not considered in attention weights. + """ + keys = queries if keys is None else keys + values = keys if values is None else values + + if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): + raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") + + def __compute_qkv(queries, keys, values, n_head, d_key, d_value): + """ + Add linear projection to queries, keys, and values. + """ + q = layers.fc(input=queries, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), + bias_attr=name + '_query_fc.b_0') + k = layers.fc(input=keys, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), + bias_attr=name + '_key_fc.b_0') + v = layers.fc(input=values, + size=d_value * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), + bias_attr=name + '_value_fc.b_0') + return q, k, v + + def __split_heads(x, n_head): + """ + Reshape the last dimension of inpunt tensor x so that it becomes two + dimensions and then transpose. Specifically, input a tensor with shape + [bs, max_sequence_length, n_head * hidden_dim] then output a tensor + with shape [bs, n_head, max_sequence_length, hidden_dim]. + """ + hidden_size = x.shape[-1] + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) + + # permuate the dimensions into: + # [batch_size, n_head, max_sequence_len, hidden_size_per_head] + return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) + + def __combine_heads(x): + """ + Transpose and then reshape the last two dimensions of inpunt tensor x + so that it becomes one dimension, which is reverse to __split_heads. + """ + if len(x.shape) == 3: return x + if len(x.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + + trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. 
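The 0-valued entries in the reshape target below follow the fluid convention of copying the corresponding input dimension, so the batch and sequence axes survive while the head axes are merged. A tiny NumPy analogue of that convention, with an assumed helper name and toy shapes:

    import numpy as np

    def reshape_with_zeros(x, shape):
        """NumPy stand-in for the reshape convention used here: a 0 in `shape`
        means "keep the corresponding input dimension"."""
        resolved = [x.shape[i] if s == 0 else s for i, s in enumerate(shape)]
        return x.reshape(resolved)

    trans_x = np.zeros((2, 5, 4, 16))                     # [bs, seq, n_head, d_head]
    merged = reshape_with_zeros(trans_x, [0, 0, 4 * 16])  # -> (2, 5, 64)
    print(merged.shape)
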
+ return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) + + def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): + """ + Scaled Dot-Product Attention + """ + scaled_q = layers.scale(x=q, scale=d_key**-0.5) + product = layers.matmul(x=scaled_q, y=k, transpose_y=True) + if attn_bias: + product += attn_bias + weights = layers.softmax(product) + if dropout_rate: + weights = layers.dropout(weights, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.matmul(weights, v) + return out + + q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) + + if cache is not None: # use cache and concat time steps + # Since the inplace reshape in __split_heads changes the shape of k and + # v, which is the cache input for next time step, reshape the cache + # input from the previous time step first. + k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) + v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) + + q = __split_heads(q, n_head) + k = __split_heads(k, n_head) + v = __split_heads(v, n_head) + + ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) + + out = __combine_heads(ctx_multiheads) + + # Project back to the model size. + proj_out = layers.fc(input=out, + size=d_model, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), + bias_attr=name + '_output_fc.b_0') + return proj_out + + +def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): + """ + Position-wise Feed-Forward Networks. + This module consists of two linear transformations with a ReLU activation + in between, which is applied to each position separately and identically. + """ + hidden = layers.fc(input=x, + size=d_inner_hid, + num_flatten_dims=2, + act=hidden_act, + param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), + bias_attr=name + '_fc_0.b_0') + if dropout_rate: + hidden = layers.dropout(hidden, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.fc(input=hidden, + size=d_hid, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), + bias_attr=name + '_fc_1.b_0') + return out + + +def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): + """ + Add residual connection, layer normalization and droput to the out tensor + optionally according to the value of process_cmd. + This will be used before or after multi-head attention and position-wise + feed-forward networks. 
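For reference, the `scaled_dot_product_attention` helper added above computes the standard softmax(QK^T / sqrt(d_key) + attn_bias) V, where the large negative entries of `attn_bias` mask out padding positions. A minimal NumPy sketch of the same computation (dropout omitted; function and variable names invented here for illustration, not part of the diff):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, attn_bias=None):
    """softmax(q @ k^T / sqrt(d_key) + attn_bias) @ v, mirroring the fluid helper (dropout omitted)."""
    d_key = q.shape[-1]
    scores = (q * d_key ** -0.5) @ np.swapaxes(k, -1, -2)   # q is scaled first, exactly as in the diff
    if attn_bias is not None:
        scores = scores + attn_bias                         # additive mask: very negative at padded positions
    return softmax(scores) @ v

q = k = v = np.random.rand(2, 4, 5, 8)   # [bs, n_head, seq, d_head], illustrative sizes
out = attention(q, k, v)                 # -> shape (2, 4, 5, 8)
```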
+ """ + for cmd in process_cmd: + if cmd == "a": # add residual connection + out = out + prev_out if prev_out else out + elif cmd == "n": # add layer normalization + out_dtype = out.dtype + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float32") + out = layers.layer_norm(out, + begin_norm_axis=len(out.shape) - 1, + param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', + initializer=fluid.initializer.Constant(0.))) + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float16") + elif cmd == "d": # add dropout + if dropout_rate: + out = layers.dropout(out, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + return out + + +pre_process_layer = partial(pre_post_process_layer, None) +post_process_layer = pre_post_process_layer + + +def encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """The encoder layers that can be stacked to form a deep encoder. + This module consits of a multi-head (self) attention followed by + position-wise feed-forward networks and both the two components companied + with the post_process_layer to add residual connection, layer normalization + and droput. + """ + attn_output = multi_head_attention(pre_process_layer(enc_input, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_att'), + None, + None, + attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + param_initializer=param_initializer, + name=name + '_multi_head_att') + attn_output = post_process_layer(enc_input, + attn_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_att') + ffd_output = positionwise_feed_forward(pre_process_layer(attn_output, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_ffn'), + d_inner_hid, + d_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + '_ffn') + return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') + + +def encoder(enc_input, + attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """ + The encoder is composed of a stack of identical layers returned by calling + encoder_layer. 
+ """ + for i in range(n_layer): + enc_output = encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_layer_' + str(i)) + enc_input = enc_output + enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") + + return enc_output diff --git a/modules/text/semantic_model/ernie_tiny/module.py b/modules/text/language_model/ernie_tiny/module.py similarity index 88% rename from modules/text/semantic_model/ernie_tiny/module.py rename to modules/text/language_model/ernie_tiny/module.py index c6677af01361f1cfae383474b5687514cee10f4e..2edcd301ffd47ad50b5b89794a9a6f4a311a50b0 100644 --- a/modules/text/semantic_model/ernie_tiny/module.py +++ b/modules/text/language_model/ernie_tiny/module.py @@ -60,14 +60,13 @@ class ErnieTiny(TransformerModule): sequence_output (tensor): token-level output for sequence task. """ self.ernie_config._config_dict['use_task_id'] = False - ernie = ErnieModel( - src_ids=input_ids, - position_ids=position_ids, - sentence_ids=segment_ids, - task_ids=None, - input_mask=input_mask, - config=self.ernie_config, - use_fp16=False) + ernie = ErnieModel(src_ids=input_ids, + position_ids=position_ids, + sentence_ids=segment_ids, + task_ids=None, + input_mask=input_mask, + config=self.ernie_config, + use_fp16=False) pooled_output = ernie.get_pooled_output() sequence_output = ernie.get_sequence_output() return pooled_output, sequence_output diff --git a/modules/text/semantic_model/ernie_v2_eng_base/README.md b/modules/text/language_model/ernie_v2_eng_base/README.md similarity index 100% rename from modules/text/semantic_model/ernie_v2_eng_base/README.md rename to modules/text/language_model/ernie_v2_eng_base/README.md diff --git a/modules/text/semantic_model/ernie_v2_eng_base/__init__.py b/modules/text/language_model/ernie_v2_eng_base/__init__.py similarity index 100% rename from modules/text/semantic_model/ernie_v2_eng_base/__init__.py rename to modules/text/language_model/ernie_v2_eng_base/__init__.py diff --git a/modules/text/semantic_model/ernie_v2_eng_base/model/__init__.py b/modules/text/language_model/ernie_v2_eng_base/model/__init__.py similarity index 100% rename from modules/text/semantic_model/ernie_v2_eng_base/model/__init__.py rename to modules/text/language_model/ernie_v2_eng_base/model/__init__.py diff --git a/modules/text/semantic_model/ernie_v2_eng_base/model/ernie.py b/modules/text/language_model/ernie_v2_eng_base/model/ernie.py similarity index 50% rename from modules/text/semantic_model/ernie_v2_eng_base/model/ernie.py rename to modules/text/language_model/ernie_v2_eng_base/model/ernie.py index 8eb74d001161e9c6c64fc5920bf02578e8b0d7b1..0376b285e089990179bce27d858dd93f285ee4cb 100644 --- a/modules/text/semantic_model/ernie_v2_eng_base/model/ernie.py +++ b/modules/text/language_model/ernie_v2_eng_base/model/ernie.py @@ -95,34 +95,34 @@ class ErnieModel(object): def _build_model(self, src_ids, position_ids, sentence_ids, task_ids, input_mask): # padding id in vocabulary must be set to 0 - emb_out = fluid.layers.embedding( - input=src_ids, - size=[self._voc_size, self._emb_size], - dtype=self._emb_dtype, - param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), - is_sparse=False) - - position_emb_out = fluid.layers.embedding( - input=position_ids, - size=[self._max_position_seq_len, 
self._emb_size], - dtype=self._emb_dtype, - param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) - - sent_emb_out = fluid.layers.embedding( - sentence_ids, - size=[self._sent_types, self._emb_size], - dtype=self._emb_dtype, - param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) + emb_out = fluid.layers.embedding(input=src_ids, + size=[self._voc_size, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr(name=self._word_emb_name, + initializer=self._param_initializer), + is_sparse=False) + + position_emb_out = fluid.layers.embedding(input=position_ids, + size=[self._max_position_seq_len, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr(name=self._pos_emb_name, + initializer=self._param_initializer)) + + sent_emb_out = fluid.layers.embedding(sentence_ids, + size=[self._sent_types, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr(name=self._sent_emb_name, + initializer=self._param_initializer)) emb_out = emb_out + position_emb_out emb_out = emb_out + sent_emb_out if self._use_task_id: - task_emb_out = fluid.layers.embedding( - task_ids, - size=[self._task_types, self._emb_size], - dtype=self._emb_dtype, - param_attr=fluid.ParamAttr(name=self._task_emb_name, initializer=self._param_initializer)) + task_emb_out = fluid.layers.embedding(task_ids, + size=[self._task_types, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr(name=self._task_emb_name, + initializer=self._param_initializer)) emb_out = emb_out + task_emb_out @@ -137,23 +137,22 @@ class ErnieModel(object): n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask.stop_gradient = True - self._enc_out = encoder( - enc_input=emb_out, - attn_bias=n_head_self_attn_mask, - n_layer=self._n_layer, - n_head=self._n_head, - d_key=self._emb_size // self._n_head, - d_value=self._emb_size // self._n_head, - d_model=self._emb_size, - d_inner_hid=self._emb_size * 4, - prepostprocess_dropout=self._prepostprocess_dropout, - attention_dropout=self._attention_dropout, - relu_dropout=0, - hidden_act=self._hidden_act, - preprocess_cmd="", - postprocess_cmd="dan", - param_initializer=self._param_initializer, - name='encoder') + self._enc_out = encoder(enc_input=emb_out, + attn_bias=n_head_self_attn_mask, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._emb_size // self._n_head, + d_value=self._emb_size // self._n_head, + d_model=self._emb_size, + d_inner_hid=self._emb_size * 4, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + hidden_act=self._hidden_act, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer, + name='encoder') def get_sequence_output(self): return self._enc_out @@ -163,12 +162,12 @@ class ErnieModel(object): next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) if self._dtype == "float16": next_sent_feat = fluid.layers.cast(x=next_sent_feat, dtype=self._emb_dtype) - next_sent_feat = fluid.layers.fc( - input=next_sent_feat, - size=self._emb_size, - act="tanh", - param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), - bias_attr="pooled_fc.b_0") + next_sent_feat = fluid.layers.fc(input=next_sent_feat, + size=self._emb_size, + act="tanh", + param_attr=fluid.ParamAttr(name="pooled_fc.w_0", + initializer=self._param_initializer), + bias_attr="pooled_fc.b_0") 
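`get_pooled_output` above follows the usual BERT/ERNIE pooling: slice the hidden state of the first token of `_enc_out` and pass it through a tanh-activated fully connected layer (`pooled_fc`). A shape-level NumPy sketch with random stand-in weights (illustrative only, not the fluid code):

```python
import numpy as np

bs, seq_len, emb_size = 2, 128, 768                 # illustrative sizes
enc_out = np.random.rand(bs, seq_len, emb_size)     # sequence output from the encoder

w = np.random.rand(emb_size, emb_size)              # stand-in for pooled_fc.w_0
b = np.zeros(emb_size)                              # stand-in for pooled_fc.b_0

first_token = enc_out[:, 0, :]                      # slice the first ([CLS]) position -> [bs, emb_size]
pooled = np.tanh(first_token @ w + b)               # tanh FC -> [bs, emb_size]
```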
return next_sent_feat def get_lm_output(self, mask_label, mask_pos): @@ -185,40 +184,42 @@ class ErnieModel(object): mask_feat = fluid.layers.cast(x=mask_feat, dtype=self._emb_dtype) # transform: fc - mask_trans_feat = fluid.layers.fc( - input=mask_feat, - size=self._emb_size, - act=self._hidden_act, - param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer), - bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) + mask_trans_feat = fluid.layers.fc(input=mask_feat, + size=self._emb_size, + act=self._hidden_act, + param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', + initializer=self._param_initializer), + bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) # transform: layer norm - mask_trans_feat = fluid.layers.layer_norm( - mask_trans_feat, - begin_norm_axis=len(mask_trans_feat.shape) - 1, - param_attr=fluid.ParamAttr( - name='mask_lm_trans_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), - bias_attr=fluid.ParamAttr(name='mask_lm_trans_layer_norm_bias', initializer=fluid.initializer.Constant(1.))) + mask_trans_feat = fluid.layers.layer_norm(mask_trans_feat, + begin_norm_axis=len(mask_trans_feat.shape) - 1, + param_attr=fluid.ParamAttr( + name='mask_lm_trans_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr(name='mask_lm_trans_layer_norm_bias', + initializer=fluid.initializer.Constant(1.))) # transform: layer norm #mask_trans_feat = pre_process_layer( # mask_trans_feat, 'n', name='mask_lm_trans') - mask_lm_out_bias_attr = fluid.ParamAttr( - name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0)) + mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0", + initializer=fluid.initializer.Constant(value=0.0)) if self._weight_sharing: - fc_out = fluid.layers.matmul( - x=mask_trans_feat, - y=fluid.default_main_program().global_block().var(self._word_emb_name), - transpose_y=True) - fc_out += fluid.layers.create_parameter( - shape=[self._voc_size], dtype=self._emb_dtype, attr=mask_lm_out_bias_attr, is_bias=True) + fc_out = fluid.layers.matmul(x=mask_trans_feat, + y=fluid.default_main_program().global_block().var(self._word_emb_name), + transpose_y=True) + fc_out += fluid.layers.create_parameter(shape=[self._voc_size], + dtype=self._emb_dtype, + attr=mask_lm_out_bias_attr, + is_bias=True) else: - fc_out = fluid.layers.fc( - input=mask_trans_feat, - size=self._voc_size, - param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), - bias_attr=mask_lm_out_bias_attr) + fc_out = fluid.layers.fc(input=mask_trans_feat, + size=self._voc_size, + param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", + initializer=self._param_initializer), + bias_attr=mask_lm_out_bias_attr) mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) @@ -226,13 +227,14 @@ class ErnieModel(object): return mean_mask_lm_loss def get_task_output(self, task, task_labels): - task_fc_out = fluid.layers.fc( - input=self.next_sent_feat, - size=task["num_labels"], - param_attr=fluid.ParamAttr(name=task["task_name"] + "_fc.w_0", initializer=self._param_initializer), - bias_attr=task["task_name"] + "_fc.b_0") - task_loss, task_softmax = fluid.layers.softmax_with_cross_entropy( - logits=task_fc_out, label=task_labels, return_softmax=True) + task_fc_out = fluid.layers.fc(input=self.next_sent_feat, + size=task["num_labels"], + param_attr=fluid.ParamAttr(name=task["task_name"] + "_fc.w_0", + 
initializer=self._param_initializer), + bias_attr=task["task_name"] + "_fc.b_0") + task_loss, task_softmax = fluid.layers.softmax_with_cross_entropy(logits=task_fc_out, + label=task_labels, + return_softmax=True) task_acc = fluid.layers.accuracy(input=task_softmax, label=task_labels) mean_task_loss = fluid.layers.mean(task_loss) return mean_task_loss, task_acc diff --git a/modules/text/language_model/ernie_v2_eng_base/model/transformer_encoder.py b/modules/text/language_model/ernie_v2_eng_base/model/transformer_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..b15d838883fdad1e3432e4ea8715f2320d67929f --- /dev/null +++ b/modules/text/language_model/ernie_v2_eng_base/model/transformer_encoder.py @@ -0,0 +1,295 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Transformer encoder.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from functools import partial + +import paddle.fluid as fluid +import paddle.fluid.layers as layers + + +def multi_head_attention(queries, + keys, + values, + attn_bias, + d_key, + d_value, + d_model, + n_head=1, + dropout_rate=0., + cache=None, + param_initializer=None, + name='multi_head_att'): + """ + Multi-Head Attention. Note that attn_bias is added to the logit before + computing softmax activiation to mask certain selected positions so that + they will not considered in attention weights. + """ + keys = queries if keys is None else keys + values = keys if values is None else values + + if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): + raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") + + def __compute_qkv(queries, keys, values, n_head, d_key, d_value): + """ + Add linear projection to queries, keys, and values. + """ + q = layers.fc(input=queries, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), + bias_attr=name + '_query_fc.b_0') + k = layers.fc(input=keys, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), + bias_attr=name + '_key_fc.b_0') + v = layers.fc(input=values, + size=d_value * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), + bias_attr=name + '_value_fc.b_0') + return q, k, v + + def __split_heads(x, n_head): + """ + Reshape the last dimension of inpunt tensor x so that it becomes two + dimensions and then transpose. Specifically, input a tensor with shape + [bs, max_sequence_length, n_head * hidden_dim] then output a tensor + with shape [bs, n_head, max_sequence_length, hidden_dim]. + """ + hidden_size = x.shape[-1] + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. 
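In the `get_lm_output` branch above where `self._weight_sharing` is true, the masked-LM logits are not produced by a separate output projection; instead the transformed features are multiplied by the transpose of the shared word embedding table and a per-vocabulary bias is added (tied input and output embeddings). In NumPy terms (sizes and values illustrative only):

```python
import numpy as np

voc_size, emb_size, n_masked = 30000, 768, 7        # illustrative sizes
word_emb = np.random.rand(voc_size, emb_size)       # shared word embedding table
mask_trans_feat = np.random.rand(n_masked, emb_size)
out_bias = np.zeros(voc_size)                       # stand-in for mask_lm_out_fc.b_0

logits = mask_trans_feat @ word_emb.T + out_bias    # [n_masked, voc_size], same as the matmul branch
```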
+ reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) + + # permuate the dimensions into: + # [batch_size, n_head, max_sequence_len, hidden_size_per_head] + return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) + + def __combine_heads(x): + """ + Transpose and then reshape the last two dimensions of inpunt tensor x + so that it becomes one dimension, which is reverse to __split_heads. + """ + if len(x.shape) == 3: return x + if len(x.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + + trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) + + def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): + """ + Scaled Dot-Product Attention + """ + scaled_q = layers.scale(x=q, scale=d_key**-0.5) + product = layers.matmul(x=scaled_q, y=k, transpose_y=True) + if attn_bias: + product += attn_bias + weights = layers.softmax(product) + if dropout_rate: + weights = layers.dropout(weights, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.matmul(weights, v) + return out + + q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) + + if cache is not None: # use cache and concat time steps + # Since the inplace reshape in __split_heads changes the shape of k and + # v, which is the cache input for next time step, reshape the cache + # input from the previous time step first. + k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) + v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) + + q = __split_heads(q, n_head) + k = __split_heads(k, n_head) + v = __split_heads(v, n_head) + + ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) + + out = __combine_heads(ctx_multiheads) + + # Project back to the model size. + proj_out = layers.fc(input=out, + size=d_model, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), + bias_attr=name + '_output_fc.b_0') + return proj_out + + +def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): + """ + Position-wise Feed-Forward Networks. + This module consists of two linear transformations with a ReLU activation + in between, which is applied to each position separately and identically. + """ + hidden = layers.fc(input=x, + size=d_inner_hid, + num_flatten_dims=2, + act=hidden_act, + param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), + bias_attr=name + '_fc_0.b_0') + if dropout_rate: + hidden = layers.dropout(hidden, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.fc(input=hidden, + size=d_hid, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), + bias_attr=name + '_fc_1.b_0') + return out + + +def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): + """ + Add residual connection, layer normalization and droput to the out tensor + optionally according to the value of process_cmd. + This will be used before or after multi-head attention and position-wise + feed-forward networks. 
+ """ + for cmd in process_cmd: + if cmd == "a": # add residual connection + out = out + prev_out if prev_out else out + elif cmd == "n": # add layer normalization + out_dtype = out.dtype + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float32") + out = layers.layer_norm(out, + begin_norm_axis=len(out.shape) - 1, + param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', + initializer=fluid.initializer.Constant(0.))) + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float16") + elif cmd == "d": # add dropout + if dropout_rate: + out = layers.dropout(out, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + return out + + +pre_process_layer = partial(pre_post_process_layer, None) +post_process_layer = pre_post_process_layer + + +def encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """The encoder layers that can be stacked to form a deep encoder. + This module consits of a multi-head (self) attention followed by + position-wise feed-forward networks and both the two components companied + with the post_process_layer to add residual connection, layer normalization + and droput. + """ + attn_output = multi_head_attention(pre_process_layer(enc_input, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_att'), + None, + None, + attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + param_initializer=param_initializer, + name=name + '_multi_head_att') + attn_output = post_process_layer(enc_input, + attn_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_att') + ffd_output = positionwise_feed_forward(pre_process_layer(attn_output, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_ffn'), + d_inner_hid, + d_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + '_ffn') + return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') + + +def encoder(enc_input, + attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """ + The encoder is composed of a stack of identical layers returned by calling + encoder_layer. 
+ """ + for i in range(n_layer): + enc_output = encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_layer_' + str(i)) + enc_input = enc_output + enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") + + return enc_output diff --git a/modules/text/semantic_model/ernie_v2_eng_base/module.py b/modules/text/language_model/ernie_v2_eng_base/module.py similarity index 88% rename from modules/text/semantic_model/ernie_v2_eng_base/module.py rename to modules/text/language_model/ernie_v2_eng_base/module.py index fd2d957f0115a743d6d77dd622e1f1c81a8a8146..d292cba976a52f911171335e29a9633d22d06e1e 100644 --- a/modules/text/semantic_model/ernie_v2_eng_base/module.py +++ b/modules/text/language_model/ernie_v2_eng_base/module.py @@ -59,14 +59,13 @@ class ErnieV2EngBase(TransformerModule): sequence_output (tensor): token-level output for sequence task. """ self.ernie_config._config_dict['use_task_id'] = False - ernie = ErnieModel( - src_ids=input_ids, - position_ids=position_ids, - sentence_ids=segment_ids, - task_ids=None, - input_mask=input_mask, - config=self.ernie_config, - use_fp16=False) + ernie = ErnieModel(src_ids=input_ids, + position_ids=position_ids, + sentence_ids=segment_ids, + task_ids=None, + input_mask=input_mask, + config=self.ernie_config, + use_fp16=False) pooled_output = ernie.get_pooled_output() sequence_output = ernie.get_sequence_output() return pooled_output, sequence_output diff --git a/modules/text/semantic_model/ernie_v2_eng_large/README.md b/modules/text/language_model/ernie_v2_eng_large/README.md similarity index 100% rename from modules/text/semantic_model/ernie_v2_eng_large/README.md rename to modules/text/language_model/ernie_v2_eng_large/README.md diff --git a/modules/text/semantic_model/ernie_v2_eng_large/__init__.py b/modules/text/language_model/ernie_v2_eng_large/__init__.py similarity index 100% rename from modules/text/semantic_model/ernie_v2_eng_large/__init__.py rename to modules/text/language_model/ernie_v2_eng_large/__init__.py diff --git a/modules/text/semantic_model/ernie_v2_eng_large/model/__init__.py b/modules/text/language_model/ernie_v2_eng_large/model/__init__.py similarity index 100% rename from modules/text/semantic_model/ernie_v2_eng_large/model/__init__.py rename to modules/text/language_model/ernie_v2_eng_large/model/__init__.py diff --git a/modules/text/semantic_model/ernie_v2_eng_large/model/ernie.py b/modules/text/language_model/ernie_v2_eng_large/model/ernie.py similarity index 50% rename from modules/text/semantic_model/ernie_v2_eng_large/model/ernie.py rename to modules/text/language_model/ernie_v2_eng_large/model/ernie.py index c2ae69262362f15bb49cca955b48786852fdfe1b..80f1a38204e788915e72fddda11617b98660d202 100644 --- a/modules/text/semantic_model/ernie_v2_eng_large/model/ernie.py +++ b/modules/text/language_model/ernie_v2_eng_large/model/ernie.py @@ -95,34 +95,34 @@ class ErnieModel(object): def _build_model(self, src_ids, position_ids, sentence_ids, task_ids, input_mask): # padding id in vocabulary must be set to 0 - emb_out = fluid.layers.embedding( - input=src_ids, - size=[self._voc_size, self._emb_size], - dtype=self._emb_dtype, - param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), - is_sparse=False) - - position_emb_out = 
fluid.layers.embedding( - input=position_ids, - size=[self._max_position_seq_len, self._emb_size], - dtype=self._emb_dtype, - param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) - - sent_emb_out = fluid.layers.embedding( - sentence_ids, - size=[self._sent_types, self._emb_size], - dtype=self._emb_dtype, - param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) + emb_out = fluid.layers.embedding(input=src_ids, + size=[self._voc_size, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr(name=self._word_emb_name, + initializer=self._param_initializer), + is_sparse=False) + + position_emb_out = fluid.layers.embedding(input=position_ids, + size=[self._max_position_seq_len, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr(name=self._pos_emb_name, + initializer=self._param_initializer)) + + sent_emb_out = fluid.layers.embedding(sentence_ids, + size=[self._sent_types, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr(name=self._sent_emb_name, + initializer=self._param_initializer)) emb_out = emb_out + position_emb_out emb_out = emb_out + sent_emb_out if self._use_task_id: - task_emb_out = fluid.layers.embedding( - task_ids, - size=[self._task_types, self._emb_size], - dtype=self._emb_dtype, - param_attr=fluid.ParamAttr(name=self._task_emb_name, initializer=self._param_initializer)) + task_emb_out = fluid.layers.embedding(task_ids, + size=[self._task_types, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr(name=self._task_emb_name, + initializer=self._param_initializer)) emb_out = emb_out + task_emb_out @@ -137,23 +137,22 @@ class ErnieModel(object): n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask.stop_gradient = True - self._enc_out = encoder( - enc_input=emb_out, - attn_bias=n_head_self_attn_mask, - n_layer=self._n_layer, - n_head=self._n_head, - d_key=self._emb_size // self._n_head, - d_value=self._emb_size // self._n_head, - d_model=self._emb_size, - d_inner_hid=self._emb_size * 4, - prepostprocess_dropout=self._prepostprocess_dropout, - attention_dropout=self._attention_dropout, - relu_dropout=0, - hidden_act=self._hidden_act, - preprocess_cmd="", - postprocess_cmd="dan", - param_initializer=self._param_initializer, - name='encoder') + self._enc_out = encoder(enc_input=emb_out, + attn_bias=n_head_self_attn_mask, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._emb_size // self._n_head, + d_value=self._emb_size // self._n_head, + d_model=self._emb_size, + d_inner_hid=self._emb_size * 4, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + hidden_act=self._hidden_act, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer, + name='encoder') def get_sequence_output(self): return self._enc_out @@ -163,12 +162,12 @@ class ErnieModel(object): next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) if self._dtype == "float16": next_sent_feat = fluid.layers.cast(x=next_sent_feat, dtype=self._emb_dtype) - next_sent_feat = fluid.layers.fc( - input=next_sent_feat, - size=self._emb_size, - act="tanh", - param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), - bias_attr="pooled_fc.b_0") + next_sent_feat = fluid.layers.fc(input=next_sent_feat, + size=self._emb_size, + act="tanh", + 
param_attr=fluid.ParamAttr(name="pooled_fc.w_0", + initializer=self._param_initializer), + bias_attr="pooled_fc.b_0") return next_sent_feat def get_lm_output(self, mask_label, mask_pos): @@ -185,40 +184,42 @@ class ErnieModel(object): mask_feat = fluid.layers.cast(x=mask_feat, dtype=self._emb_dtype) # transform: fc - mask_trans_feat = fluid.layers.fc( - input=mask_feat, - size=self._emb_size, - act=self._hidden_act, - param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer), - bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) + mask_trans_feat = fluid.layers.fc(input=mask_feat, + size=self._emb_size, + act=self._hidden_act, + param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', + initializer=self._param_initializer), + bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) # transform: layer norm - mask_trans_feat = fluid.layers.layer_norm( - mask_trans_feat, - begin_norm_axis=len(mask_trans_feat.shape) - 1, - param_attr=fluid.ParamAttr( - name='mask_lm_trans_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), - bias_attr=fluid.ParamAttr(name='mask_lm_trans_layer_norm_bias', initializer=fluid.initializer.Constant(1.))) + mask_trans_feat = fluid.layers.layer_norm(mask_trans_feat, + begin_norm_axis=len(mask_trans_feat.shape) - 1, + param_attr=fluid.ParamAttr( + name='mask_lm_trans_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr(name='mask_lm_trans_layer_norm_bias', + initializer=fluid.initializer.Constant(1.))) # transform: layer norm #mask_trans_feat = pre_process_layer( # mask_trans_feat, 'n', name='mask_lm_trans') - mask_lm_out_bias_attr = fluid.ParamAttr( - name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0)) + mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0", + initializer=fluid.initializer.Constant(value=0.0)) if self._weight_sharing: - fc_out = fluid.layers.matmul( - x=mask_trans_feat, - y=fluid.default_main_program().global_block().var(self._word_emb_name), - transpose_y=True) - fc_out += fluid.layers.create_parameter( - shape=[self._voc_size], dtype=self._emb_dtype, attr=mask_lm_out_bias_attr, is_bias=True) + fc_out = fluid.layers.matmul(x=mask_trans_feat, + y=fluid.default_main_program().global_block().var(self._word_emb_name), + transpose_y=True) + fc_out += fluid.layers.create_parameter(shape=[self._voc_size], + dtype=self._emb_dtype, + attr=mask_lm_out_bias_attr, + is_bias=True) else: - fc_out = fluid.layers.fc( - input=mask_trans_feat, - size=self._voc_size, - param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), - bias_attr=mask_lm_out_bias_attr) + fc_out = fluid.layers.fc(input=mask_trans_feat, + size=self._voc_size, + param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", + initializer=self._param_initializer), + bias_attr=mask_lm_out_bias_attr) mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) @@ -226,13 +227,14 @@ class ErnieModel(object): return mean_mask_lm_loss def get_task_output(self, task, task_labels): - task_fc_out = fluid.layers.fc( - input=self.next_sent_feat, - size=task["num_labels"], - param_attr=fluid.ParamAttr(name=task["task_name"] + "_fc.w_0", initializer=self._param_initializer), - bias_attr=task["task_name"] + "_fc.b_0") - task_loss, task_softmax = fluid.layers.softmax_with_cross_entropy( - logits=task_fc_out, label=task_labels, return_softmax=True) + task_fc_out = 
fluid.layers.fc(input=self.next_sent_feat, + size=task["num_labels"], + param_attr=fluid.ParamAttr(name=task["task_name"] + "_fc.w_0", + initializer=self._param_initializer), + bias_attr=task["task_name"] + "_fc.b_0") + task_loss, task_softmax = fluid.layers.softmax_with_cross_entropy(logits=task_fc_out, + label=task_labels, + return_softmax=True) task_acc = fluid.layers.accuracy(input=task_softmax, label=task_labels) mean_task_loss = fluid.layers.mean(task_loss) return mean_task_loss, task_acc diff --git a/modules/text/language_model/ernie_v2_eng_large/model/transformer_encoder.py b/modules/text/language_model/ernie_v2_eng_large/model/transformer_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..b15d838883fdad1e3432e4ea8715f2320d67929f --- /dev/null +++ b/modules/text/language_model/ernie_v2_eng_large/model/transformer_encoder.py @@ -0,0 +1,295 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Transformer encoder.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from functools import partial + +import paddle.fluid as fluid +import paddle.fluid.layers as layers + + +def multi_head_attention(queries, + keys, + values, + attn_bias, + d_key, + d_value, + d_model, + n_head=1, + dropout_rate=0., + cache=None, + param_initializer=None, + name='multi_head_att'): + """ + Multi-Head Attention. Note that attn_bias is added to the logit before + computing softmax activiation to mask certain selected positions so that + they will not considered in attention weights. + """ + keys = queries if keys is None else keys + values = keys if values is None else values + + if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): + raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") + + def __compute_qkv(queries, keys, values, n_head, d_key, d_value): + """ + Add linear projection to queries, keys, and values. + """ + q = layers.fc(input=queries, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), + bias_attr=name + '_query_fc.b_0') + k = layers.fc(input=keys, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), + bias_attr=name + '_key_fc.b_0') + v = layers.fc(input=values, + size=d_value * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), + bias_attr=name + '_value_fc.b_0') + return q, k, v + + def __split_heads(x, n_head): + """ + Reshape the last dimension of inpunt tensor x so that it becomes two + dimensions and then transpose. Specifically, input a tensor with shape + [bs, max_sequence_length, n_head * hidden_dim] then output a tensor + with shape [bs, n_head, max_sequence_length, hidden_dim]. 
+ """ + hidden_size = x.shape[-1] + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) + + # permuate the dimensions into: + # [batch_size, n_head, max_sequence_len, hidden_size_per_head] + return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) + + def __combine_heads(x): + """ + Transpose and then reshape the last two dimensions of inpunt tensor x + so that it becomes one dimension, which is reverse to __split_heads. + """ + if len(x.shape) == 3: return x + if len(x.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + + trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) + + def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): + """ + Scaled Dot-Product Attention + """ + scaled_q = layers.scale(x=q, scale=d_key**-0.5) + product = layers.matmul(x=scaled_q, y=k, transpose_y=True) + if attn_bias: + product += attn_bias + weights = layers.softmax(product) + if dropout_rate: + weights = layers.dropout(weights, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.matmul(weights, v) + return out + + q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) + + if cache is not None: # use cache and concat time steps + # Since the inplace reshape in __split_heads changes the shape of k and + # v, which is the cache input for next time step, reshape the cache + # input from the previous time step first. + k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) + v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) + + q = __split_heads(q, n_head) + k = __split_heads(k, n_head) + v = __split_heads(v, n_head) + + ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) + + out = __combine_heads(ctx_multiheads) + + # Project back to the model size. + proj_out = layers.fc(input=out, + size=d_model, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), + bias_attr=name + '_output_fc.b_0') + return proj_out + + +def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): + """ + Position-wise Feed-Forward Networks. + This module consists of two linear transformations with a ReLU activation + in between, which is applied to each position separately and identically. 
+ """ + hidden = layers.fc(input=x, + size=d_inner_hid, + num_flatten_dims=2, + act=hidden_act, + param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), + bias_attr=name + '_fc_0.b_0') + if dropout_rate: + hidden = layers.dropout(hidden, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.fc(input=hidden, + size=d_hid, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), + bias_attr=name + '_fc_1.b_0') + return out + + +def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): + """ + Add residual connection, layer normalization and droput to the out tensor + optionally according to the value of process_cmd. + This will be used before or after multi-head attention and position-wise + feed-forward networks. + """ + for cmd in process_cmd: + if cmd == "a": # add residual connection + out = out + prev_out if prev_out else out + elif cmd == "n": # add layer normalization + out_dtype = out.dtype + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float32") + out = layers.layer_norm(out, + begin_norm_axis=len(out.shape) - 1, + param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', + initializer=fluid.initializer.Constant(0.))) + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float16") + elif cmd == "d": # add dropout + if dropout_rate: + out = layers.dropout(out, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + return out + + +pre_process_layer = partial(pre_post_process_layer, None) +post_process_layer = pre_post_process_layer + + +def encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """The encoder layers that can be stacked to form a deep encoder. + This module consits of a multi-head (self) attention followed by + position-wise feed-forward networks and both the two components companied + with the post_process_layer to add residual connection, layer normalization + and droput. 
+ """ + attn_output = multi_head_attention(pre_process_layer(enc_input, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_att'), + None, + None, + attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + param_initializer=param_initializer, + name=name + '_multi_head_att') + attn_output = post_process_layer(enc_input, + attn_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_att') + ffd_output = positionwise_feed_forward(pre_process_layer(attn_output, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_ffn'), + d_inner_hid, + d_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + '_ffn') + return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') + + +def encoder(enc_input, + attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """ + The encoder is composed of a stack of identical layers returned by calling + encoder_layer. + """ + for i in range(n_layer): + enc_output = encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_layer_' + str(i)) + enc_input = enc_output + enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") + + return enc_output diff --git a/modules/text/semantic_model/ernie_v2_eng_large/module.py b/modules/text/language_model/ernie_v2_eng_large/module.py similarity index 88% rename from modules/text/semantic_model/ernie_v2_eng_large/module.py rename to modules/text/language_model/ernie_v2_eng_large/module.py index 8d15bea6c27be2afc07da784ccbcb17d2f9809ee..ee014dc6d5a34d97c51c198752c117ad81b42473 100644 --- a/modules/text/semantic_model/ernie_v2_eng_large/module.py +++ b/modules/text/language_model/ernie_v2_eng_large/module.py @@ -59,14 +59,13 @@ class ErnieV2EngLarge(TransformerModule): sequence_output (tensor): token-level output for sequence task. 
""" self.ernie_config._config_dict['use_task_id'] = False - ernie = ErnieModel( - src_ids=input_ids, - position_ids=position_ids, - sentence_ids=segment_ids, - task_ids=None, - input_mask=input_mask, - config=self.ernie_config, - use_fp16=False) + ernie = ErnieModel(src_ids=input_ids, + position_ids=position_ids, + sentence_ids=segment_ids, + task_ids=None, + input_mask=input_mask, + config=self.ernie_config, + use_fp16=False) pooled_output = ernie.get_pooled_output() sequence_output = ernie.get_sequence_output() return pooled_output, sequence_output diff --git a/modules/text/semantic_model/lda_news/README.md b/modules/text/language_model/lda_news/README.md similarity index 100% rename from modules/text/semantic_model/lda_news/README.md rename to modules/text/language_model/lda_news/README.md diff --git a/modules/text/semantic_model/lda_news/__init__.py b/modules/text/language_model/lda_news/__init__.py similarity index 100% rename from modules/text/semantic_model/lda_news/__init__.py rename to modules/text/language_model/lda_news/__init__.py diff --git a/modules/text/semantic_model/lda_news/config.py b/modules/text/language_model/lda_news/config.py similarity index 100% rename from modules/text/semantic_model/lda_news/config.py rename to modules/text/language_model/lda_news/config.py diff --git a/modules/text/semantic_model/slda_news/document.py b/modules/text/language_model/lda_news/document.py similarity index 97% rename from modules/text/semantic_model/slda_news/document.py rename to modules/text/language_model/lda_news/document.py index 4476230a5c9bc8d545b52386dbf00a201e59b468..98eae50511e6621db61c9700ff5e44d302667666 100644 --- a/modules/text/semantic_model/slda_news/document.py +++ b/modules/text/language_model/lda_news/document.py @@ -5,7 +5,6 @@ class Topic(object): """Basic data structure of topic, contains topic id and corresponding probability. """ - def __init__(self, tid, prob): self.tid = tid # topic id self.prob = prob # topic probability @@ -15,7 +14,6 @@ class Token(object): """Basic storage unit of LDA documents, contains word id and corresponding topic. """ - def __init__(self, topic, id): self.topic = topic self.id = id @@ -25,7 +23,6 @@ class Sentence(object): """Basic storage unit of SentenceLDA documents, contains word ids of the sentence and its corresponding topic id. """ - def __init__(self, topic, tokens): self.topic = topic self.tokens = tokens @@ -34,7 +31,6 @@ class Sentence(object): class LDADoc(object): """The storage structure of LDA model's inference result. """ - def __init__(self): self._num_topics = None # Number of topics. self._num_accum = None # Number of accumulated sample rounds. @@ -120,8 +116,8 @@ class LDADoc(object): dense_dist = np.zeros(self._num_topics) if self.size() == 0: return dense_dist - dense_dist = (self._accum_topic_sum * 1.0 / self._num_accum + self._alpha) / ( - self.size() + self._alpha * self._num_topics) + dense_dist = (self._accum_topic_sum * 1.0 / self._num_accum + self._alpha) / (self.size() + + self._alpha * self._num_topics) return dense_dist def accumulate_topic_num(self): @@ -133,7 +129,6 @@ class SLDADoc(LDADoc): """Sentence LDA Document, inherited from LDADoc. Add add_sentence interface. 
""" - def __init__(self): super().__init__() self.__sentences = None diff --git a/modules/text/semantic_model/lda_news/inference_engine.py b/modules/text/language_model/lda_news/inference_engine.py similarity index 100% rename from modules/text/semantic_model/lda_news/inference_engine.py rename to modules/text/language_model/lda_news/inference_engine.py diff --git a/modules/text/semantic_model/lda_news/model.py b/modules/text/language_model/lda_news/model.py similarity index 99% rename from modules/text/semantic_model/lda_news/model.py rename to modules/text/language_model/lda_news/model.py index 3ef089f9058e1a405d24440b152767fac745df9b..11f186d45fafc013aa1b57bdb945006daef5cfa3 100644 --- a/modules/text/semantic_model/lda_news/model.py +++ b/modules/text/language_model/lda_news/model.py @@ -11,7 +11,6 @@ from lda_news.vocab import Vocab, WordCount class TopicModel(object): """Storage Structure of Topic model, including vocabulary and word topic count. """ - def __init__(self, model_dir, config): """ Args: diff --git a/modules/text/semantic_model/lda_news/module.py b/modules/text/language_model/lda_news/module.py similarity index 96% rename from modules/text/semantic_model/lda_news/module.py rename to modules/text/language_model/lda_news/module.py index 6066ce0dc61382d085587a4634f259879c3441e9..1a0e5f8fd7942e9cb5b87dbfce2df0ed619f6d23 100644 --- a/modules/text/semantic_model/lda_news/module.py +++ b/modules/text/language_model/lda_news/module.py @@ -105,8 +105,9 @@ class TopicModel(hub.Module): wd = WordAndDis() wd.word = word sm = SemanticMatching() - wd.distance = sm.likelihood_based_similarity( - terms=[word], doc_topic_dist=doc_topic_dist, model=self.__engine.get_model()) + wd.distance = sm.likelihood_based_similarity(terms=[word], + doc_topic_dist=doc_topic_dist, + model=self.__engine.get_model()) items.append(wd) def take_elem(word_dis): diff --git a/modules/text/semantic_model/lda_news/sampler.py b/modules/text/language_model/lda_news/sampler.py similarity index 100% rename from modules/text/semantic_model/lda_news/sampler.py rename to modules/text/language_model/lda_news/sampler.py diff --git a/modules/text/semantic_model/lda_news/semantic_matching.py b/modules/text/language_model/lda_news/semantic_matching.py similarity index 100% rename from modules/text/semantic_model/lda_news/semantic_matching.py rename to modules/text/language_model/lda_news/semantic_matching.py diff --git a/modules/text/semantic_model/lda_news/tokenizer.py b/modules/text/language_model/lda_news/tokenizer.py similarity index 99% rename from modules/text/semantic_model/lda_news/tokenizer.py rename to modules/text/language_model/lda_news/tokenizer.py index e59037d88207298d227fc0f476b83f3229e80c36..c07b3c4da066fe382352c563e448f34f0114b9fb 100644 --- a/modules/text/semantic_model/lda_news/tokenizer.py +++ b/modules/text/language_model/lda_news/tokenizer.py @@ -5,7 +5,6 @@ class Tokenizer(object): """Base tokenizer class. """ - def __init__(self): pass @@ -19,7 +18,6 @@ class SimpleTokenizer(Tokenizer): Notes: This tokenizer can only recognize the words in the corresponding vocab file. 
""" - def __init__(self, vocab_path): super().__init__() self.__max_word_len = 0 diff --git a/modules/text/semantic_model/lda_news/util.py b/modules/text/language_model/lda_news/util.py similarity index 99% rename from modules/text/semantic_model/lda_news/util.py rename to modules/text/language_model/lda_news/util.py index 9f1ebdab79063fe0aa91838ad80f01c75f76e51f..e589602ddd1f1a02be38d5cc2a5c88c6cd51c504 100644 --- a/modules/text/semantic_model/lda_news/util.py +++ b/modules/text/language_model/lda_news/util.py @@ -46,7 +46,6 @@ def rand_k(k): def timeit(f): """Return time cost of function f. """ - def timed(*args, **kwargs): start_time = time.time() result = f(*args, **kwargs) diff --git a/modules/text/semantic_model/lda_news/vocab.py b/modules/text/language_model/lda_news/vocab.py similarity index 100% rename from modules/text/semantic_model/lda_news/vocab.py rename to modules/text/language_model/lda_news/vocab.py diff --git a/modules/text/semantic_model/lda_news/vose_alias.py b/modules/text/language_model/lda_news/vose_alias.py similarity index 99% rename from modules/text/semantic_model/lda_news/vose_alias.py rename to modules/text/language_model/lda_news/vose_alias.py index e4f158b6eaf10786c299fc26f772b80dddee6c36..be80ee8ddb2e2872fded119c8cd216537a1206ed 100644 --- a/modules/text/semantic_model/lda_news/vose_alias.py +++ b/modules/text/language_model/lda_news/vose_alias.py @@ -6,7 +6,6 @@ from lda_news.util import rand, rand_k class VoseAlias(object): """Vose's Alias Method. """ - def __init__(self): self.__alias = None self.__prob = None # np.array diff --git a/modules/text/semantic_model/lda_novel/README.md b/modules/text/language_model/lda_novel/README.md similarity index 100% rename from modules/text/semantic_model/lda_novel/README.md rename to modules/text/language_model/lda_novel/README.md diff --git a/modules/text/semantic_model/lda_novel/__init__.py b/modules/text/language_model/lda_novel/__init__.py similarity index 100% rename from modules/text/semantic_model/lda_novel/__init__.py rename to modules/text/language_model/lda_novel/__init__.py diff --git a/modules/text/semantic_model/lda_novel/config.py b/modules/text/language_model/lda_novel/config.py similarity index 100% rename from modules/text/semantic_model/lda_novel/config.py rename to modules/text/language_model/lda_novel/config.py diff --git a/modules/text/semantic_model/lda_news/document.py b/modules/text/language_model/lda_novel/document.py similarity index 97% rename from modules/text/semantic_model/lda_news/document.py rename to modules/text/language_model/lda_novel/document.py index 4476230a5c9bc8d545b52386dbf00a201e59b468..98eae50511e6621db61c9700ff5e44d302667666 100644 --- a/modules/text/semantic_model/lda_news/document.py +++ b/modules/text/language_model/lda_novel/document.py @@ -5,7 +5,6 @@ class Topic(object): """Basic data structure of topic, contains topic id and corresponding probability. """ - def __init__(self, tid, prob): self.tid = tid # topic id self.prob = prob # topic probability @@ -15,7 +14,6 @@ class Token(object): """Basic storage unit of LDA documents, contains word id and corresponding topic. """ - def __init__(self, topic, id): self.topic = topic self.id = id @@ -25,7 +23,6 @@ class Sentence(object): """Basic storage unit of SentenceLDA documents, contains word ids of the sentence and its corresponding topic id. 
""" - def __init__(self, topic, tokens): self.topic = topic self.tokens = tokens @@ -34,7 +31,6 @@ class Sentence(object): class LDADoc(object): """The storage structure of LDA model's inference result. """ - def __init__(self): self._num_topics = None # Number of topics. self._num_accum = None # Number of accumulated sample rounds. @@ -120,8 +116,8 @@ class LDADoc(object): dense_dist = np.zeros(self._num_topics) if self.size() == 0: return dense_dist - dense_dist = (self._accum_topic_sum * 1.0 / self._num_accum + self._alpha) / ( - self.size() + self._alpha * self._num_topics) + dense_dist = (self._accum_topic_sum * 1.0 / self._num_accum + self._alpha) / (self.size() + + self._alpha * self._num_topics) return dense_dist def accumulate_topic_num(self): @@ -133,7 +129,6 @@ class SLDADoc(LDADoc): """Sentence LDA Document, inherited from LDADoc. Add add_sentence interface. """ - def __init__(self): super().__init__() self.__sentences = None diff --git a/modules/text/semantic_model/lda_novel/inference_engine.py b/modules/text/language_model/lda_novel/inference_engine.py similarity index 100% rename from modules/text/semantic_model/lda_novel/inference_engine.py rename to modules/text/language_model/lda_novel/inference_engine.py diff --git a/modules/text/semantic_model/lda_novel/model.py b/modules/text/language_model/lda_novel/model.py similarity index 99% rename from modules/text/semantic_model/lda_novel/model.py rename to modules/text/language_model/lda_novel/model.py index f16962be868c1cb33666a89f94843782f7e012fe..47b0c4ecfc817c046c171fc281a0ea9f7896e41d 100644 --- a/modules/text/semantic_model/lda_novel/model.py +++ b/modules/text/language_model/lda_novel/model.py @@ -11,7 +11,6 @@ from lda_novel.vocab import Vocab, WordCount class TopicModel(object): """Storage Structure of Topic model, including vocabulary and word topic count. 
""" - def __init__(self, model_dir, config): """ Args: diff --git a/modules/text/semantic_model/lda_novel/module.py b/modules/text/language_model/lda_novel/module.py similarity index 96% rename from modules/text/semantic_model/lda_novel/module.py rename to modules/text/language_model/lda_novel/module.py index ed211ac4017312bf469b9cb70d87c1c321a00a1a..c10d4489e1e9ec4f7c878813c61e9820b837d756 100644 --- a/modules/text/semantic_model/lda_novel/module.py +++ b/modules/text/language_model/lda_novel/module.py @@ -105,8 +105,9 @@ class TopicModel(hub.Module): wd = WordAndDis() wd.word = word sm = SemanticMatching() - wd.distance = sm.likelihood_based_similarity( - terms=[word], doc_topic_dist=doc_topic_dist, model=self.__engine.get_model()) + wd.distance = sm.likelihood_based_similarity(terms=[word], + doc_topic_dist=doc_topic_dist, + model=self.__engine.get_model()) items.append(wd) def take_elem(word_dis): diff --git a/modules/text/semantic_model/lda_novel/sampler.py b/modules/text/language_model/lda_novel/sampler.py similarity index 100% rename from modules/text/semantic_model/lda_novel/sampler.py rename to modules/text/language_model/lda_novel/sampler.py diff --git a/modules/text/semantic_model/lda_novel/semantic_matching.py b/modules/text/language_model/lda_novel/semantic_matching.py similarity index 100% rename from modules/text/semantic_model/lda_novel/semantic_matching.py rename to modules/text/language_model/lda_novel/semantic_matching.py diff --git a/modules/text/semantic_model/lda_webpage/tokenizer.py b/modules/text/language_model/lda_novel/tokenizer.py similarity index 99% rename from modules/text/semantic_model/lda_webpage/tokenizer.py rename to modules/text/language_model/lda_novel/tokenizer.py index 585aed885b63b0e2a2d450b77a6d018615c86b04..1d9afabc421bbe62fca462fecb50b8e1ce2da6f5 100644 --- a/modules/text/semantic_model/lda_webpage/tokenizer.py +++ b/modules/text/language_model/lda_novel/tokenizer.py @@ -7,7 +7,6 @@ from paddlehub.common.logger import logger class Tokenizer(object): """Base tokenizer class. """ - def __init__(self): pass @@ -21,7 +20,6 @@ class SimpleTokenizer(Tokenizer): Notes: This tokenizer can only recognize the words in the corresponding vocab file. """ - def __init__(self, vocab_path): super().__init__() self.__max_word_len = 0 diff --git a/modules/text/semantic_model/lda_novel/util.py b/modules/text/language_model/lda_novel/util.py similarity index 99% rename from modules/text/semantic_model/lda_novel/util.py rename to modules/text/language_model/lda_novel/util.py index fd2943084d1ed6905936400ccbcc1808e96c9e85..4b7818259b9640b785d6f8cd65053d394303ec8e 100644 --- a/modules/text/semantic_model/lda_novel/util.py +++ b/modules/text/language_model/lda_novel/util.py @@ -46,7 +46,6 @@ def rand_k(k): def timeit(f): """Return time cost of function f. 
""" - def timed(*args, **kwargs): start_time = time.time() result = f(*args, **kwargs) diff --git a/modules/text/semantic_model/lda_novel/vocab.py b/modules/text/language_model/lda_novel/vocab.py similarity index 100% rename from modules/text/semantic_model/lda_novel/vocab.py rename to modules/text/language_model/lda_novel/vocab.py diff --git a/modules/text/semantic_model/lda_novel/vose_alias.py b/modules/text/language_model/lda_novel/vose_alias.py similarity index 99% rename from modules/text/semantic_model/lda_novel/vose_alias.py rename to modules/text/language_model/lda_novel/vose_alias.py index ab9ba9084fbef7eaffd72ad13713c904ff852ef8..4bb7dbb6584f172df9ada382d1c14859a9f76a43 100644 --- a/modules/text/semantic_model/lda_novel/vose_alias.py +++ b/modules/text/language_model/lda_novel/vose_alias.py @@ -9,7 +9,6 @@ from lda_novel.util import rand, rand_k class VoseAlias(object): """Vose's Alias Method. """ - def __init__(self): self.__alias = None self.__prob = None # np.array diff --git a/modules/text/semantic_model/lda_webpage/README.md b/modules/text/language_model/lda_webpage/README.md similarity index 100% rename from modules/text/semantic_model/lda_webpage/README.md rename to modules/text/language_model/lda_webpage/README.md diff --git a/modules/text/semantic_model/lda_webpage/__init__.py b/modules/text/language_model/lda_webpage/__init__.py similarity index 100% rename from modules/text/semantic_model/lda_webpage/__init__.py rename to modules/text/language_model/lda_webpage/__init__.py diff --git a/modules/text/semantic_model/lda_webpage/config.py b/modules/text/language_model/lda_webpage/config.py similarity index 100% rename from modules/text/semantic_model/lda_webpage/config.py rename to modules/text/language_model/lda_webpage/config.py diff --git a/modules/text/semantic_model/lda_webpage/document.py b/modules/text/language_model/lda_webpage/document.py similarity index 97% rename from modules/text/semantic_model/lda_webpage/document.py rename to modules/text/language_model/lda_webpage/document.py index 4476230a5c9bc8d545b52386dbf00a201e59b468..98eae50511e6621db61c9700ff5e44d302667666 100644 --- a/modules/text/semantic_model/lda_webpage/document.py +++ b/modules/text/language_model/lda_webpage/document.py @@ -5,7 +5,6 @@ class Topic(object): """Basic data structure of topic, contains topic id and corresponding probability. """ - def __init__(self, tid, prob): self.tid = tid # topic id self.prob = prob # topic probability @@ -15,7 +14,6 @@ class Token(object): """Basic storage unit of LDA documents, contains word id and corresponding topic. """ - def __init__(self, topic, id): self.topic = topic self.id = id @@ -25,7 +23,6 @@ class Sentence(object): """Basic storage unit of SentenceLDA documents, contains word ids of the sentence and its corresponding topic id. """ - def __init__(self, topic, tokens): self.topic = topic self.tokens = tokens @@ -34,7 +31,6 @@ class Sentence(object): class LDADoc(object): """The storage structure of LDA model's inference result. """ - def __init__(self): self._num_topics = None # Number of topics. self._num_accum = None # Number of accumulated sample rounds. 
@@ -120,8 +116,8 @@ class LDADoc(object): dense_dist = np.zeros(self._num_topics) if self.size() == 0: return dense_dist - dense_dist = (self._accum_topic_sum * 1.0 / self._num_accum + self._alpha) / ( - self.size() + self._alpha * self._num_topics) + dense_dist = (self._accum_topic_sum * 1.0 / self._num_accum + self._alpha) / (self.size() + + self._alpha * self._num_topics) return dense_dist def accumulate_topic_num(self): @@ -133,7 +129,6 @@ class SLDADoc(LDADoc): """Sentence LDA Document, inherited from LDADoc. Add add_sentence interface. """ - def __init__(self): super().__init__() self.__sentences = None diff --git a/modules/text/semantic_model/lda_webpage/inference_engine.py b/modules/text/language_model/lda_webpage/inference_engine.py similarity index 100% rename from modules/text/semantic_model/lda_webpage/inference_engine.py rename to modules/text/language_model/lda_webpage/inference_engine.py diff --git a/modules/text/semantic_model/lda_webpage/model.py b/modules/text/language_model/lda_webpage/model.py similarity index 99% rename from modules/text/semantic_model/lda_webpage/model.py rename to modules/text/language_model/lda_webpage/model.py index 8c05da149a159bd160bf6cffd751bf21548a4352..58fd16e08d744b85434c81d1f1e93ccb52897985 100644 --- a/modules/text/semantic_model/lda_webpage/model.py +++ b/modules/text/language_model/lda_webpage/model.py @@ -11,7 +11,6 @@ from lda_webpage.vocab import Vocab, WordCount class TopicModel(object): """Storage Structure of Topic model, including vocabulary and word topic count. """ - def __init__(self, model_dir, config): """ Args: diff --git a/modules/text/semantic_model/lda_webpage/module.py b/modules/text/language_model/lda_webpage/module.py similarity index 96% rename from modules/text/semantic_model/lda_webpage/module.py rename to modules/text/language_model/lda_webpage/module.py index ebe1da43f731167eb74f61790805340c577f2dd1..0603e952bb78f96ca3552f0d7566bd79e2c773b4 100644 --- a/modules/text/semantic_model/lda_webpage/module.py +++ b/modules/text/language_model/lda_webpage/module.py @@ -105,8 +105,9 @@ class TopicModel(hub.Module): wd = WordAndDis() wd.word = word sm = SemanticMatching() - wd.distance = sm.likelihood_based_similarity( - terms=[word], doc_topic_dist=doc_topic_dist, model=self.__engine.get_model()) + wd.distance = sm.likelihood_based_similarity(terms=[word], + doc_topic_dist=doc_topic_dist, + model=self.__engine.get_model()) items.append(wd) def take_elem(word_dis): diff --git a/modules/text/semantic_model/lda_webpage/sampler.py b/modules/text/language_model/lda_webpage/sampler.py similarity index 100% rename from modules/text/semantic_model/lda_webpage/sampler.py rename to modules/text/language_model/lda_webpage/sampler.py diff --git a/modules/text/semantic_model/lda_webpage/semantic_matching.py b/modules/text/language_model/lda_webpage/semantic_matching.py similarity index 100% rename from modules/text/semantic_model/lda_webpage/semantic_matching.py rename to modules/text/language_model/lda_webpage/semantic_matching.py diff --git a/modules/text/semantic_model/slda_novel/tokenizer.py b/modules/text/language_model/lda_webpage/tokenizer.py similarity index 99% rename from modules/text/semantic_model/slda_novel/tokenizer.py rename to modules/text/language_model/lda_webpage/tokenizer.py index 585aed885b63b0e2a2d450b77a6d018615c86b04..1d9afabc421bbe62fca462fecb50b8e1ce2da6f5 100644 --- a/modules/text/semantic_model/slda_novel/tokenizer.py +++ b/modules/text/language_model/lda_webpage/tokenizer.py @@ -7,7 +7,6 @@ from 
paddlehub.common.logger import logger class Tokenizer(object): """Base tokenizer class. """ - def __init__(self): pass @@ -21,7 +20,6 @@ class SimpleTokenizer(Tokenizer): Notes: This tokenizer can only recognize the words in the corresponding vocab file. """ - def __init__(self, vocab_path): super().__init__() self.__max_word_len = 0 diff --git a/modules/text/semantic_model/lda_webpage/util.py b/modules/text/language_model/lda_webpage/util.py similarity index 99% rename from modules/text/semantic_model/lda_webpage/util.py rename to modules/text/language_model/lda_webpage/util.py index 09892ee7956307d50e7c8e1c4a5260fc31734403..edd5923b2658a1a5cc07684b9c5f2bd93558c65a 100644 --- a/modules/text/semantic_model/lda_webpage/util.py +++ b/modules/text/language_model/lda_webpage/util.py @@ -46,7 +46,6 @@ def rand_k(k): def timeit(f): """Return time cost of function f. """ - def timed(*args, **kwargs): start_time = time.time() result = f(*args, **kwargs) diff --git a/modules/text/semantic_model/lda_webpage/vocab.py b/modules/text/language_model/lda_webpage/vocab.py similarity index 100% rename from modules/text/semantic_model/lda_webpage/vocab.py rename to modules/text/language_model/lda_webpage/vocab.py diff --git a/modules/text/semantic_model/lda_webpage/vose_alias.py b/modules/text/language_model/lda_webpage/vose_alias.py similarity index 99% rename from modules/text/semantic_model/lda_webpage/vose_alias.py rename to modules/text/language_model/lda_webpage/vose_alias.py index f722c6927aad955f35834589666b1bc5a3185125..66ad348a5a877208530f9a8946a2d8e87cda57fc 100644 --- a/modules/text/semantic_model/lda_webpage/vose_alias.py +++ b/modules/text/language_model/lda_webpage/vose_alias.py @@ -9,7 +9,6 @@ from lda_webpage.util import rand, rand_k class VoseAlias(object): """Vose's Alias Method. 
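# ---------------------------------------------------------------------------
# Editor's aside (illustrative only, not part of this patch): the VoseAlias
# class above implements Vose's alias method, which lets the LDA sampler draw
# from a discrete distribution in O(1) after an O(n) table build. A minimal
# NumPy sketch of the technique, assuming `probs` sums to 1; the function
# names are hypothetical and differ from the module's private API.
import numpy as np


def build_alias_table(probs):
    """Build the prob/alias tables for a discrete distribution."""
    n = len(probs)
    scaled = np.asarray(probs, dtype=float) * n
    prob = np.zeros(n)
    alias = np.zeros(n, dtype=int)
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]          # probability of keeping column s
        alias[s] = l                 # otherwise redirect to its alias l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:          # numerical leftovers get probability 1
        prob[i] = 1.0
    return prob, alias


def sample_alias(prob, alias, rng=np.random):
    """Draw one index in O(1): pick a column, then keep it or take its alias."""
    i = rng.randint(len(prob))
    return i if rng.random() < prob[i] else int(alias[i])
# ---------------------------------------------------------------------------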
""" - def __init__(self): self.__alias = None self.__prob = None # np.array diff --git a/modules/text/semantic_model/rbt3/README.md b/modules/text/language_model/rbt3/README.md similarity index 100% rename from modules/text/semantic_model/rbt3/README.md rename to modules/text/language_model/rbt3/README.md diff --git a/modules/text/semantic_model/rbt3/__init__.py b/modules/text/language_model/rbt3/__init__.py similarity index 100% rename from modules/text/semantic_model/rbt3/__init__.py rename to modules/text/language_model/rbt3/__init__.py diff --git a/modules/text/semantic_model/rbt3/model/__init__.py b/modules/text/language_model/rbt3/model/__init__.py similarity index 100% rename from modules/text/semantic_model/rbt3/model/__init__.py rename to modules/text/language_model/rbt3/model/__init__.py diff --git a/modules/text/semantic_model/rbt3/model/bert.py b/modules/text/language_model/rbt3/model/bert.py similarity index 50% rename from modules/text/semantic_model/rbt3/model/bert.py rename to modules/text/language_model/rbt3/model/bert.py index 9afb90c193f8c2898d99bc55d1678ef1f62b2303..4d37cb02489ca54bfa273965334ba589e67d31ea 100644 --- a/modules/text/semantic_model/rbt3/model/bert.py +++ b/modules/text/language_model/rbt3/model/bert.py @@ -74,23 +74,23 @@ class BertModel(object): def _build_model(self, src_ids, position_ids, sentence_ids, input_mask): # padding id in vocabulary must be set to 0 - emb_out = fluid.layers.embedding( - input=src_ids, - size=[self._voc_size, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), - is_sparse=False) - position_emb_out = fluid.layers.embedding( - input=position_ids, - size=[self._max_position_seq_len, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) - - sent_emb_out = fluid.layers.embedding( - sentence_ids, - size=[self._sent_types, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) + emb_out = fluid.layers.embedding(input=src_ids, + size=[self._voc_size, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._word_emb_name, + initializer=self._param_initializer), + is_sparse=False) + position_emb_out = fluid.layers.embedding(input=position_ids, + size=[self._max_position_seq_len, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._pos_emb_name, + initializer=self._param_initializer)) + + sent_emb_out = fluid.layers.embedding(sentence_ids, + size=[self._sent_types, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._sent_emb_name, + initializer=self._param_initializer)) emb_out = emb_out + position_emb_out emb_out = emb_out + sent_emb_out @@ -105,23 +105,22 @@ class BertModel(object): n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask.stop_gradient = True - self._enc_out = encoder( - enc_input=emb_out, - attn_bias=n_head_self_attn_mask, - n_layer=self._n_layer, - n_head=self._n_head, - d_key=self._emb_size // self._n_head, - d_value=self._emb_size // self._n_head, - d_model=self._emb_size, - d_inner_hid=self._emb_size * 4, - prepostprocess_dropout=self._prepostprocess_dropout, - attention_dropout=self._attention_dropout, - relu_dropout=0, - hidden_act=self._hidden_act, - preprocess_cmd="", - postprocess_cmd="dan", - param_initializer=self._param_initializer, - name='encoder') + 
self._enc_out = encoder(enc_input=emb_out, + attn_bias=n_head_self_attn_mask, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._emb_size // self._n_head, + d_value=self._emb_size // self._n_head, + d_model=self._emb_size, + d_inner_hid=self._emb_size * 4, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + hidden_act=self._hidden_act, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer, + name='encoder') def get_sequence_output(self): return self._enc_out @@ -130,12 +129,12 @@ class BertModel(object): """Get the first feature of each sequence for classification""" next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) - next_sent_feat = fluid.layers.fc( - input=next_sent_feat, - size=self._emb_size, - act="tanh", - param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), - bias_attr="pooled_fc.b_0") + next_sent_feat = fluid.layers.fc(input=next_sent_feat, + size=self._emb_size, + act="tanh", + param_attr=fluid.ParamAttr(name="pooled_fc.w_0", + initializer=self._param_initializer), + bias_attr="pooled_fc.b_0") return next_sent_feat def get_pretraining_output(self, mask_label, mask_pos, labels): @@ -150,43 +149,45 @@ class BertModel(object): mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) # transform: fc - mask_trans_feat = fluid.layers.fc( - input=mask_feat, - size=self._emb_size, - act=self._hidden_act, - param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer), - bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) + mask_trans_feat = fluid.layers.fc(input=mask_feat, + size=self._emb_size, + act=self._hidden_act, + param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', + initializer=self._param_initializer), + bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) # transform: layer norm mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans') - mask_lm_out_bias_attr = fluid.ParamAttr( - name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0)) + mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0", + initializer=fluid.initializer.Constant(value=0.0)) if self._weight_sharing: - fc_out = fluid.layers.matmul( - x=mask_trans_feat, - y=fluid.default_main_program().global_block().var(self._word_emb_name), - transpose_y=True) - fc_out += fluid.layers.create_parameter( - shape=[self._voc_size], dtype=self._dtype, attr=mask_lm_out_bias_attr, is_bias=True) + fc_out = fluid.layers.matmul(x=mask_trans_feat, + y=fluid.default_main_program().global_block().var(self._word_emb_name), + transpose_y=True) + fc_out += fluid.layers.create_parameter(shape=[self._voc_size], + dtype=self._dtype, + attr=mask_lm_out_bias_attr, + is_bias=True) else: - fc_out = fluid.layers.fc( - input=mask_trans_feat, - size=self._voc_size, - param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), - bias_attr=mask_lm_out_bias_attr) + fc_out = fluid.layers.fc(input=mask_trans_feat, + size=self._voc_size, + param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", + initializer=self._param_initializer), + bias_attr=mask_lm_out_bias_attr) mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) - next_sent_fc_out = fluid.layers.fc( - input=next_sent_feat, - size=2, - param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", 
initializer=self._param_initializer), - bias_attr="next_sent_fc.b_0") + next_sent_fc_out = fluid.layers.fc(input=next_sent_feat, + size=2, + param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", + initializer=self._param_initializer), + bias_attr="next_sent_fc.b_0") - next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy( - logits=next_sent_fc_out, label=labels, return_softmax=True) + next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out, + label=labels, + return_softmax=True) next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels) diff --git a/modules/text/language_model/rbt3/model/transformer_encoder.py b/modules/text/language_model/rbt3/model/transformer_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..b15d838883fdad1e3432e4ea8715f2320d67929f --- /dev/null +++ b/modules/text/language_model/rbt3/model/transformer_encoder.py @@ -0,0 +1,295 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Transformer encoder.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from functools import partial + +import paddle.fluid as fluid +import paddle.fluid.layers as layers + + +def multi_head_attention(queries, + keys, + values, + attn_bias, + d_key, + d_value, + d_model, + n_head=1, + dropout_rate=0., + cache=None, + param_initializer=None, + name='multi_head_att'): + """ + Multi-Head Attention. Note that attn_bias is added to the logit before + computing softmax activiation to mask certain selected positions so that + they will not considered in attention weights. + """ + keys = queries if keys is None else keys + values = keys if values is None else values + + if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): + raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") + + def __compute_qkv(queries, keys, values, n_head, d_key, d_value): + """ + Add linear projection to queries, keys, and values. + """ + q = layers.fc(input=queries, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), + bias_attr=name + '_query_fc.b_0') + k = layers.fc(input=keys, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), + bias_attr=name + '_key_fc.b_0') + v = layers.fc(input=values, + size=d_value * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), + bias_attr=name + '_value_fc.b_0') + return q, k, v + + def __split_heads(x, n_head): + """ + Reshape the last dimension of inpunt tensor x so that it becomes two + dimensions and then transpose. 
Specifically, input a tensor with shape + [bs, max_sequence_length, n_head * hidden_dim] then output a tensor + with shape [bs, n_head, max_sequence_length, hidden_dim]. + """ + hidden_size = x.shape[-1] + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) + + # permuate the dimensions into: + # [batch_size, n_head, max_sequence_len, hidden_size_per_head] + return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) + + def __combine_heads(x): + """ + Transpose and then reshape the last two dimensions of inpunt tensor x + so that it becomes one dimension, which is reverse to __split_heads. + """ + if len(x.shape) == 3: return x + if len(x.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + + trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) + + def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): + """ + Scaled Dot-Product Attention + """ + scaled_q = layers.scale(x=q, scale=d_key**-0.5) + product = layers.matmul(x=scaled_q, y=k, transpose_y=True) + if attn_bias: + product += attn_bias + weights = layers.softmax(product) + if dropout_rate: + weights = layers.dropout(weights, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.matmul(weights, v) + return out + + q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) + + if cache is not None: # use cache and concat time steps + # Since the inplace reshape in __split_heads changes the shape of k and + # v, which is the cache input for next time step, reshape the cache + # input from the previous time step first. + k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) + v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) + + q = __split_heads(q, n_head) + k = __split_heads(k, n_head) + v = __split_heads(v, n_head) + + ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) + + out = __combine_heads(ctx_multiheads) + + # Project back to the model size. + proj_out = layers.fc(input=out, + size=d_model, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), + bias_attr=name + '_output_fc.b_0') + return proj_out + + +def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): + """ + Position-wise Feed-Forward Networks. + This module consists of two linear transformations with a ReLU activation + in between, which is applied to each position separately and identically. 
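# ---------------------------------------------------------------------------
# Editor's aside (illustrative only, not part of this patch): the
# positionwise_feed_forward docstring above describes two linear maps with a
# ReLU in between, applied independently at every position. A minimal NumPy
# sketch under assumed shapes x: [batch, seq_len, d_model],
# w1: [d_model, d_inner_hid], w2: [d_inner_hid, d_model]; dropout is omitted.
import numpy as np


def positionwise_ffn_sketch(x, w1, b1, w2, b2):
    hidden = np.maximum(x @ w1 + b1, 0.0)   # first linear + ReLU, per position
    return hidden @ w2 + b2                 # project back to the model width
# ---------------------------------------------------------------------------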
+ """ + hidden = layers.fc(input=x, + size=d_inner_hid, + num_flatten_dims=2, + act=hidden_act, + param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), + bias_attr=name + '_fc_0.b_0') + if dropout_rate: + hidden = layers.dropout(hidden, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.fc(input=hidden, + size=d_hid, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), + bias_attr=name + '_fc_1.b_0') + return out + + +def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): + """ + Add residual connection, layer normalization and droput to the out tensor + optionally according to the value of process_cmd. + This will be used before or after multi-head attention and position-wise + feed-forward networks. + """ + for cmd in process_cmd: + if cmd == "a": # add residual connection + out = out + prev_out if prev_out else out + elif cmd == "n": # add layer normalization + out_dtype = out.dtype + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float32") + out = layers.layer_norm(out, + begin_norm_axis=len(out.shape) - 1, + param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', + initializer=fluid.initializer.Constant(0.))) + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float16") + elif cmd == "d": # add dropout + if dropout_rate: + out = layers.dropout(out, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + return out + + +pre_process_layer = partial(pre_post_process_layer, None) +post_process_layer = pre_post_process_layer + + +def encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """The encoder layers that can be stacked to form a deep encoder. + This module consits of a multi-head (self) attention followed by + position-wise feed-forward networks and both the two components companied + with the post_process_layer to add residual connection, layer normalization + and droput. 
+ """ + attn_output = multi_head_attention(pre_process_layer(enc_input, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_att'), + None, + None, + attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + param_initializer=param_initializer, + name=name + '_multi_head_att') + attn_output = post_process_layer(enc_input, + attn_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_att') + ffd_output = positionwise_feed_forward(pre_process_layer(attn_output, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_ffn'), + d_inner_hid, + d_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + '_ffn') + return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') + + +def encoder(enc_input, + attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """ + The encoder is composed of a stack of identical layers returned by calling + encoder_layer. + """ + for i in range(n_layer): + enc_output = encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_layer_' + str(i)) + enc_input = enc_output + enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") + + return enc_output diff --git a/modules/text/semantic_model/rbt3/module.py b/modules/text/language_model/rbt3/module.py similarity index 89% rename from modules/text/semantic_model/rbt3/module.py rename to modules/text/language_model/rbt3/module.py index 60eadf060d568c1fec6bb4cce4a7fbaac39045ed..b35e0cd890bd2238e86c9d92870c4a6224728efe 100644 --- a/modules/text/semantic_model/rbt3/module.py +++ b/modules/text/language_model/rbt3/module.py @@ -58,13 +58,12 @@ class BertWwm(TransformerModule): pooled_output (tensor): sentence-level output for classification task. sequence_output (tensor): token-level output for sequence task. 
""" - bert = BertModel( - src_ids=input_ids, - position_ids=position_ids, - sentence_ids=segment_ids, - input_mask=input_mask, - config=self.bert_config, - use_fp16=False) + bert = BertModel(src_ids=input_ids, + position_ids=position_ids, + sentence_ids=segment_ids, + input_mask=input_mask, + config=self.bert_config, + use_fp16=False) pooled_output = bert.get_pooled_output() sequence_output = bert.get_sequence_output() return pooled_output, sequence_output diff --git a/modules/text/semantic_model/rbtl3/README.md b/modules/text/language_model/rbtl3/README.md similarity index 100% rename from modules/text/semantic_model/rbtl3/README.md rename to modules/text/language_model/rbtl3/README.md diff --git a/modules/text/semantic_model/rbtl3/__init__.py b/modules/text/language_model/rbtl3/__init__.py similarity index 100% rename from modules/text/semantic_model/rbtl3/__init__.py rename to modules/text/language_model/rbtl3/__init__.py diff --git a/modules/text/semantic_model/rbtl3/model/__init__.py b/modules/text/language_model/rbtl3/model/__init__.py similarity index 100% rename from modules/text/semantic_model/rbtl3/model/__init__.py rename to modules/text/language_model/rbtl3/model/__init__.py diff --git a/modules/text/semantic_model/rbtl3/model/bert.py b/modules/text/language_model/rbtl3/model/bert.py similarity index 50% rename from modules/text/semantic_model/rbtl3/model/bert.py rename to modules/text/language_model/rbtl3/model/bert.py index e61be4035cf8379f02dc588eb2420f9699449629..8c27ad344ac2d84d563854b898ca8a7d675507d2 100644 --- a/modules/text/semantic_model/rbtl3/model/bert.py +++ b/modules/text/language_model/rbtl3/model/bert.py @@ -74,23 +74,23 @@ class BertModel(object): def _build_model(self, src_ids, position_ids, sentence_ids, input_mask): # padding id in vocabulary must be set to 0 - emb_out = fluid.layers.embedding( - input=src_ids, - size=[self._voc_size, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._word_emb_name, initializer=self._param_initializer), - is_sparse=False) - position_emb_out = fluid.layers.embedding( - input=position_ids, - size=[self._max_position_seq_len, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._pos_emb_name, initializer=self._param_initializer)) - - sent_emb_out = fluid.layers.embedding( - sentence_ids, - size=[self._sent_types, self._emb_size], - dtype=self._dtype, - param_attr=fluid.ParamAttr(name=self._sent_emb_name, initializer=self._param_initializer)) + emb_out = fluid.layers.embedding(input=src_ids, + size=[self._voc_size, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._word_emb_name, + initializer=self._param_initializer), + is_sparse=False) + position_emb_out = fluid.layers.embedding(input=position_ids, + size=[self._max_position_seq_len, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._pos_emb_name, + initializer=self._param_initializer)) + + sent_emb_out = fluid.layers.embedding(sentence_ids, + size=[self._sent_types, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr(name=self._sent_emb_name, + initializer=self._param_initializer)) emb_out = emb_out + position_emb_out emb_out = emb_out + sent_emb_out @@ -105,23 +105,22 @@ class BertModel(object): n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask.stop_gradient = True - self._enc_out = encoder( - enc_input=emb_out, - attn_bias=n_head_self_attn_mask, - n_layer=self._n_layer, - n_head=self._n_head, - 
d_key=self._emb_size // self._n_head, - d_value=self._emb_size // self._n_head, - d_model=self._emb_size, - d_inner_hid=self._emb_size * 4, - prepostprocess_dropout=self._prepostprocess_dropout, - attention_dropout=self._attention_dropout, - relu_dropout=0, - hidden_act=self._hidden_act, - preprocess_cmd="", - postprocess_cmd="dan", - param_initializer=self._param_initializer, - name='encoder') + self._enc_out = encoder(enc_input=emb_out, + attn_bias=n_head_self_attn_mask, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._emb_size // self._n_head, + d_value=self._emb_size // self._n_head, + d_model=self._emb_size, + d_inner_hid=self._emb_size * 4, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + hidden_act=self._hidden_act, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer, + name='encoder') def get_sequence_output(self): return self._enc_out @@ -130,12 +129,12 @@ class BertModel(object): """Get the first feature of each sequence for classification""" next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) - next_sent_feat = fluid.layers.fc( - input=next_sent_feat, - size=self._emb_size, - act="tanh", - param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), - bias_attr="pooled_fc.b_0") + next_sent_feat = fluid.layers.fc(input=next_sent_feat, + size=self._emb_size, + act="tanh", + param_attr=fluid.ParamAttr(name="pooled_fc.w_0", + initializer=self._param_initializer), + bias_attr="pooled_fc.b_0") return next_sent_feat def get_pretraining_output(self, mask_label, mask_pos, labels): @@ -150,43 +149,45 @@ class BertModel(object): mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) # transform: fc - mask_trans_feat = fluid.layers.fc( - input=mask_feat, - size=self._emb_size, - act=self._hidden_act, - param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', initializer=self._param_initializer), - bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) + mask_trans_feat = fluid.layers.fc(input=mask_feat, + size=self._emb_size, + act=self._hidden_act, + param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0', + initializer=self._param_initializer), + bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) # transform: layer norm mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans') - mask_lm_out_bias_attr = fluid.ParamAttr( - name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0)) + mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0", + initializer=fluid.initializer.Constant(value=0.0)) if self._weight_sharing: - fc_out = fluid.layers.matmul( - x=mask_trans_feat, - y=fluid.default_main_program().global_block().var(self._word_emb_name), - transpose_y=True) - fc_out += fluid.layers.create_parameter( - shape=[self._voc_size], dtype=self._dtype, attr=mask_lm_out_bias_attr, is_bias=True) + fc_out = fluid.layers.matmul(x=mask_trans_feat, + y=fluid.default_main_program().global_block().var(self._word_emb_name), + transpose_y=True) + fc_out += fluid.layers.create_parameter(shape=[self._voc_size], + dtype=self._dtype, + attr=mask_lm_out_bias_attr, + is_bias=True) else: - fc_out = fluid.layers.fc( - input=mask_trans_feat, - size=self._voc_size, - param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), - bias_attr=mask_lm_out_bias_attr) + fc_out = fluid.layers.fc(input=mask_trans_feat, + size=self._voc_size, + 
param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", + initializer=self._param_initializer), + bias_attr=mask_lm_out_bias_attr) mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) - next_sent_fc_out = fluid.layers.fc( - input=next_sent_feat, - size=2, - param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", initializer=self._param_initializer), - bias_attr="next_sent_fc.b_0") + next_sent_fc_out = fluid.layers.fc(input=next_sent_feat, + size=2, + param_attr=fluid.ParamAttr(name="next_sent_fc.w_0", + initializer=self._param_initializer), + bias_attr="next_sent_fc.b_0") - next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy( - logits=next_sent_fc_out, label=labels, return_softmax=True) + next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out, + label=labels, + return_softmax=True) next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels) diff --git a/modules/text/language_model/rbtl3/model/transformer_encoder.py b/modules/text/language_model/rbtl3/model/transformer_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..b15d838883fdad1e3432e4ea8715f2320d67929f --- /dev/null +++ b/modules/text/language_model/rbtl3/model/transformer_encoder.py @@ -0,0 +1,295 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Transformer encoder.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from functools import partial + +import paddle.fluid as fluid +import paddle.fluid.layers as layers + + +def multi_head_attention(queries, + keys, + values, + attn_bias, + d_key, + d_value, + d_model, + n_head=1, + dropout_rate=0., + cache=None, + param_initializer=None, + name='multi_head_att'): + """ + Multi-Head Attention. Note that attn_bias is added to the logit before + computing softmax activiation to mask certain selected positions so that + they will not considered in attention weights. + """ + keys = queries if keys is None else keys + values = keys if values is None else values + + if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): + raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") + + def __compute_qkv(queries, keys, values, n_head, d_key, d_value): + """ + Add linear projection to queries, keys, and values. 
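# ---------------------------------------------------------------------------
# Editor's aside (illustrative only, not part of this patch): __compute_qkv
# above adds the linear projections to queries, keys and values, and the
# scaled_dot_product_attention helper defined just below combines them. A
# minimal NumPy sketch under assumed shapes q, k, v:
# [batch, n_head, seq_len, d_key]; the additive mask plays the role of
# attn_bias and dropout is omitted.
import numpy as np


def scaled_dot_product_attention_sketch(q, k, v, attn_bias=None):
    d_key = q.shape[-1]
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_key)   # [b, h, seq, seq]
    if attn_bias is not None:
        scores = scores + attn_bias                          # mask via large negatives
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v                                        # [b, h, seq, d_key]
# ---------------------------------------------------------------------------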
+ """ + q = layers.fc(input=queries, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), + bias_attr=name + '_query_fc.b_0') + k = layers.fc(input=keys, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), + bias_attr=name + '_key_fc.b_0') + v = layers.fc(input=values, + size=d_value * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), + bias_attr=name + '_value_fc.b_0') + return q, k, v + + def __split_heads(x, n_head): + """ + Reshape the last dimension of inpunt tensor x so that it becomes two + dimensions and then transpose. Specifically, input a tensor with shape + [bs, max_sequence_length, n_head * hidden_dim] then output a tensor + with shape [bs, n_head, max_sequence_length, hidden_dim]. + """ + hidden_size = x.shape[-1] + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) + + # permuate the dimensions into: + # [batch_size, n_head, max_sequence_len, hidden_size_per_head] + return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) + + def __combine_heads(x): + """ + Transpose and then reshape the last two dimensions of inpunt tensor x + so that it becomes one dimension, which is reverse to __split_heads. + """ + if len(x.shape) == 3: return x + if len(x.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + + trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) + + def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): + """ + Scaled Dot-Product Attention + """ + scaled_q = layers.scale(x=q, scale=d_key**-0.5) + product = layers.matmul(x=scaled_q, y=k, transpose_y=True) + if attn_bias: + product += attn_bias + weights = layers.softmax(product) + if dropout_rate: + weights = layers.dropout(weights, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.matmul(weights, v) + return out + + q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) + + if cache is not None: # use cache and concat time steps + # Since the inplace reshape in __split_heads changes the shape of k and + # v, which is the cache input for next time step, reshape the cache + # input from the previous time step first. + k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) + v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) + + q = __split_heads(q, n_head) + k = __split_heads(k, n_head) + v = __split_heads(v, n_head) + + ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) + + out = __combine_heads(ctx_multiheads) + + # Project back to the model size. 
+ proj_out = layers.fc(input=out, + size=d_model, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), + bias_attr=name + '_output_fc.b_0') + return proj_out + + +def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): + """ + Position-wise Feed-Forward Networks. + This module consists of two linear transformations with a ReLU activation + in between, which is applied to each position separately and identically. + """ + hidden = layers.fc(input=x, + size=d_inner_hid, + num_flatten_dims=2, + act=hidden_act, + param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), + bias_attr=name + '_fc_0.b_0') + if dropout_rate: + hidden = layers.dropout(hidden, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.fc(input=hidden, + size=d_hid, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), + bias_attr=name + '_fc_1.b_0') + return out + + +def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): + """ + Add residual connection, layer normalization and droput to the out tensor + optionally according to the value of process_cmd. + This will be used before or after multi-head attention and position-wise + feed-forward networks. + """ + for cmd in process_cmd: + if cmd == "a": # add residual connection + out = out + prev_out if prev_out else out + elif cmd == "n": # add layer normalization + out_dtype = out.dtype + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float32") + out = layers.layer_norm(out, + begin_norm_axis=len(out.shape) - 1, + param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', + initializer=fluid.initializer.Constant(0.))) + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float16") + elif cmd == "d": # add dropout + if dropout_rate: + out = layers.dropout(out, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + return out + + +pre_process_layer = partial(pre_post_process_layer, None) +post_process_layer = pre_post_process_layer + + +def encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """The encoder layers that can be stacked to form a deep encoder. + This module consits of a multi-head (self) attention followed by + position-wise feed-forward networks and both the two components companied + with the post_process_layer to add residual connection, layer normalization + and droput. 
+ """ + attn_output = multi_head_attention(pre_process_layer(enc_input, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_att'), + None, + None, + attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + param_initializer=param_initializer, + name=name + '_multi_head_att') + attn_output = post_process_layer(enc_input, + attn_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_att') + ffd_output = positionwise_feed_forward(pre_process_layer(attn_output, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_ffn'), + d_inner_hid, + d_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + '_ffn') + return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') + + +def encoder(enc_input, + attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """ + The encoder is composed of a stack of identical layers returned by calling + encoder_layer. + """ + for i in range(n_layer): + enc_output = encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_layer_' + str(i)) + enc_input = enc_output + enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") + + return enc_output diff --git a/modules/text/semantic_model/rbtl3/module.py b/modules/text/language_model/rbtl3/module.py similarity index 89% rename from modules/text/semantic_model/rbtl3/module.py rename to modules/text/language_model/rbtl3/module.py index b6cc1270fd6771aefcd75a0fc4bb7ad9baf7d04c..a60c30a4cb179f517614a72d786c178428740b44 100644 --- a/modules/text/semantic_model/rbtl3/module.py +++ b/modules/text/language_model/rbtl3/module.py @@ -58,13 +58,12 @@ class BertWwm(TransformerModule): pooled_output (tensor): sentence-level output for classification task. sequence_output (tensor): token-level output for sequence task. 
""" - bert = BertModel( - src_ids=input_ids, - position_ids=position_ids, - sentence_ids=segment_ids, - input_mask=input_mask, - config=self.bert_config, - use_fp16=False) + bert = BertModel(src_ids=input_ids, + position_ids=position_ids, + sentence_ids=segment_ids, + input_mask=input_mask, + config=self.bert_config, + use_fp16=False) pooled_output = bert.get_pooled_output() sequence_output = bert.get_sequence_output() return pooled_output, sequence_output diff --git a/modules/text/semantic_model/simnet_bow/README.md b/modules/text/language_model/simnet_bow/README.md similarity index 100% rename from modules/text/semantic_model/simnet_bow/README.md rename to modules/text/language_model/simnet_bow/README.md diff --git a/modules/text/semantic_model/simnet_bow/__init__.py b/modules/text/language_model/simnet_bow/__init__.py similarity index 100% rename from modules/text/semantic_model/simnet_bow/__init__.py rename to modules/text/language_model/simnet_bow/__init__.py diff --git a/modules/text/semantic_model/simnet_bow/assets/params.txt b/modules/text/language_model/simnet_bow/assets/params.txt similarity index 100% rename from modules/text/semantic_model/simnet_bow/assets/params.txt rename to modules/text/language_model/simnet_bow/assets/params.txt diff --git a/modules/text/semantic_model/simnet_bow/assets/vocab.txt b/modules/text/language_model/simnet_bow/assets/vocab.txt similarity index 100% rename from modules/text/semantic_model/simnet_bow/assets/vocab.txt rename to modules/text/language_model/simnet_bow/assets/vocab.txt diff --git a/modules/text/semantic_model/simnet_bow/module.py b/modules/text/language_model/simnet_bow/module.py similarity index 87% rename from modules/text/semantic_model/simnet_bow/module.py rename to modules/text/language_model/simnet_bow/module.py index ef4a49935ff0c61c9b144d4d1fd31dc6c1ce0306..51681d3398f486ca2aa60e152796f087cc2b551e 100644 --- a/modules/text/semantic_model/simnet_bow/module.py +++ b/modules/text/language_model/simnet_bow/module.py @@ -29,13 +29,12 @@ class DataFormatError(Exception): self.args = args -@moduleinfo( - name="simnet_bow", - version="1.2.0", - summary="Baidu's open-source similarity network model based on bow_pairwise.", - author="baidu-nlp", - author_email="", - type="nlp/sentiment_analysis") +@moduleinfo(name="simnet_bow", + version="1.2.0", + summary="Baidu's open-source similarity network model based on bow_pairwise.", + author="baidu-nlp", + author_email="", + type="nlp/sentiment_analysis") class SimnetBow(hub.Module): def _initialize(self): """ @@ -107,42 +106,40 @@ class SimnetBow(hub.Module): seq_len_used = fluid.layers.squeeze(seq_len, axes=[1]) # Add embedding layer. 
- w_param_attrs = fluid.ParamAttr( - name="emb", initializer=fluid.initializer.TruncatedNormal(scale=0.02), trainable=trainable) + w_param_attrs = fluid.ParamAttr(name="emb", + initializer=fluid.initializer.TruncatedNormal(scale=0.02), + trainable=trainable) dict_dim = 500002 - emb_1 = fluid.layers.embedding( - input=text_1, - size=[dict_dim, 128], - is_sparse=True, - padding_idx=dict_dim - 1, - dtype='float32', - param_attr=w_param_attrs) + emb_1 = fluid.layers.embedding(input=text_1, + size=[dict_dim, 128], + is_sparse=True, + padding_idx=dict_dim - 1, + dtype='float32', + param_attr=w_param_attrs) emb_1_name = emb_1.name data_list = [text_1] emb_name_list = [emb_1_name] if num_slots > 1: text_2 = fluid.data(name='text_2', shape=[-1, max_seq_len], dtype='int64', lod_level=0) - emb_2 = fluid.embedding( - input=text_2, - size=[dict_dim, 128], - is_sparse=True, - padding_idx=dict_dim - 1, - dtype='float32', - param_attr=w_param_attrs) + emb_2 = fluid.embedding(input=text_2, + size=[dict_dim, 128], + is_sparse=True, + padding_idx=dict_dim - 1, + dtype='float32', + param_attr=w_param_attrs) emb_2_name = emb_2.name data_list.append(text_2) emb_name_list.append(emb_2_name) if num_slots > 2: text_3 = fluid.data(name='text_3', shape=[-1, max_seq_len], dtype='int64', lod_level=0) - emb_3 = fluid.embedding( - input=text_3, - size=[dict_dim, 128], - is_sparse=True, - padding_idx=dict_dim - 1, - dtype='float32', - param_attr=w_param_attrs) + emb_3 = fluid.embedding(input=text_3, + size=[dict_dim, 128], + is_sparse=True, + padding_idx=dict_dim - 1, + dtype='float32', + param_attr=w_param_attrs) emb_3_name = emb_3.name data_list.append(text_3) emb_name_list.append(emb_3_name) @@ -298,8 +295,10 @@ class SimnetBow(hub.Module): """ Run as a command """ - self.parser = argparse.ArgumentParser( - description="Run the simnet_bow module.", prog='hub run simnet_bow', usage='%(prog)s', add_help=True) + self.parser = argparse.ArgumentParser(description="Run the simnet_bow module.", + prog='hub run simnet_bow', + usage='%(prog)s', + add_help=True) self.arg_input_group = self.parser.add_argument_group(title="Input options", description="Input data. 
Required") self.arg_config_group = self.parser.add_argument_group( @@ -324,8 +323,10 @@ class SimnetBow(hub.Module): """ Add the command config options """ - self.arg_config_group.add_argument( - '--use_gpu', type=ast.literal_eval, default=False, help="whether use GPU for prediction") + self.arg_config_group.add_argument('--use_gpu', + type=ast.literal_eval, + default=False, + help="whether use GPU for prediction") self.arg_config_group.add_argument('--batch_size', type=int, default=1, help="batch size for prediction") diff --git a/modules/text/semantic_model/simnet_bow/processor.py b/modules/text/language_model/simnet_bow/processor.py similarity index 100% rename from modules/text/semantic_model/simnet_bow/processor.py rename to modules/text/language_model/simnet_bow/processor.py diff --git a/modules/text/semantic_model/slda_news/README.md b/modules/text/language_model/slda_news/README.md similarity index 100% rename from modules/text/semantic_model/slda_news/README.md rename to modules/text/language_model/slda_news/README.md diff --git a/modules/text/semantic_model/slda_news/__init__.py b/modules/text/language_model/slda_news/__init__.py similarity index 100% rename from modules/text/semantic_model/slda_news/__init__.py rename to modules/text/language_model/slda_news/__init__.py diff --git a/modules/text/semantic_model/slda_news/config.py b/modules/text/language_model/slda_news/config.py similarity index 100% rename from modules/text/semantic_model/slda_news/config.py rename to modules/text/language_model/slda_news/config.py diff --git a/modules/text/semantic_model/lda_novel/document.py b/modules/text/language_model/slda_news/document.py similarity index 97% rename from modules/text/semantic_model/lda_novel/document.py rename to modules/text/language_model/slda_news/document.py index 4476230a5c9bc8d545b52386dbf00a201e59b468..98eae50511e6621db61c9700ff5e44d302667666 100644 --- a/modules/text/semantic_model/lda_novel/document.py +++ b/modules/text/language_model/slda_news/document.py @@ -5,7 +5,6 @@ class Topic(object): """Basic data structure of topic, contains topic id and corresponding probability. """ - def __init__(self, tid, prob): self.tid = tid # topic id self.prob = prob # topic probability @@ -15,7 +14,6 @@ class Token(object): """Basic storage unit of LDA documents, contains word id and corresponding topic. """ - def __init__(self, topic, id): self.topic = topic self.id = id @@ -25,7 +23,6 @@ class Sentence(object): """Basic storage unit of SentenceLDA documents, contains word ids of the sentence and its corresponding topic id. """ - def __init__(self, topic, tokens): self.topic = topic self.tokens = tokens @@ -34,7 +31,6 @@ class Sentence(object): class LDADoc(object): """The storage structure of LDA model's inference result. """ - def __init__(self): self._num_topics = None # Number of topics. self._num_accum = None # Number of accumulated sample rounds. @@ -120,8 +116,8 @@ class LDADoc(object): dense_dist = np.zeros(self._num_topics) if self.size() == 0: return dense_dist - dense_dist = (self._accum_topic_sum * 1.0 / self._num_accum + self._alpha) / ( - self.size() + self._alpha * self._num_topics) + dense_dist = (self._accum_topic_sum * 1.0 / self._num_accum + self._alpha) / (self.size() + + self._alpha * self._num_topics) return dense_dist def accumulate_topic_num(self): @@ -133,7 +129,6 @@ class SLDADoc(LDADoc): """Sentence LDA Document, inherited from LDADoc. Add add_sentence interface. 
""" - def __init__(self): super().__init__() self.__sentences = None diff --git a/modules/text/semantic_model/slda_news/inference_engine.py b/modules/text/language_model/slda_news/inference_engine.py similarity index 100% rename from modules/text/semantic_model/slda_news/inference_engine.py rename to modules/text/language_model/slda_news/inference_engine.py diff --git a/modules/text/semantic_model/slda_news/model.py b/modules/text/language_model/slda_news/model.py similarity index 99% rename from modules/text/semantic_model/slda_news/model.py rename to modules/text/language_model/slda_news/model.py index f63ca92e0a6c63f44f2e0281d8382dd19e394cd8..23f030ea574cc298bb7a771386427e7522a39d38 100644 --- a/modules/text/semantic_model/slda_news/model.py +++ b/modules/text/language_model/slda_news/model.py @@ -11,7 +11,6 @@ from slda_news.vocab import Vocab, WordCount class TopicModel(object): """Storage Structure of Topic model, including vocabulary and word topic count. """ - def __init__(self, model_dir, config): """ Args: diff --git a/modules/text/semantic_model/slda_news/module.py b/modules/text/language_model/slda_news/module.py similarity index 100% rename from modules/text/semantic_model/slda_news/module.py rename to modules/text/language_model/slda_news/module.py diff --git a/modules/text/semantic_model/slda_news/sampler.py b/modules/text/language_model/slda_news/sampler.py similarity index 100% rename from modules/text/semantic_model/slda_news/sampler.py rename to modules/text/language_model/slda_news/sampler.py diff --git a/modules/text/semantic_model/slda_news/semantic_matching.py b/modules/text/language_model/slda_news/semantic_matching.py similarity index 100% rename from modules/text/semantic_model/slda_news/semantic_matching.py rename to modules/text/language_model/slda_news/semantic_matching.py diff --git a/modules/text/semantic_model/lda_novel/tokenizer.py b/modules/text/language_model/slda_news/tokenizer.py similarity index 99% rename from modules/text/semantic_model/lda_novel/tokenizer.py rename to modules/text/language_model/slda_news/tokenizer.py index 585aed885b63b0e2a2d450b77a6d018615c86b04..1d9afabc421bbe62fca462fecb50b8e1ce2da6f5 100644 --- a/modules/text/semantic_model/lda_novel/tokenizer.py +++ b/modules/text/language_model/slda_news/tokenizer.py @@ -7,7 +7,6 @@ from paddlehub.common.logger import logger class Tokenizer(object): """Base tokenizer class. """ - def __init__(self): pass @@ -21,7 +20,6 @@ class SimpleTokenizer(Tokenizer): Notes: This tokenizer can only recognize the words in the corresponding vocab file. """ - def __init__(self, vocab_path): super().__init__() self.__max_word_len = 0 diff --git a/modules/text/semantic_model/slda_news/util.py b/modules/text/language_model/slda_news/util.py similarity index 99% rename from modules/text/semantic_model/slda_news/util.py rename to modules/text/language_model/slda_news/util.py index b1f01135390ce9964b9565c1db9cb5c71d8fdbcd..8a241056adbf7609bac489c2cabb7634ddc2fc7a 100644 --- a/modules/text/semantic_model/slda_news/util.py +++ b/modules/text/language_model/slda_news/util.py @@ -46,7 +46,6 @@ def rand_k(k): def timeit(f): """Return time cost of function f. 
""" - def timed(*args, **kwargs): start_time = time.time() result = f(*args, **kwargs) diff --git a/modules/text/semantic_model/slda_news/vocab.py b/modules/text/language_model/slda_news/vocab.py similarity index 100% rename from modules/text/semantic_model/slda_news/vocab.py rename to modules/text/language_model/slda_news/vocab.py diff --git a/modules/text/semantic_model/slda_news/vose_alias.py b/modules/text/language_model/slda_news/vose_alias.py similarity index 99% rename from modules/text/semantic_model/slda_news/vose_alias.py rename to modules/text/language_model/slda_news/vose_alias.py index 4eae586da1fa6fff4adf9d73f5af996c0b80761d..702dfa22eead290edee7b71b126b796bc738e5b9 100644 --- a/modules/text/semantic_model/slda_news/vose_alias.py +++ b/modules/text/language_model/slda_news/vose_alias.py @@ -9,7 +9,6 @@ from slda_news.util import rand, rand_k class VoseAlias(object): """Vose's Alias Method. """ - def __init__(self): self.__alias = None self.__prob = None # np.array diff --git a/modules/text/semantic_model/slda_novel/README.md b/modules/text/language_model/slda_novel/README.md similarity index 100% rename from modules/text/semantic_model/slda_novel/README.md rename to modules/text/language_model/slda_novel/README.md diff --git a/modules/text/semantic_model/slda_novel/__init__.py b/modules/text/language_model/slda_novel/__init__.py similarity index 100% rename from modules/text/semantic_model/slda_novel/__init__.py rename to modules/text/language_model/slda_novel/__init__.py diff --git a/modules/text/semantic_model/slda_novel/config.py b/modules/text/language_model/slda_novel/config.py similarity index 100% rename from modules/text/semantic_model/slda_novel/config.py rename to modules/text/language_model/slda_novel/config.py diff --git a/modules/text/language_model/slda_novel/document.py b/modules/text/language_model/slda_novel/document.py new file mode 100644 index 0000000000000000000000000000000000000000..98eae50511e6621db61c9700ff5e44d302667666 --- /dev/null +++ b/modules/text/language_model/slda_novel/document.py @@ -0,0 +1,171 @@ +import numpy as np + + +class Topic(object): + """Basic data structure of topic, contains topic id and + corresponding probability. + """ + def __init__(self, tid, prob): + self.tid = tid # topic id + self.prob = prob # topic probability + + +class Token(object): + """Basic storage unit of LDA documents, contains word id + and corresponding topic. + """ + def __init__(self, topic, id): + self.topic = topic + self.id = id + + +class Sentence(object): + """Basic storage unit of SentenceLDA documents, contains word ids + of the sentence and its corresponding topic id. + """ + def __init__(self, topic, tokens): + self.topic = topic + self.tokens = tokens + + +class LDADoc(object): + """The storage structure of LDA model's inference result. + """ + def __init__(self): + self._num_topics = None # Number of topics. + self._num_accum = None # Number of accumulated sample rounds. + self._alpha = None # Document prior parameter. + self._tokens = None # Storage structure of inference results. + self._topic_sum = None # Document's topic sum in one round samples. + self._accum_topic_sum = None # Accumulated results of topic sum. + + def init(self, num_topics): + """Initialize the LDADoc according to num_topics. + """ + self._num_topics = num_topics + self._num_accum = 0 + self._tokens = [] + self._topic_sum = np.zeros(self._num_topics) + self._accum_topic_sum = np.zeros(self._num_topics) + + def add_token(self, token): + """Add new word to current LDADoc. 
+ Arg: + token: Token class object. + """ + assert token.topic >= 0, "Topic %d out of range!" % token.topic + assert token.topic < self._num_topics, "Topic %d out of range!" % token.topic + self._tokens.append(token) + self._topic_sum[token.topic] += 1 + + def token(self, index): + return self._tokens[index] + + def set_topic(self, index, new_topic): + """Set the index word's topic to new_topic, and update the corresponding + topic distribution. + """ + assert new_topic >= 0, "Topic %d out of range!" % new_topic + assert new_topic < self._num_topics, "Topic %d out of range!" % new_topic + old_topic = self._tokens[index].topic + if new_topic == old_topic: + return + self._tokens[index].topic = new_topic + self._topic_sum[old_topic] -= 1 + self._topic_sum[new_topic] += 1 + + def set_alpha(self, alpha): + self._alpha = alpha + + def size(self): + """Return number of words in LDADoc. + """ + return len(self._tokens) + + def topic_sum(self, topic_id): + return self._topic_sum[topic_id] + + def sparse_topic_dist(self, sort=True): + """Return the topic distribution of documents in sparse format. + By default, it is sorted according to the topic probability + under the descending order. + """ + topic_dist = [] + sum_ = np.sum(self._accum_topic_sum) + if sum_ == 0: + return topic_dist + for i in range(0, self._num_topics): + if self._accum_topic_sum[i] == 0: + continue + topic_dist.append(Topic(i, self._accum_topic_sum[i] * 1.0 / sum_)) + if sort: + + def take_elem(topic): + return topic.prob + + topic_dist.sort(key=take_elem, reverse=True) + if topic_dist is None: + topic_dist = [] + + return topic_dist + + def dense_topic_dist(self): + """Return the distribution of document topics in dense format, + taking into account the prior parameter alpha. + """ + dense_dist = np.zeros(self._num_topics) + if self.size() == 0: + return dense_dist + dense_dist = (self._accum_topic_sum * 1.0 / self._num_accum + self._alpha) / (self.size() + + self._alpha * self._num_topics) + return dense_dist + + def accumulate_topic_num(self): + self._accum_topic_sum += self._topic_sum + self._num_accum += 1 + + +class SLDADoc(LDADoc): + """Sentence LDA Document, inherited from LDADoc. + Add add_sentence interface. + """ + def __init__(self): + super().__init__() + self.__sentences = None + + def init(self, num_topics): + """Initialize the SLDADoc according to num_topics. + """ + self._num_topics = num_topics + self.__sentences = [] + self._num_accum = 0 + self._topic_sum = np.zeros(self._num_topics) + self._accum_topic_sum = np.zeros(self._num_topics) + + def add_sentence(self, sent): + """Add new sentence to current SLDADoc. + Arg: + sent: Sentence class object. + """ + assert sent.topic >= 0, "Topic %d out of range!" % (sent.topic) + assert sent.topic < self._num_topics, "Topic %d out of range!" % (sent.topic) + self.__sentences.append(sent) + self._topic_sum[sent.topic] += 1 + + def set_topic(self, index, new_topic): + assert new_topic >= 0, "Topic %d out of range!" % (new_topic) + assert new_topic < self._num_topics, "Topic %d out of range!" % (new_topic) + old_topic = self.__sentences[index].topic + if new_topic == old_topic: + return + self.__sentences[index].topic = new_topic + self._topic_sum[old_topic] -= 1 + self._topic_sum[new_topic] += 1 + + def size(self): + """Return number of sentences in SLDADoc. 
+ """ + return len(self.__sentences) + + def sent(self, index): + return self.__sentences[index] diff --git a/modules/text/semantic_model/slda_novel/inference_engine.py b/modules/text/language_model/slda_novel/inference_engine.py similarity index 100% rename from modules/text/semantic_model/slda_novel/inference_engine.py rename to modules/text/language_model/slda_novel/inference_engine.py diff --git a/modules/text/semantic_model/slda_novel/model.py b/modules/text/language_model/slda_novel/model.py similarity index 99% rename from modules/text/semantic_model/slda_novel/model.py rename to modules/text/language_model/slda_novel/model.py index cd4e6bab5f4701d4481f064c9ea1bbef829b37db..05dac700ff31a4fc78b577e1f3ba09b66ceecb2e 100644 --- a/modules/text/semantic_model/slda_novel/model.py +++ b/modules/text/language_model/slda_novel/model.py @@ -11,7 +11,6 @@ from slda_novel.vocab import Vocab, WordCount class TopicModel(object): """Storage Structure of Topic model, including vocabulary and word topic count. """ - def __init__(self, model_dir, config): """ Args: diff --git a/modules/text/semantic_model/slda_novel/module.py b/modules/text/language_model/slda_novel/module.py similarity index 100% rename from modules/text/semantic_model/slda_novel/module.py rename to modules/text/language_model/slda_novel/module.py diff --git a/modules/text/semantic_model/slda_novel/sampler.py b/modules/text/language_model/slda_novel/sampler.py similarity index 100% rename from modules/text/semantic_model/slda_novel/sampler.py rename to modules/text/language_model/slda_novel/sampler.py diff --git a/modules/text/semantic_model/slda_novel/semantic_matching.py b/modules/text/language_model/slda_novel/semantic_matching.py similarity index 100% rename from modules/text/semantic_model/slda_novel/semantic_matching.py rename to modules/text/language_model/slda_novel/semantic_matching.py diff --git a/modules/text/semantic_model/slda_news/tokenizer.py b/modules/text/language_model/slda_novel/tokenizer.py similarity index 99% rename from modules/text/semantic_model/slda_news/tokenizer.py rename to modules/text/language_model/slda_novel/tokenizer.py index 585aed885b63b0e2a2d450b77a6d018615c86b04..1d9afabc421bbe62fca462fecb50b8e1ce2da6f5 100644 --- a/modules/text/semantic_model/slda_news/tokenizer.py +++ b/modules/text/language_model/slda_novel/tokenizer.py @@ -7,7 +7,6 @@ from paddlehub.common.logger import logger class Tokenizer(object): """Base tokenizer class. """ - def __init__(self): pass @@ -21,7 +20,6 @@ class SimpleTokenizer(Tokenizer): Notes: This tokenizer can only recognize the words in the corresponding vocab file. """ - def __init__(self, vocab_path): super().__init__() self.__max_word_len = 0 diff --git a/modules/text/semantic_model/slda_novel/util.py b/modules/text/language_model/slda_novel/util.py similarity index 99% rename from modules/text/semantic_model/slda_novel/util.py rename to modules/text/language_model/slda_novel/util.py index b92e183a215b3bcfc6f4e8962409b1d71e01cc8c..6b24c714c9e15d150986c7aa9af5bf6b1f6c0bfa 100644 --- a/modules/text/semantic_model/slda_novel/util.py +++ b/modules/text/language_model/slda_novel/util.py @@ -46,7 +46,6 @@ def rand_k(k): def timeit(f): """Return time cost of function f. 
""" - def timed(*args, **kwargs): start_time = time.time() result = f(*args, **kwargs) diff --git a/modules/text/semantic_model/slda_novel/vocab.py b/modules/text/language_model/slda_novel/vocab.py similarity index 100% rename from modules/text/semantic_model/slda_novel/vocab.py rename to modules/text/language_model/slda_novel/vocab.py diff --git a/modules/text/semantic_model/slda_novel/vose_alias.py b/modules/text/language_model/slda_novel/vose_alias.py similarity index 99% rename from modules/text/semantic_model/slda_novel/vose_alias.py rename to modules/text/language_model/slda_novel/vose_alias.py index 1f424a042e4c0e7378bceccd6219170a45fcda31..a3ddba616e9d8b501d5400a014c0d3580683e700 100644 --- a/modules/text/semantic_model/slda_novel/vose_alias.py +++ b/modules/text/language_model/slda_novel/vose_alias.py @@ -9,7 +9,6 @@ from slda_novel.util import rand, rand_k class VoseAlias(object): """Vose's Alias Method. """ - def __init__(self): self.__alias = None self.__prob = None # np.array diff --git a/modules/text/semantic_model/slda_webpage/README.md b/modules/text/language_model/slda_webpage/README.md similarity index 100% rename from modules/text/semantic_model/slda_webpage/README.md rename to modules/text/language_model/slda_webpage/README.md diff --git a/modules/text/semantic_model/slda_webpage/__init__.py b/modules/text/language_model/slda_webpage/__init__.py similarity index 100% rename from modules/text/semantic_model/slda_webpage/__init__.py rename to modules/text/language_model/slda_webpage/__init__.py diff --git a/modules/text/semantic_model/slda_webpage/config.py b/modules/text/language_model/slda_webpage/config.py similarity index 100% rename from modules/text/semantic_model/slda_webpage/config.py rename to modules/text/language_model/slda_webpage/config.py diff --git a/modules/text/language_model/slda_webpage/document.py b/modules/text/language_model/slda_webpage/document.py new file mode 100644 index 0000000000000000000000000000000000000000..98eae50511e6621db61c9700ff5e44d302667666 --- /dev/null +++ b/modules/text/language_model/slda_webpage/document.py @@ -0,0 +1,171 @@ +import numpy as np + + +class Topic(object): + """Basic data structure of topic, contains topic id and + corresponding probability. + """ + def __init__(self, tid, prob): + self.tid = tid # topic id + self.prob = prob # topic probability + + +class Token(object): + """Basic storage unit of LDA documents, contains word id + and corresponding topic. + """ + def __init__(self, topic, id): + self.topic = topic + self.id = id + + +class Sentence(object): + """Basic storage unit of SentenceLDA documents, contains word ids + of the sentence and its corresponding topic id. + """ + def __init__(self, topic, tokens): + self.topic = topic + self.tokens = tokens + + +class LDADoc(object): + """The storage structure of LDA model's inference result. + """ + def __init__(self): + self._num_topics = None # Number of topics. + self._num_accum = None # Number of accumulated sample rounds. + self._alpha = None # Document prior parameter. + self._tokens = None # Storage structure of inference results. + self._topic_sum = None # Document's topic sum in one round samples. + self._accum_topic_sum = None # Accumulated results of topic sum. + + def init(self, num_topics): + """Initialize the LDADoc according to num_topics. 
+ """ + self._num_topics = num_topics + self._num_accum = 0 + self._tokens = [] + self._topic_sum = np.zeros(self._num_topics) + self._accum_topic_sum = np.zeros(self._num_topics) + + def add_token(self, token): + """Add new word to current LDADoc. + Arg: + token: Token class object. + """ + assert token.topic >= 0, "Topic %d out of range!" % token.topic + assert token.topic < self._num_topics, "Topic %d out of range!" % token.topic + self._tokens.append(token) + self._topic_sum[token.topic] += 1 + + def token(self, index): + return self._tokens[index] + + def set_topic(self, index, new_topic): + """Set the index word's topic to new_topic, and update the corresponding + topic distribution. + """ + assert new_topic >= 0, "Topic %d out of range!" % new_topic + assert new_topic < self._num_topics, "Topic %d out of range!" % new_topic + old_topic = self._tokens[index].topic + if new_topic == old_topic: + return + self._tokens[index].topic = new_topic + self._topic_sum[old_topic] -= 1 + self._topic_sum[new_topic] += 1 + + def set_alpha(self, alpha): + self._alpha = alpha + + def size(self): + """Return number of words in LDADoc. + """ + return len(self._tokens) + + def topic_sum(self, topic_id): + return self._topic_sum[topic_id] + + def sparse_topic_dist(self, sort=True): + """Return the topic distribution of documents in sparse format. + By default, it is sorted according to the topic probability + under the descending order. + """ + topic_dist = [] + sum_ = np.sum(self._accum_topic_sum) + if sum_ == 0: + return topic_dist + for i in range(0, self._num_topics): + if self._accum_topic_sum[i] == 0: + continue + topic_dist.append(Topic(i, self._accum_topic_sum[i] * 1.0 / sum_)) + if sort: + + def take_elem(topic): + return topic.prob + + topic_dist.sort(key=take_elem, reverse=True) + if topic_dist is None: + topic_dist = [] + + return topic_dist + + def dense_topic_dist(self): + """Return the distribution of document topics in dense format, + taking into account the prior parameter alpha. + """ + dense_dist = np.zeros(self._num_topics) + if self.size() == 0: + return dense_dist + dense_dist = (self._accum_topic_sum * 1.0 / self._num_accum + self._alpha) / (self.size() + + self._alpha * self._num_topics) + return dense_dist + + def accumulate_topic_num(self): + self._accum_topic_sum += self._topic_sum + self._num_accum += 1 + + +class SLDADoc(LDADoc): + """Sentence LDA Document, inherited from LDADoc. + Add add_sentence interface. + """ + def __init__(self): + super().__init__() + self.__sentences = None + + def init(self, num_topics): + """Initialize the SLDADoc according to num_topics. + """ + self._num_topics = num_topics + self.__sentences = [] + self._num_accum = 0 + self._topic_sum = np.zeros(self._num_topics) + self._accum_topic_sum = np.zeros(self._num_topics) + + def add_sentence(self, sent): + """Add new sentence to current SLDADoc. + Arg: + sent: Sentence class object. + """ + assert sent.topic >= 0, "Topic %d out of range!" % (sent.topic) + assert sent.topic < self._num_topics, "Topic %d out of range!" % (sent.topic) + self.__sentences.append(sent) + self._topic_sum[sent.topic] += 1 + + def set_topic(self, index, new_topic): + assert new_topic >= 0, "Topic %d out of range!" % (new_topic) + assert new_topic < self._num_topics, "Topic %d out of range!" 
% (new_topic) + old_topic = self.__sentences[index].topic + if new_topic == old_topic: + return + self.__sentences[index].topic = new_topic + self._topic_sum[old_topic] -= 1 + self._topic_sum[new_topic] += 1 + + def size(self): + """Return number of sentences in SLDADoc. + """ + return len(self.__sentences) + + def sent(self, index): + return self.__sentences[index] diff --git a/modules/text/semantic_model/slda_webpage/inference_engine.py b/modules/text/language_model/slda_webpage/inference_engine.py similarity index 100% rename from modules/text/semantic_model/slda_webpage/inference_engine.py rename to modules/text/language_model/slda_webpage/inference_engine.py diff --git a/modules/text/semantic_model/slda_webpage/model.py b/modules/text/language_model/slda_webpage/model.py similarity index 99% rename from modules/text/semantic_model/slda_webpage/model.py rename to modules/text/language_model/slda_webpage/model.py index e3e78020a7e4b3f273a22e437b0ae03ea0ed23f5..0b332ccba5e8b8b8dea55b2c8758ad6bb3d48f57 100644 --- a/modules/text/semantic_model/slda_webpage/model.py +++ b/modules/text/language_model/slda_webpage/model.py @@ -11,7 +11,6 @@ from slda_webpage.vocab import Vocab, WordCount class TopicModel(object): """Storage Structure of Topic model, including vocabulary and word topic count. """ - def __init__(self, model_dir, config): """ Args: diff --git a/modules/text/semantic_model/slda_webpage/module.py b/modules/text/language_model/slda_webpage/module.py similarity index 100% rename from modules/text/semantic_model/slda_webpage/module.py rename to modules/text/language_model/slda_webpage/module.py diff --git a/modules/text/semantic_model/slda_webpage/sampler.py b/modules/text/language_model/slda_webpage/sampler.py similarity index 100% rename from modules/text/semantic_model/slda_webpage/sampler.py rename to modules/text/language_model/slda_webpage/sampler.py diff --git a/modules/text/semantic_model/slda_webpage/semantic_matching.py b/modules/text/language_model/slda_webpage/semantic_matching.py similarity index 100% rename from modules/text/semantic_model/slda_webpage/semantic_matching.py rename to modules/text/language_model/slda_webpage/semantic_matching.py diff --git a/modules/text/language_model/slda_webpage/tokenizer.py b/modules/text/language_model/slda_webpage/tokenizer.py new file mode 100644 index 0000000000000000000000000000000000000000..1d9afabc421bbe62fca462fecb50b8e1ce2da6f5 --- /dev/null +++ b/modules/text/language_model/slda_webpage/tokenizer.py @@ -0,0 +1,125 @@ +import os + +import numpy as np +from paddlehub.common.logger import logger + + +class Tokenizer(object): + """Base tokenizer class. + """ + def __init__(self): + pass + + def tokenize(self, text): + raise NotImplementedError + + +class SimpleTokenizer(Tokenizer): + """Simple version FMM(Forward Maximun Matching) word tokenizer. This tokenizer can only + be used in topic model demo, but not in real business application scenarios. + + Notes: This tokenizer can only recognize the words in the corresponding vocab file. + """ + def __init__(self, vocab_path): + super().__init__() + self.__max_word_len = 0 + self.__vocab = set() + self.__load_vocab(vocab_path) + + def tokenize(self, text): + """Tokenize the input string `text`, and return the tokenize result. + """ + text_len = len(text) + result = [] + i = 0 + while i < text_len: + word = found_word = "" + # Deal with English characters. 
+ if self.__is_eng_char(text[i]): + for j in range(i, text_len + 1): + if j < text_len and self.__is_eng_char(text[j]): + word += self.__tolower(text[j]) + else: + # Forward matching by character granularity. + if word in self.__vocab: + result.append(word) + i = j - 1 + break + else: + for j in range(i, min(i + self.__max_word_len, text_len)): + word += text[j] + if word in self.__vocab: + found_word = word + if len(found_word) > 0: + result.append(found_word) + i += len(found_word) - 1 + i += 1 + return result + + def contains(self, word): + """Check whether the word is in the vocabulary. + """ + return word in self.__vocab + + def __load_vocab(self, vocab_path): + """Load the word dictionary. + """ + with open(vocab_path, 'r', encoding='utf-8') as fin: + vocab_size = 0 + for line in fin.readlines(): + fields = line.strip().split('\t') + assert len(fields) >= 2 + word = fields[1] + self.__max_word_len = max(self.__max_word_len, len(word)) + self.__vocab.add(word) + vocab_size += 1 + + def __is_eng_char(self, c): + """Check whether char c is an English character. + """ + return (c >= 'A' and c <= 'Z') or (c >= 'a' and c <= 'z') + + def __tolower(self, c): + """Return the lowercase character of the corresponding character, or return + the original character if there is no corresponding lowercase character. + """ + return c.lower() + + +class LACTokenizer(Tokenizer): + def __init__(self, vocab_path, lac): + super().__init__() + self.__max_word_len = 0 + self.__vocab = set() + self.__lac = lac + self.__load_vocab(vocab_path) + + def __load_vocab(self, vocab_path): + """Load the word dictionary. + """ + with open(vocab_path, 'r', encoding='utf-8') as fin: + vocab_size = 0 + for line in fin.readlines(): + fields = line.strip().split('\t') + assert len(fields) >= 2 + word = fields[1] + self.__max_word_len = max(self.__max_word_len, len(word)) + self.__vocab.add(word) + vocab_size += 1 + + def tokenize(self, text): + results = self.__lac.lexical_analysis(texts=[text], use_gpu=False, batch_size=1, return_tag=True) + # Change English words to lower case. + # And just preserve the word in vocab. + words = results[0]["word"] + result = [] + for word in words: + word = word.lower() + if word in self.__vocab: + result.append(word) + return result + + def contains(self, word): + """Check whether the word is in the vocabulary. + """ + return word in self.__vocab diff --git a/modules/text/semantic_model/slda_webpage/util.py b/modules/text/language_model/slda_webpage/util.py similarity index 99% rename from modules/text/semantic_model/slda_webpage/util.py rename to modules/text/language_model/slda_webpage/util.py index 6323a820093ba7c29c6d45e4d80b5e328e3d8a79..e3181ead335233cb01dc8aaf79aa13dda336e1c5 100644 --- a/modules/text/semantic_model/slda_webpage/util.py +++ b/modules/text/language_model/slda_webpage/util.py @@ -46,7 +46,6 @@ def rand_k(k): def timeit(f): """Return time cost of function f. 
""" - def timed(*args, **kwargs): start_time = time.time() result = f(*args, **kwargs) diff --git a/modules/text/semantic_model/slda_webpage/vocab.py b/modules/text/language_model/slda_webpage/vocab.py similarity index 100% rename from modules/text/semantic_model/slda_webpage/vocab.py rename to modules/text/language_model/slda_webpage/vocab.py diff --git a/modules/text/semantic_model/slda_webpage/vose_alias.py b/modules/text/language_model/slda_webpage/vose_alias.py similarity index 99% rename from modules/text/semantic_model/slda_webpage/vose_alias.py rename to modules/text/language_model/slda_webpage/vose_alias.py index 1190c84d453e68f8576717a1fab7aa21e14dc611..bc08b1654cc9a674f0951ad881f839a74d849c5d 100644 --- a/modules/text/semantic_model/slda_webpage/vose_alias.py +++ b/modules/text/language_model/slda_webpage/vose_alias.py @@ -9,7 +9,6 @@ from slda_webpage.util import rand, rand_k class VoseAlias(object): """Vose's Alias Method. """ - def __init__(self): self.__alias = None self.__prob = None # np.array diff --git a/modules/text/semantic_model/slda_weibo/README.md b/modules/text/language_model/slda_weibo/README.md similarity index 100% rename from modules/text/semantic_model/slda_weibo/README.md rename to modules/text/language_model/slda_weibo/README.md diff --git a/modules/text/semantic_model/slda_weibo/__init__.py b/modules/text/language_model/slda_weibo/__init__.py similarity index 100% rename from modules/text/semantic_model/slda_weibo/__init__.py rename to modules/text/language_model/slda_weibo/__init__.py diff --git a/modules/text/semantic_model/slda_weibo/config.py b/modules/text/language_model/slda_weibo/config.py similarity index 100% rename from modules/text/semantic_model/slda_weibo/config.py rename to modules/text/language_model/slda_weibo/config.py diff --git a/modules/text/language_model/slda_weibo/document.py b/modules/text/language_model/slda_weibo/document.py new file mode 100644 index 0000000000000000000000000000000000000000..98eae50511e6621db61c9700ff5e44d302667666 --- /dev/null +++ b/modules/text/language_model/slda_weibo/document.py @@ -0,0 +1,171 @@ +import numpy as np + + +class Topic(object): + """Basic data structure of topic, contains topic id and + corresponding probability. + """ + def __init__(self, tid, prob): + self.tid = tid # topic id + self.prob = prob # topic probability + + +class Token(object): + """Basic storage unit of LDA documents, contains word id + and corresponding topic. + """ + def __init__(self, topic, id): + self.topic = topic + self.id = id + + +class Sentence(object): + """Basic storage unit of SentenceLDA documents, contains word ids + of the sentence and its corresponding topic id. + """ + def __init__(self, topic, tokens): + self.topic = topic + self.tokens = tokens + + +class LDADoc(object): + """The storage structure of LDA model's inference result. + """ + def __init__(self): + self._num_topics = None # Number of topics. + self._num_accum = None # Number of accumulated sample rounds. + self._alpha = None # Document prior parameter. + self._tokens = None # Storage structure of inference results. + self._topic_sum = None # Document's topic sum in one round samples. + self._accum_topic_sum = None # Accumulated results of topic sum. + + def init(self, num_topics): + """Initialize the LDADoc according to num_topics. 
+ """ + self._num_topics = num_topics + self._num_accum = 0 + self._tokens = [] + self._topic_sum = np.zeros(self._num_topics) + self._accum_topic_sum = np.zeros(self._num_topics) + + def add_token(self, token): + """Add new word to current LDADoc. + Arg: + token: Token class object. + """ + assert token.topic >= 0, "Topic %d out of range!" % token.topic + assert token.topic < self._num_topics, "Topic %d out of range!" % token.topic + self._tokens.append(token) + self._topic_sum[token.topic] += 1 + + def token(self, index): + return self._tokens[index] + + def set_topic(self, index, new_topic): + """Set the index word's topic to new_topic, and update the corresponding + topic distribution. + """ + assert new_topic >= 0, "Topic %d out of range!" % new_topic + assert new_topic < self._num_topics, "Topic %d out of range!" % new_topic + old_topic = self._tokens[index].topic + if new_topic == old_topic: + return + self._tokens[index].topic = new_topic + self._topic_sum[old_topic] -= 1 + self._topic_sum[new_topic] += 1 + + def set_alpha(self, alpha): + self._alpha = alpha + + def size(self): + """Return number of words in LDADoc. + """ + return len(self._tokens) + + def topic_sum(self, topic_id): + return self._topic_sum[topic_id] + + def sparse_topic_dist(self, sort=True): + """Return the topic distribution of documents in sparse format. + By default, it is sorted according to the topic probability + under the descending order. + """ + topic_dist = [] + sum_ = np.sum(self._accum_topic_sum) + if sum_ == 0: + return topic_dist + for i in range(0, self._num_topics): + if self._accum_topic_sum[i] == 0: + continue + topic_dist.append(Topic(i, self._accum_topic_sum[i] * 1.0 / sum_)) + if sort: + + def take_elem(topic): + return topic.prob + + topic_dist.sort(key=take_elem, reverse=True) + if topic_dist is None: + topic_dist = [] + + return topic_dist + + def dense_topic_dist(self): + """Return the distribution of document topics in dense format, + taking into account the prior parameter alpha. + """ + dense_dist = np.zeros(self._num_topics) + if self.size() == 0: + return dense_dist + dense_dist = (self._accum_topic_sum * 1.0 / self._num_accum + self._alpha) / (self.size() + + self._alpha * self._num_topics) + return dense_dist + + def accumulate_topic_num(self): + self._accum_topic_sum += self._topic_sum + self._num_accum += 1 + + +class SLDADoc(LDADoc): + """Sentence LDA Document, inherited from LDADoc. + Add add_sentence interface. + """ + def __init__(self): + super().__init__() + self.__sentences = None + + def init(self, num_topics): + """Initialize the SLDADoc according to num_topics. + """ + self._num_topics = num_topics + self.__sentences = [] + self._num_accum = 0 + self._topic_sum = np.zeros(self._num_topics) + self._accum_topic_sum = np.zeros(self._num_topics) + + def add_sentence(self, sent): + """Add new sentence to current SLDADoc. + Arg: + sent: Sentence class object. + """ + assert sent.topic >= 0, "Topic %d out of range!" % (sent.topic) + assert sent.topic < self._num_topics, "Topic %d out of range!" % (sent.topic) + self.__sentences.append(sent) + self._topic_sum[sent.topic] += 1 + + def set_topic(self, index, new_topic): + assert new_topic >= 0, "Topic %d out of range!" % (new_topic) + assert new_topic < self._num_topics, "Topic %d out of range!" 
% (new_topic) + old_topic = self.__sentences[index].topic + if new_topic == old_topic: + return + self.__sentences[index].topic = new_topic + self._topic_sum[old_topic] -= 1 + self._topic_sum[new_topic] += 1 + + def size(self): + """Return number of sentences in SLDADoc. + """ + return len(self.__sentences) + + def sent(self, index): + return self.__sentences[index] diff --git a/modules/text/semantic_model/slda_weibo/inference_engine.py b/modules/text/language_model/slda_weibo/inference_engine.py similarity index 100% rename from modules/text/semantic_model/slda_weibo/inference_engine.py rename to modules/text/language_model/slda_weibo/inference_engine.py diff --git a/modules/text/semantic_model/slda_weibo/model.py b/modules/text/language_model/slda_weibo/model.py similarity index 99% rename from modules/text/semantic_model/slda_weibo/model.py rename to modules/text/language_model/slda_weibo/model.py index 500f44b554da2ca04aa34db27bbb07967ae41670..645bd1843147948ca9a0bd3f04c9b1a8af116f88 100644 --- a/modules/text/semantic_model/slda_weibo/model.py +++ b/modules/text/language_model/slda_weibo/model.py @@ -11,7 +11,6 @@ from slda_weibo.vocab import Vocab, WordCount class TopicModel(object): """Storage Structure of Topic model, including vocabulary and word topic count. """ - def __init__(self, model_dir, config): """ Args: diff --git a/modules/text/semantic_model/slda_weibo/module.py b/modules/text/language_model/slda_weibo/module.py similarity index 100% rename from modules/text/semantic_model/slda_weibo/module.py rename to modules/text/language_model/slda_weibo/module.py diff --git a/modules/text/semantic_model/slda_weibo/sampler.py b/modules/text/language_model/slda_weibo/sampler.py similarity index 100% rename from modules/text/semantic_model/slda_weibo/sampler.py rename to modules/text/language_model/slda_weibo/sampler.py diff --git a/modules/text/semantic_model/slda_weibo/semantic_matching.py b/modules/text/language_model/slda_weibo/semantic_matching.py similarity index 100% rename from modules/text/semantic_model/slda_weibo/semantic_matching.py rename to modules/text/language_model/slda_weibo/semantic_matching.py diff --git a/modules/text/language_model/slda_weibo/tokenizer.py b/modules/text/language_model/slda_weibo/tokenizer.py new file mode 100644 index 0000000000000000000000000000000000000000..1d9afabc421bbe62fca462fecb50b8e1ce2da6f5 --- /dev/null +++ b/modules/text/language_model/slda_weibo/tokenizer.py @@ -0,0 +1,125 @@ +import os + +import numpy as np +from paddlehub.common.logger import logger + + +class Tokenizer(object): + """Base tokenizer class. + """ + def __init__(self): + pass + + def tokenize(self, text): + raise NotImplementedError + + +class SimpleTokenizer(Tokenizer): + """Simple version FMM(Forward Maximun Matching) word tokenizer. This tokenizer can only + be used in topic model demo, but not in real business application scenarios. + + Notes: This tokenizer can only recognize the words in the corresponding vocab file. + """ + def __init__(self, vocab_path): + super().__init__() + self.__max_word_len = 0 + self.__vocab = set() + self.__load_vocab(vocab_path) + + def tokenize(self, text): + """Tokenize the input string `text`, and return the tokenize result. + """ + text_len = len(text) + result = [] + i = 0 + while i < text_len: + word = found_word = "" + # Deal with English characters. 
+ if self.__is_eng_char(text[i]): + for j in range(i, text_len + 1): + if j < text_len and self.__is_eng_char(text[j]): + word += self.__tolower(text[j]) + else: + # Forward matching by character granularity. + if word in self.__vocab: + result.append(word) + i = j - 1 + break + else: + for j in range(i, min(i + self.__max_word_len, text_len)): + word += text[j] + if word in self.__vocab: + found_word = word + if len(found_word) > 0: + result.append(found_word) + i += len(found_word) - 1 + i += 1 + return result + + def contains(self, word): + """Check whether the word is in the vocabulary. + """ + return word in self.__vocab + + def __load_vocab(self, vocab_path): + """Load the word dictionary. + """ + with open(vocab_path, 'r', encoding='utf-8') as fin: + vocab_size = 0 + for line in fin.readlines(): + fields = line.strip().split('\t') + assert len(fields) >= 2 + word = fields[1] + self.__max_word_len = max(self.__max_word_len, len(word)) + self.__vocab.add(word) + vocab_size += 1 + + def __is_eng_char(self, c): + """Check whether char c is an English character. + """ + return (c >= 'A' and c <= 'Z') or (c >= 'a' and c <= 'z') + + def __tolower(self, c): + """Return the lowercase character of the corresponding character, or return + the original character if there is no corresponding lowercase character. + """ + return c.lower() + + +class LACTokenizer(Tokenizer): + def __init__(self, vocab_path, lac): + super().__init__() + self.__max_word_len = 0 + self.__vocab = set() + self.__lac = lac + self.__load_vocab(vocab_path) + + def __load_vocab(self, vocab_path): + """Load the word dictionary. + """ + with open(vocab_path, 'r', encoding='utf-8') as fin: + vocab_size = 0 + for line in fin.readlines(): + fields = line.strip().split('\t') + assert len(fields) >= 2 + word = fields[1] + self.__max_word_len = max(self.__max_word_len, len(word)) + self.__vocab.add(word) + vocab_size += 1 + + def tokenize(self, text): + results = self.__lac.lexical_analysis(texts=[text], use_gpu=False, batch_size=1, return_tag=True) + # Change English words to lower case. + # And just preserve the word in vocab. + words = results[0]["word"] + result = [] + for word in words: + word = word.lower() + if word in self.__vocab: + result.append(word) + return result + + def contains(self, word): + """Check whether the word is in the vocabulary. + """ + return word in self.__vocab diff --git a/modules/text/semantic_model/slda_weibo/util.py b/modules/text/language_model/slda_weibo/util.py similarity index 99% rename from modules/text/semantic_model/slda_weibo/util.py rename to modules/text/language_model/slda_weibo/util.py index 9c2a651ee1c7256348a38649d1abfe0a45ee2039..04b2fc99f7d631c764959b0022bb25918172d2bf 100644 --- a/modules/text/semantic_model/slda_weibo/util.py +++ b/modules/text/language_model/slda_weibo/util.py @@ -46,7 +46,6 @@ def rand_k(k): def timeit(f): """Return time cost of function f. 
""" - def timed(*args, **kwargs): start_time = time.time() result = f(*args, **kwargs) diff --git a/modules/text/semantic_model/slda_weibo/vocab.py b/modules/text/language_model/slda_weibo/vocab.py similarity index 100% rename from modules/text/semantic_model/slda_weibo/vocab.py rename to modules/text/language_model/slda_weibo/vocab.py diff --git a/modules/text/semantic_model/slda_weibo/vose_alias.py b/modules/text/language_model/slda_weibo/vose_alias.py similarity index 99% rename from modules/text/semantic_model/slda_weibo/vose_alias.py rename to modules/text/language_model/slda_weibo/vose_alias.py index 268f307a298399dd66eccdd534a0e45ffde7c8b6..c8c132371966d9d804fb7ba1aa7332038cb7930f 100644 --- a/modules/text/semantic_model/slda_weibo/vose_alias.py +++ b/modules/text/language_model/slda_weibo/vose_alias.py @@ -9,7 +9,6 @@ from slda_weibo.util import rand, rand_k class VoseAlias(object): """Vose's Alias Method. """ - def __init__(self): self.__alias = None self.__prob = None # np.array diff --git a/modules/text/lexical_analysis/README.md b/modules/text/lexical_analysis/README.md index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..d48577993a8c0587bde85d7165c914311fe014e3 100644 --- a/modules/text/lexical_analysis/README.md +++ b/modules/text/lexical_analysis/README.md @@ -0,0 +1,10 @@ +## **更好用户体验,建议参考WEB端官方文档 -> [【词法分析】](https://www.paddlepaddle.org.cn/hublist)** + +### 词法分析 + +- 推荐模型 + +| 模型名称 | 模型简介 | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| [词法分析-LAC](https://www.paddlepaddle.org.cn/hubdetail?name=lac&en_category=LexicalAnalysis) | 百度自研联合的词法分析模型,能整体性地完成中文分词、词性标注、专名识别任务。在百度自建数据集上评测,LAC效果:Precision=88.0%,Recall=88.7%,F1-Score=88.4%。 | +| [切词网络-jieba](https://www.paddlepaddle.org.cn/hubdetail?name=jieba_paddle&en_category=LexicalAnalysis) | 该Module是jieba使用PaddlePaddle深度学习框架搭建的切词网络(双向GRU)。 同时也支持jieba的传统切词方法,如精确模式、全模式、搜索引擎模式等切词模式。 | diff --git a/modules/text/semantic_model/README.md b/modules/text/semantic_model/README.md deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/modules/text/semantic_model/bert_multi_uncased_L_12_H_768_A_12/model/transformer_encoder.py b/modules/text/semantic_model/bert_multi_uncased_L_12_H_768_A_12/model/transformer_encoder.py deleted file mode 100644 index 53051cde80308a17a30d9b92de11c712b63da406..0000000000000000000000000000000000000000 --- a/modules/text/semantic_model/bert_multi_uncased_L_12_H_768_A_12/model/transformer_encoder.py +++ /dev/null @@ -1,288 +0,0 @@ -# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-"""Transformer encoder.""" - -from __future__ import absolute_import -from __future__ import division -from __future__ import print_function - -from functools import partial - -import paddle.fluid as fluid -import paddle.fluid.layers as layers - - -def multi_head_attention(queries, - keys, - values, - attn_bias, - d_key, - d_value, - d_model, - n_head=1, - dropout_rate=0., - cache=None, - param_initializer=None, - name='multi_head_att'): - """ - Multi-Head Attention. Note that attn_bias is added to the logit before - computing softmax activiation to mask certain selected positions so that - they will not considered in attention weights. - """ - keys = queries if keys is None else keys - values = keys if values is None else values - - if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): - raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") - - def __compute_qkv(queries, keys, values, n_head, d_key, d_value): - """ - Add linear projection to queries, keys, and values. - """ - q = layers.fc( - input=queries, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), - bias_attr=name + '_query_fc.b_0') - k = layers.fc( - input=keys, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), - bias_attr=name + '_key_fc.b_0') - v = layers.fc( - input=values, - size=d_value * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), - bias_attr=name + '_value_fc.b_0') - return q, k, v - - def __split_heads(x, n_head): - """ - Reshape the last dimension of inpunt tensor x so that it becomes two - dimensions and then transpose. Specifically, input a tensor with shape - [bs, max_sequence_length, n_head * hidden_dim] then output a tensor - with shape [bs, n_head, max_sequence_length, hidden_dim]. - """ - hidden_size = x.shape[-1] - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) - - # permuate the dimensions into: - # [batch_size, n_head, max_sequence_len, hidden_size_per_head] - return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) - - def __combine_heads(x): - """ - Transpose and then reshape the last two dimensions of inpunt tensor x - so that it becomes one dimension, which is reverse to __split_heads. - """ - if len(x.shape) == 3: return x - if len(x.shape) != 4: - raise ValueError("Input(x) should be a 4-D Tensor.") - - trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. 
- return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) - - def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): - """ - Scaled Dot-Product Attention - """ - scaled_q = layers.scale(x=q, scale=d_key**-0.5) - product = layers.matmul(x=scaled_q, y=k, transpose_y=True) - if attn_bias: - product += attn_bias - weights = layers.softmax(product) - if dropout_rate: - weights = layers.dropout( - weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.matmul(weights, v) - return out - - q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) - - if cache is not None: # use cache and concat time steps - # Since the inplace reshape in __split_heads changes the shape of k and - # v, which is the cache input for next time step, reshape the cache - # input from the previous time step first. - k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) - v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) - - q = __split_heads(q, n_head) - k = __split_heads(k, n_head) - v = __split_heads(v, n_head) - - ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) - - out = __combine_heads(ctx_multiheads) - - # Project back to the model size. - proj_out = layers.fc( - input=out, - size=d_model, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), - bias_attr=name + '_output_fc.b_0') - return proj_out - - -def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): - """ - Position-wise Feed-Forward Networks. - This module consists of two linear transformations with a ReLU activation - in between, which is applied to each position separately and identically. - """ - hidden = layers.fc( - input=x, - size=d_inner_hid, - num_flatten_dims=2, - act=hidden_act, - param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), - bias_attr=name + '_fc_0.b_0') - if dropout_rate: - hidden = layers.dropout( - hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.fc( - input=hidden, - size=d_hid, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), - bias_attr=name + '_fc_1.b_0') - return out - - -def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): - """ - Add residual connection, layer normalization and droput to the out tensor - optionally according to the value of process_cmd. - This will be used before or after multi-head attention and position-wise - feed-forward networks. 
- """ - for cmd in process_cmd: - if cmd == "a": # add residual connection - out = out + prev_out if prev_out else out - elif cmd == "n": # add layer normalization - out_dtype = out.dtype - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float32") - out = layers.layer_norm( - out, - begin_norm_axis=len(out.shape) - 1, - param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), - bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.))) - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float16") - elif cmd == "d": # add dropout - if dropout_rate: - out = layers.dropout( - out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - return out - - -pre_process_layer = partial(pre_post_process_layer, None) -post_process_layer = pre_post_process_layer - - -def encoder_layer(enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """The encoder layers that can be stacked to form a deep encoder. - This module consits of a multi-head (self) attention followed by - position-wise feed-forward networks and both the two components companied - with the post_process_layer to add residual connection, layer normalization - and droput. - """ - attn_output = multi_head_attention( - pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'), - None, - None, - attn_bias, - d_key, - d_value, - d_model, - n_head, - attention_dropout, - param_initializer=param_initializer, - name=name + '_multi_head_att') - attn_output = post_process_layer( - enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att') - ffd_output = positionwise_feed_forward( - pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'), - d_inner_hid, - d_model, - relu_dropout, - hidden_act, - param_initializer=param_initializer, - name=name + '_ffn') - return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') - - -def encoder(enc_input, - attn_bias, - n_layer, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """ - The encoder is composed of a stack of identical layers returned by calling - encoder_layer. 
- """ - for i in range(n_layer): - enc_output = encoder_layer( - enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd, - postprocess_cmd, - param_initializer=param_initializer, - name=name + '_layer_' + str(i)) - enc_input = enc_output - enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") - - return enc_output diff --git a/modules/text/semantic_model/bert_uncased_L_12_H_768_A_12/model/transformer_encoder.py b/modules/text/semantic_model/bert_uncased_L_12_H_768_A_12/model/transformer_encoder.py deleted file mode 100644 index 53051cde80308a17a30d9b92de11c712b63da406..0000000000000000000000000000000000000000 --- a/modules/text/semantic_model/bert_uncased_L_12_H_768_A_12/model/transformer_encoder.py +++ /dev/null @@ -1,288 +0,0 @@ -# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -"""Transformer encoder.""" - -from __future__ import absolute_import -from __future__ import division -from __future__ import print_function - -from functools import partial - -import paddle.fluid as fluid -import paddle.fluid.layers as layers - - -def multi_head_attention(queries, - keys, - values, - attn_bias, - d_key, - d_value, - d_model, - n_head=1, - dropout_rate=0., - cache=None, - param_initializer=None, - name='multi_head_att'): - """ - Multi-Head Attention. Note that attn_bias is added to the logit before - computing softmax activiation to mask certain selected positions so that - they will not considered in attention weights. - """ - keys = queries if keys is None else keys - values = keys if values is None else values - - if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): - raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") - - def __compute_qkv(queries, keys, values, n_head, d_key, d_value): - """ - Add linear projection to queries, keys, and values. - """ - q = layers.fc( - input=queries, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), - bias_attr=name + '_query_fc.b_0') - k = layers.fc( - input=keys, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), - bias_attr=name + '_key_fc.b_0') - v = layers.fc( - input=values, - size=d_value * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), - bias_attr=name + '_value_fc.b_0') - return q, k, v - - def __split_heads(x, n_head): - """ - Reshape the last dimension of inpunt tensor x so that it becomes two - dimensions and then transpose. Specifically, input a tensor with shape - [bs, max_sequence_length, n_head * hidden_dim] then output a tensor - with shape [bs, n_head, max_sequence_length, hidden_dim]. 
- """ - hidden_size = x.shape[-1] - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) - - # permuate the dimensions into: - # [batch_size, n_head, max_sequence_len, hidden_size_per_head] - return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) - - def __combine_heads(x): - """ - Transpose and then reshape the last two dimensions of inpunt tensor x - so that it becomes one dimension, which is reverse to __split_heads. - """ - if len(x.shape) == 3: return x - if len(x.shape) != 4: - raise ValueError("Input(x) should be a 4-D Tensor.") - - trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) - - def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): - """ - Scaled Dot-Product Attention - """ - scaled_q = layers.scale(x=q, scale=d_key**-0.5) - product = layers.matmul(x=scaled_q, y=k, transpose_y=True) - if attn_bias: - product += attn_bias - weights = layers.softmax(product) - if dropout_rate: - weights = layers.dropout( - weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.matmul(weights, v) - return out - - q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) - - if cache is not None: # use cache and concat time steps - # Since the inplace reshape in __split_heads changes the shape of k and - # v, which is the cache input for next time step, reshape the cache - # input from the previous time step first. - k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) - v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) - - q = __split_heads(q, n_head) - k = __split_heads(k, n_head) - v = __split_heads(v, n_head) - - ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) - - out = __combine_heads(ctx_multiheads) - - # Project back to the model size. - proj_out = layers.fc( - input=out, - size=d_model, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), - bias_attr=name + '_output_fc.b_0') - return proj_out - - -def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): - """ - Position-wise Feed-Forward Networks. - This module consists of two linear transformations with a ReLU activation - in between, which is applied to each position separately and identically. 
- """ - hidden = layers.fc( - input=x, - size=d_inner_hid, - num_flatten_dims=2, - act=hidden_act, - param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), - bias_attr=name + '_fc_0.b_0') - if dropout_rate: - hidden = layers.dropout( - hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.fc( - input=hidden, - size=d_hid, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), - bias_attr=name + '_fc_1.b_0') - return out - - -def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): - """ - Add residual connection, layer normalization and droput to the out tensor - optionally according to the value of process_cmd. - This will be used before or after multi-head attention and position-wise - feed-forward networks. - """ - for cmd in process_cmd: - if cmd == "a": # add residual connection - out = out + prev_out if prev_out else out - elif cmd == "n": # add layer normalization - out_dtype = out.dtype - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float32") - out = layers.layer_norm( - out, - begin_norm_axis=len(out.shape) - 1, - param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), - bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.))) - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float16") - elif cmd == "d": # add dropout - if dropout_rate: - out = layers.dropout( - out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - return out - - -pre_process_layer = partial(pre_post_process_layer, None) -post_process_layer = pre_post_process_layer - - -def encoder_layer(enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """The encoder layers that can be stacked to form a deep encoder. - This module consits of a multi-head (self) attention followed by - position-wise feed-forward networks and both the two components companied - with the post_process_layer to add residual connection, layer normalization - and droput. 
- """ - attn_output = multi_head_attention( - pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'), - None, - None, - attn_bias, - d_key, - d_value, - d_model, - n_head, - attention_dropout, - param_initializer=param_initializer, - name=name + '_multi_head_att') - attn_output = post_process_layer( - enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att') - ffd_output = positionwise_feed_forward( - pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'), - d_inner_hid, - d_model, - relu_dropout, - hidden_act, - param_initializer=param_initializer, - name=name + '_ffn') - return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') - - -def encoder(enc_input, - attn_bias, - n_layer, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """ - The encoder is composed of a stack of identical layers returned by calling - encoder_layer. - """ - for i in range(n_layer): - enc_output = encoder_layer( - enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd, - postprocess_cmd, - param_initializer=param_initializer, - name=name + '_layer_' + str(i)) - enc_input = enc_output - enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") - - return enc_output diff --git a/modules/text/semantic_model/bert_uncased_L_24_H_1024_A_16/model/transformer_encoder.py b/modules/text/semantic_model/bert_uncased_L_24_H_1024_A_16/model/transformer_encoder.py deleted file mode 100644 index 53051cde80308a17a30d9b92de11c712b63da406..0000000000000000000000000000000000000000 --- a/modules/text/semantic_model/bert_uncased_L_24_H_1024_A_16/model/transformer_encoder.py +++ /dev/null @@ -1,288 +0,0 @@ -# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -"""Transformer encoder.""" - -from __future__ import absolute_import -from __future__ import division -from __future__ import print_function - -from functools import partial - -import paddle.fluid as fluid -import paddle.fluid.layers as layers - - -def multi_head_attention(queries, - keys, - values, - attn_bias, - d_key, - d_value, - d_model, - n_head=1, - dropout_rate=0., - cache=None, - param_initializer=None, - name='multi_head_att'): - """ - Multi-Head Attention. Note that attn_bias is added to the logit before - computing softmax activiation to mask certain selected positions so that - they will not considered in attention weights. 
- """ - keys = queries if keys is None else keys - values = keys if values is None else values - - if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): - raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") - - def __compute_qkv(queries, keys, values, n_head, d_key, d_value): - """ - Add linear projection to queries, keys, and values. - """ - q = layers.fc( - input=queries, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), - bias_attr=name + '_query_fc.b_0') - k = layers.fc( - input=keys, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), - bias_attr=name + '_key_fc.b_0') - v = layers.fc( - input=values, - size=d_value * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), - bias_attr=name + '_value_fc.b_0') - return q, k, v - - def __split_heads(x, n_head): - """ - Reshape the last dimension of inpunt tensor x so that it becomes two - dimensions and then transpose. Specifically, input a tensor with shape - [bs, max_sequence_length, n_head * hidden_dim] then output a tensor - with shape [bs, n_head, max_sequence_length, hidden_dim]. - """ - hidden_size = x.shape[-1] - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) - - # permuate the dimensions into: - # [batch_size, n_head, max_sequence_len, hidden_size_per_head] - return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) - - def __combine_heads(x): - """ - Transpose and then reshape the last two dimensions of inpunt tensor x - so that it becomes one dimension, which is reverse to __split_heads. - """ - if len(x.shape) == 3: return x - if len(x.shape) != 4: - raise ValueError("Input(x) should be a 4-D Tensor.") - - trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) - - def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): - """ - Scaled Dot-Product Attention - """ - scaled_q = layers.scale(x=q, scale=d_key**-0.5) - product = layers.matmul(x=scaled_q, y=k, transpose_y=True) - if attn_bias: - product += attn_bias - weights = layers.softmax(product) - if dropout_rate: - weights = layers.dropout( - weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.matmul(weights, v) - return out - - q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) - - if cache is not None: # use cache and concat time steps - # Since the inplace reshape in __split_heads changes the shape of k and - # v, which is the cache input for next time step, reshape the cache - # input from the previous time step first. 
- k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) - v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) - - q = __split_heads(q, n_head) - k = __split_heads(k, n_head) - v = __split_heads(v, n_head) - - ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) - - out = __combine_heads(ctx_multiheads) - - # Project back to the model size. - proj_out = layers.fc( - input=out, - size=d_model, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), - bias_attr=name + '_output_fc.b_0') - return proj_out - - -def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): - """ - Position-wise Feed-Forward Networks. - This module consists of two linear transformations with a ReLU activation - in between, which is applied to each position separately and identically. - """ - hidden = layers.fc( - input=x, - size=d_inner_hid, - num_flatten_dims=2, - act=hidden_act, - param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), - bias_attr=name + '_fc_0.b_0') - if dropout_rate: - hidden = layers.dropout( - hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.fc( - input=hidden, - size=d_hid, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), - bias_attr=name + '_fc_1.b_0') - return out - - -def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): - """ - Add residual connection, layer normalization and droput to the out tensor - optionally according to the value of process_cmd. - This will be used before or after multi-head attention and position-wise - feed-forward networks. - """ - for cmd in process_cmd: - if cmd == "a": # add residual connection - out = out + prev_out if prev_out else out - elif cmd == "n": # add layer normalization - out_dtype = out.dtype - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float32") - out = layers.layer_norm( - out, - begin_norm_axis=len(out.shape) - 1, - param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), - bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.))) - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float16") - elif cmd == "d": # add dropout - if dropout_rate: - out = layers.dropout( - out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - return out - - -pre_process_layer = partial(pre_post_process_layer, None) -post_process_layer = pre_post_process_layer - - -def encoder_layer(enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """The encoder layers that can be stacked to form a deep encoder. - This module consits of a multi-head (self) attention followed by - position-wise feed-forward networks and both the two components companied - with the post_process_layer to add residual connection, layer normalization - and droput. 
- """ - attn_output = multi_head_attention( - pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'), - None, - None, - attn_bias, - d_key, - d_value, - d_model, - n_head, - attention_dropout, - param_initializer=param_initializer, - name=name + '_multi_head_att') - attn_output = post_process_layer( - enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att') - ffd_output = positionwise_feed_forward( - pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'), - d_inner_hid, - d_model, - relu_dropout, - hidden_act, - param_initializer=param_initializer, - name=name + '_ffn') - return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') - - -def encoder(enc_input, - attn_bias, - n_layer, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """ - The encoder is composed of a stack of identical layers returned by calling - encoder_layer. - """ - for i in range(n_layer): - enc_output = encoder_layer( - enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd, - postprocess_cmd, - param_initializer=param_initializer, - name=name + '_layer_' + str(i)) - enc_input = enc_output - enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") - - return enc_output diff --git a/modules/text/semantic_model/chinese_bert_wwm/model/transformer_encoder.py b/modules/text/semantic_model/chinese_bert_wwm/model/transformer_encoder.py deleted file mode 100644 index 53051cde80308a17a30d9b92de11c712b63da406..0000000000000000000000000000000000000000 --- a/modules/text/semantic_model/chinese_bert_wwm/model/transformer_encoder.py +++ /dev/null @@ -1,288 +0,0 @@ -# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -"""Transformer encoder.""" - -from __future__ import absolute_import -from __future__ import division -from __future__ import print_function - -from functools import partial - -import paddle.fluid as fluid -import paddle.fluid.layers as layers - - -def multi_head_attention(queries, - keys, - values, - attn_bias, - d_key, - d_value, - d_model, - n_head=1, - dropout_rate=0., - cache=None, - param_initializer=None, - name='multi_head_att'): - """ - Multi-Head Attention. Note that attn_bias is added to the logit before - computing softmax activiation to mask certain selected positions so that - they will not considered in attention weights. 
- """ - keys = queries if keys is None else keys - values = keys if values is None else values - - if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): - raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") - - def __compute_qkv(queries, keys, values, n_head, d_key, d_value): - """ - Add linear projection to queries, keys, and values. - """ - q = layers.fc( - input=queries, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), - bias_attr=name + '_query_fc.b_0') - k = layers.fc( - input=keys, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), - bias_attr=name + '_key_fc.b_0') - v = layers.fc( - input=values, - size=d_value * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), - bias_attr=name + '_value_fc.b_0') - return q, k, v - - def __split_heads(x, n_head): - """ - Reshape the last dimension of inpunt tensor x so that it becomes two - dimensions and then transpose. Specifically, input a tensor with shape - [bs, max_sequence_length, n_head * hidden_dim] then output a tensor - with shape [bs, n_head, max_sequence_length, hidden_dim]. - """ - hidden_size = x.shape[-1] - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) - - # permuate the dimensions into: - # [batch_size, n_head, max_sequence_len, hidden_size_per_head] - return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) - - def __combine_heads(x): - """ - Transpose and then reshape the last two dimensions of inpunt tensor x - so that it becomes one dimension, which is reverse to __split_heads. - """ - if len(x.shape) == 3: return x - if len(x.shape) != 4: - raise ValueError("Input(x) should be a 4-D Tensor.") - - trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) - - def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): - """ - Scaled Dot-Product Attention - """ - scaled_q = layers.scale(x=q, scale=d_key**-0.5) - product = layers.matmul(x=scaled_q, y=k, transpose_y=True) - if attn_bias: - product += attn_bias - weights = layers.softmax(product) - if dropout_rate: - weights = layers.dropout( - weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.matmul(weights, v) - return out - - q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) - - if cache is not None: # use cache and concat time steps - # Since the inplace reshape in __split_heads changes the shape of k and - # v, which is the cache input for next time step, reshape the cache - # input from the previous time step first. 
- k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) - v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) - - q = __split_heads(q, n_head) - k = __split_heads(k, n_head) - v = __split_heads(v, n_head) - - ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) - - out = __combine_heads(ctx_multiheads) - - # Project back to the model size. - proj_out = layers.fc( - input=out, - size=d_model, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), - bias_attr=name + '_output_fc.b_0') - return proj_out - - -def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): - """ - Position-wise Feed-Forward Networks. - This module consists of two linear transformations with a ReLU activation - in between, which is applied to each position separately and identically. - """ - hidden = layers.fc( - input=x, - size=d_inner_hid, - num_flatten_dims=2, - act=hidden_act, - param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), - bias_attr=name + '_fc_0.b_0') - if dropout_rate: - hidden = layers.dropout( - hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.fc( - input=hidden, - size=d_hid, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), - bias_attr=name + '_fc_1.b_0') - return out - - -def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): - """ - Add residual connection, layer normalization and droput to the out tensor - optionally according to the value of process_cmd. - This will be used before or after multi-head attention and position-wise - feed-forward networks. - """ - for cmd in process_cmd: - if cmd == "a": # add residual connection - out = out + prev_out if prev_out else out - elif cmd == "n": # add layer normalization - out_dtype = out.dtype - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float32") - out = layers.layer_norm( - out, - begin_norm_axis=len(out.shape) - 1, - param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), - bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.))) - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float16") - elif cmd == "d": # add dropout - if dropout_rate: - out = layers.dropout( - out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - return out - - -pre_process_layer = partial(pre_post_process_layer, None) -post_process_layer = pre_post_process_layer - - -def encoder_layer(enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """The encoder layers that can be stacked to form a deep encoder. - This module consits of a multi-head (self) attention followed by - position-wise feed-forward networks and both the two components companied - with the post_process_layer to add residual connection, layer normalization - and droput. 
- """ - attn_output = multi_head_attention( - pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'), - None, - None, - attn_bias, - d_key, - d_value, - d_model, - n_head, - attention_dropout, - param_initializer=param_initializer, - name=name + '_multi_head_att') - attn_output = post_process_layer( - enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att') - ffd_output = positionwise_feed_forward( - pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'), - d_inner_hid, - d_model, - relu_dropout, - hidden_act, - param_initializer=param_initializer, - name=name + '_ffn') - return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') - - -def encoder(enc_input, - attn_bias, - n_layer, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """ - The encoder is composed of a stack of identical layers returned by calling - encoder_layer. - """ - for i in range(n_layer): - enc_output = encoder_layer( - enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd, - postprocess_cmd, - param_initializer=param_initializer, - name=name + '_layer_' + str(i)) - enc_input = enc_output - enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") - - return enc_output diff --git a/modules/text/semantic_model/chinese_bert_wwm_ext/model/transformer_encoder.py b/modules/text/semantic_model/chinese_bert_wwm_ext/model/transformer_encoder.py deleted file mode 100644 index 53051cde80308a17a30d9b92de11c712b63da406..0000000000000000000000000000000000000000 --- a/modules/text/semantic_model/chinese_bert_wwm_ext/model/transformer_encoder.py +++ /dev/null @@ -1,288 +0,0 @@ -# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -"""Transformer encoder.""" - -from __future__ import absolute_import -from __future__ import division -from __future__ import print_function - -from functools import partial - -import paddle.fluid as fluid -import paddle.fluid.layers as layers - - -def multi_head_attention(queries, - keys, - values, - attn_bias, - d_key, - d_value, - d_model, - n_head=1, - dropout_rate=0., - cache=None, - param_initializer=None, - name='multi_head_att'): - """ - Multi-Head Attention. Note that attn_bias is added to the logit before - computing softmax activiation to mask certain selected positions so that - they will not considered in attention weights. 
- """ - keys = queries if keys is None else keys - values = keys if values is None else values - - if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): - raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") - - def __compute_qkv(queries, keys, values, n_head, d_key, d_value): - """ - Add linear projection to queries, keys, and values. - """ - q = layers.fc( - input=queries, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), - bias_attr=name + '_query_fc.b_0') - k = layers.fc( - input=keys, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), - bias_attr=name + '_key_fc.b_0') - v = layers.fc( - input=values, - size=d_value * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), - bias_attr=name + '_value_fc.b_0') - return q, k, v - - def __split_heads(x, n_head): - """ - Reshape the last dimension of inpunt tensor x so that it becomes two - dimensions and then transpose. Specifically, input a tensor with shape - [bs, max_sequence_length, n_head * hidden_dim] then output a tensor - with shape [bs, n_head, max_sequence_length, hidden_dim]. - """ - hidden_size = x.shape[-1] - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) - - # permuate the dimensions into: - # [batch_size, n_head, max_sequence_len, hidden_size_per_head] - return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) - - def __combine_heads(x): - """ - Transpose and then reshape the last two dimensions of inpunt tensor x - so that it becomes one dimension, which is reverse to __split_heads. - """ - if len(x.shape) == 3: return x - if len(x.shape) != 4: - raise ValueError("Input(x) should be a 4-D Tensor.") - - trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) - - def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): - """ - Scaled Dot-Product Attention - """ - scaled_q = layers.scale(x=q, scale=d_key**-0.5) - product = layers.matmul(x=scaled_q, y=k, transpose_y=True) - if attn_bias: - product += attn_bias - weights = layers.softmax(product) - if dropout_rate: - weights = layers.dropout( - weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.matmul(weights, v) - return out - - q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) - - if cache is not None: # use cache and concat time steps - # Since the inplace reshape in __split_heads changes the shape of k and - # v, which is the cache input for next time step, reshape the cache - # input from the previous time step first. 
- k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) - v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) - - q = __split_heads(q, n_head) - k = __split_heads(k, n_head) - v = __split_heads(v, n_head) - - ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) - - out = __combine_heads(ctx_multiheads) - - # Project back to the model size. - proj_out = layers.fc( - input=out, - size=d_model, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), - bias_attr=name + '_output_fc.b_0') - return proj_out - - -def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): - """ - Position-wise Feed-Forward Networks. - This module consists of two linear transformations with a ReLU activation - in between, which is applied to each position separately and identically. - """ - hidden = layers.fc( - input=x, - size=d_inner_hid, - num_flatten_dims=2, - act=hidden_act, - param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), - bias_attr=name + '_fc_0.b_0') - if dropout_rate: - hidden = layers.dropout( - hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.fc( - input=hidden, - size=d_hid, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), - bias_attr=name + '_fc_1.b_0') - return out - - -def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): - """ - Add residual connection, layer normalization and droput to the out tensor - optionally according to the value of process_cmd. - This will be used before or after multi-head attention and position-wise - feed-forward networks. - """ - for cmd in process_cmd: - if cmd == "a": # add residual connection - out = out + prev_out if prev_out else out - elif cmd == "n": # add layer normalization - out_dtype = out.dtype - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float32") - out = layers.layer_norm( - out, - begin_norm_axis=len(out.shape) - 1, - param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), - bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.))) - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float16") - elif cmd == "d": # add dropout - if dropout_rate: - out = layers.dropout( - out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - return out - - -pre_process_layer = partial(pre_post_process_layer, None) -post_process_layer = pre_post_process_layer - - -def encoder_layer(enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """The encoder layers that can be stacked to form a deep encoder. - This module consits of a multi-head (self) attention followed by - position-wise feed-forward networks and both the two components companied - with the post_process_layer to add residual connection, layer normalization - and droput. 
- """ - attn_output = multi_head_attention( - pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'), - None, - None, - attn_bias, - d_key, - d_value, - d_model, - n_head, - attention_dropout, - param_initializer=param_initializer, - name=name + '_multi_head_att') - attn_output = post_process_layer( - enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att') - ffd_output = positionwise_feed_forward( - pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'), - d_inner_hid, - d_model, - relu_dropout, - hidden_act, - param_initializer=param_initializer, - name=name + '_ffn') - return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') - - -def encoder(enc_input, - attn_bias, - n_layer, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """ - The encoder is composed of a stack of identical layers returned by calling - encoder_layer. - """ - for i in range(n_layer): - enc_output = encoder_layer( - enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd, - postprocess_cmd, - param_initializer=param_initializer, - name=name + '_layer_' + str(i)) - enc_input = enc_output - enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") - - return enc_output diff --git a/modules/text/semantic_model/chinese_electra_base/model/transformer_encoder.py b/modules/text/semantic_model/chinese_electra_base/model/transformer_encoder.py deleted file mode 100644 index 53051cde80308a17a30d9b92de11c712b63da406..0000000000000000000000000000000000000000 --- a/modules/text/semantic_model/chinese_electra_base/model/transformer_encoder.py +++ /dev/null @@ -1,288 +0,0 @@ -# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -"""Transformer encoder.""" - -from __future__ import absolute_import -from __future__ import division -from __future__ import print_function - -from functools import partial - -import paddle.fluid as fluid -import paddle.fluid.layers as layers - - -def multi_head_attention(queries, - keys, - values, - attn_bias, - d_key, - d_value, - d_model, - n_head=1, - dropout_rate=0., - cache=None, - param_initializer=None, - name='multi_head_att'): - """ - Multi-Head Attention. Note that attn_bias is added to the logit before - computing softmax activiation to mask certain selected positions so that - they will not considered in attention weights. 
- """ - keys = queries if keys is None else keys - values = keys if values is None else values - - if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): - raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") - - def __compute_qkv(queries, keys, values, n_head, d_key, d_value): - """ - Add linear projection to queries, keys, and values. - """ - q = layers.fc( - input=queries, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), - bias_attr=name + '_query_fc.b_0') - k = layers.fc( - input=keys, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), - bias_attr=name + '_key_fc.b_0') - v = layers.fc( - input=values, - size=d_value * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), - bias_attr=name + '_value_fc.b_0') - return q, k, v - - def __split_heads(x, n_head): - """ - Reshape the last dimension of inpunt tensor x so that it becomes two - dimensions and then transpose. Specifically, input a tensor with shape - [bs, max_sequence_length, n_head * hidden_dim] then output a tensor - with shape [bs, n_head, max_sequence_length, hidden_dim]. - """ - hidden_size = x.shape[-1] - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) - - # permuate the dimensions into: - # [batch_size, n_head, max_sequence_len, hidden_size_per_head] - return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) - - def __combine_heads(x): - """ - Transpose and then reshape the last two dimensions of inpunt tensor x - so that it becomes one dimension, which is reverse to __split_heads. - """ - if len(x.shape) == 3: return x - if len(x.shape) != 4: - raise ValueError("Input(x) should be a 4-D Tensor.") - - trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) - - def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): - """ - Scaled Dot-Product Attention - """ - scaled_q = layers.scale(x=q, scale=d_key**-0.5) - product = layers.matmul(x=scaled_q, y=k, transpose_y=True) - if attn_bias: - product += attn_bias - weights = layers.softmax(product) - if dropout_rate: - weights = layers.dropout( - weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.matmul(weights, v) - return out - - q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) - - if cache is not None: # use cache and concat time steps - # Since the inplace reshape in __split_heads changes the shape of k and - # v, which is the cache input for next time step, reshape the cache - # input from the previous time step first. 
- k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) - v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) - - q = __split_heads(q, n_head) - k = __split_heads(k, n_head) - v = __split_heads(v, n_head) - - ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) - - out = __combine_heads(ctx_multiheads) - - # Project back to the model size. - proj_out = layers.fc( - input=out, - size=d_model, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), - bias_attr=name + '_output_fc.b_0') - return proj_out - - -def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): - """ - Position-wise Feed-Forward Networks. - This module consists of two linear transformations with a ReLU activation - in between, which is applied to each position separately and identically. - """ - hidden = layers.fc( - input=x, - size=d_inner_hid, - num_flatten_dims=2, - act=hidden_act, - param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), - bias_attr=name + '_fc_0.b_0') - if dropout_rate: - hidden = layers.dropout( - hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.fc( - input=hidden, - size=d_hid, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), - bias_attr=name + '_fc_1.b_0') - return out - - -def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): - """ - Add residual connection, layer normalization and droput to the out tensor - optionally according to the value of process_cmd. - This will be used before or after multi-head attention and position-wise - feed-forward networks. - """ - for cmd in process_cmd: - if cmd == "a": # add residual connection - out = out + prev_out if prev_out else out - elif cmd == "n": # add layer normalization - out_dtype = out.dtype - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float32") - out = layers.layer_norm( - out, - begin_norm_axis=len(out.shape) - 1, - param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), - bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.))) - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float16") - elif cmd == "d": # add dropout - if dropout_rate: - out = layers.dropout( - out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - return out - - -pre_process_layer = partial(pre_post_process_layer, None) -post_process_layer = pre_post_process_layer - - -def encoder_layer(enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """The encoder layers that can be stacked to form a deep encoder. - This module consits of a multi-head (self) attention followed by - position-wise feed-forward networks and both the two components companied - with the post_process_layer to add residual connection, layer normalization - and droput. 
- """ - attn_output = multi_head_attention( - pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'), - None, - None, - attn_bias, - d_key, - d_value, - d_model, - n_head, - attention_dropout, - param_initializer=param_initializer, - name=name + '_multi_head_att') - attn_output = post_process_layer( - enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att') - ffd_output = positionwise_feed_forward( - pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'), - d_inner_hid, - d_model, - relu_dropout, - hidden_act, - param_initializer=param_initializer, - name=name + '_ffn') - return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') - - -def encoder(enc_input, - attn_bias, - n_layer, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """ - The encoder is composed of a stack of identical layers returned by calling - encoder_layer. - """ - for i in range(n_layer): - enc_output = encoder_layer( - enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd, - postprocess_cmd, - param_initializer=param_initializer, - name=name + '_layer_' + str(i)) - enc_input = enc_output - enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") - - return enc_output diff --git a/modules/text/semantic_model/chinese_electra_small/model/transformer_encoder.py b/modules/text/semantic_model/chinese_electra_small/model/transformer_encoder.py deleted file mode 100644 index 53051cde80308a17a30d9b92de11c712b63da406..0000000000000000000000000000000000000000 --- a/modules/text/semantic_model/chinese_electra_small/model/transformer_encoder.py +++ /dev/null @@ -1,288 +0,0 @@ -# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -"""Transformer encoder.""" - -from __future__ import absolute_import -from __future__ import division -from __future__ import print_function - -from functools import partial - -import paddle.fluid as fluid -import paddle.fluid.layers as layers - - -def multi_head_attention(queries, - keys, - values, - attn_bias, - d_key, - d_value, - d_model, - n_head=1, - dropout_rate=0., - cache=None, - param_initializer=None, - name='multi_head_att'): - """ - Multi-Head Attention. Note that attn_bias is added to the logit before - computing softmax activiation to mask certain selected positions so that - they will not considered in attention weights. 
- """ - keys = queries if keys is None else keys - values = keys if values is None else values - - if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): - raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") - - def __compute_qkv(queries, keys, values, n_head, d_key, d_value): - """ - Add linear projection to queries, keys, and values. - """ - q = layers.fc( - input=queries, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), - bias_attr=name + '_query_fc.b_0') - k = layers.fc( - input=keys, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), - bias_attr=name + '_key_fc.b_0') - v = layers.fc( - input=values, - size=d_value * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), - bias_attr=name + '_value_fc.b_0') - return q, k, v - - def __split_heads(x, n_head): - """ - Reshape the last dimension of inpunt tensor x so that it becomes two - dimensions and then transpose. Specifically, input a tensor with shape - [bs, max_sequence_length, n_head * hidden_dim] then output a tensor - with shape [bs, n_head, max_sequence_length, hidden_dim]. - """ - hidden_size = x.shape[-1] - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) - - # permuate the dimensions into: - # [batch_size, n_head, max_sequence_len, hidden_size_per_head] - return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) - - def __combine_heads(x): - """ - Transpose and then reshape the last two dimensions of inpunt tensor x - so that it becomes one dimension, which is reverse to __split_heads. - """ - if len(x.shape) == 3: return x - if len(x.shape) != 4: - raise ValueError("Input(x) should be a 4-D Tensor.") - - trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) - - def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): - """ - Scaled Dot-Product Attention - """ - scaled_q = layers.scale(x=q, scale=d_key**-0.5) - product = layers.matmul(x=scaled_q, y=k, transpose_y=True) - if attn_bias: - product += attn_bias - weights = layers.softmax(product) - if dropout_rate: - weights = layers.dropout( - weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.matmul(weights, v) - return out - - q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) - - if cache is not None: # use cache and concat time steps - # Since the inplace reshape in __split_heads changes the shape of k and - # v, which is the cache input for next time step, reshape the cache - # input from the previous time step first. 
- k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) - v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) - - q = __split_heads(q, n_head) - k = __split_heads(k, n_head) - v = __split_heads(v, n_head) - - ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) - - out = __combine_heads(ctx_multiheads) - - # Project back to the model size. - proj_out = layers.fc( - input=out, - size=d_model, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), - bias_attr=name + '_output_fc.b_0') - return proj_out - - -def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): - """ - Position-wise Feed-Forward Networks. - This module consists of two linear transformations with a ReLU activation - in between, which is applied to each position separately and identically. - """ - hidden = layers.fc( - input=x, - size=d_inner_hid, - num_flatten_dims=2, - act=hidden_act, - param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), - bias_attr=name + '_fc_0.b_0') - if dropout_rate: - hidden = layers.dropout( - hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.fc( - input=hidden, - size=d_hid, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), - bias_attr=name + '_fc_1.b_0') - return out - - -def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): - """ - Add residual connection, layer normalization and droput to the out tensor - optionally according to the value of process_cmd. - This will be used before or after multi-head attention and position-wise - feed-forward networks. - """ - for cmd in process_cmd: - if cmd == "a": # add residual connection - out = out + prev_out if prev_out else out - elif cmd == "n": # add layer normalization - out_dtype = out.dtype - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float32") - out = layers.layer_norm( - out, - begin_norm_axis=len(out.shape) - 1, - param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), - bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.))) - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float16") - elif cmd == "d": # add dropout - if dropout_rate: - out = layers.dropout( - out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - return out - - -pre_process_layer = partial(pre_post_process_layer, None) -post_process_layer = pre_post_process_layer - - -def encoder_layer(enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """The encoder layers that can be stacked to form a deep encoder. - This module consits of a multi-head (self) attention followed by - position-wise feed-forward networks and both the two components companied - with the post_process_layer to add residual connection, layer normalization - and droput. 
- """ - attn_output = multi_head_attention( - pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'), - None, - None, - attn_bias, - d_key, - d_value, - d_model, - n_head, - attention_dropout, - param_initializer=param_initializer, - name=name + '_multi_head_att') - attn_output = post_process_layer( - enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att') - ffd_output = positionwise_feed_forward( - pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'), - d_inner_hid, - d_model, - relu_dropout, - hidden_act, - param_initializer=param_initializer, - name=name + '_ffn') - return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') - - -def encoder(enc_input, - attn_bias, - n_layer, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """ - The encoder is composed of a stack of identical layers returned by calling - encoder_layer. - """ - for i in range(n_layer): - enc_output = encoder_layer( - enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd, - postprocess_cmd, - param_initializer=param_initializer, - name=name + '_layer_' + str(i)) - enc_input = enc_output - enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") - - return enc_output diff --git a/modules/text/semantic_model/chinese_roberta_wwm_ext/model/transformer_encoder.py b/modules/text/semantic_model/chinese_roberta_wwm_ext/model/transformer_encoder.py deleted file mode 100644 index 53051cde80308a17a30d9b92de11c712b63da406..0000000000000000000000000000000000000000 --- a/modules/text/semantic_model/chinese_roberta_wwm_ext/model/transformer_encoder.py +++ /dev/null @@ -1,288 +0,0 @@ -# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -"""Transformer encoder.""" - -from __future__ import absolute_import -from __future__ import division -from __future__ import print_function - -from functools import partial - -import paddle.fluid as fluid -import paddle.fluid.layers as layers - - -def multi_head_attention(queries, - keys, - values, - attn_bias, - d_key, - d_value, - d_model, - n_head=1, - dropout_rate=0., - cache=None, - param_initializer=None, - name='multi_head_att'): - """ - Multi-Head Attention. Note that attn_bias is added to the logit before - computing softmax activiation to mask certain selected positions so that - they will not considered in attention weights. 
- """ - keys = queries if keys is None else keys - values = keys if values is None else values - - if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): - raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") - - def __compute_qkv(queries, keys, values, n_head, d_key, d_value): - """ - Add linear projection to queries, keys, and values. - """ - q = layers.fc( - input=queries, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), - bias_attr=name + '_query_fc.b_0') - k = layers.fc( - input=keys, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), - bias_attr=name + '_key_fc.b_0') - v = layers.fc( - input=values, - size=d_value * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), - bias_attr=name + '_value_fc.b_0') - return q, k, v - - def __split_heads(x, n_head): - """ - Reshape the last dimension of inpunt tensor x so that it becomes two - dimensions and then transpose. Specifically, input a tensor with shape - [bs, max_sequence_length, n_head * hidden_dim] then output a tensor - with shape [bs, n_head, max_sequence_length, hidden_dim]. - """ - hidden_size = x.shape[-1] - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) - - # permuate the dimensions into: - # [batch_size, n_head, max_sequence_len, hidden_size_per_head] - return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) - - def __combine_heads(x): - """ - Transpose and then reshape the last two dimensions of inpunt tensor x - so that it becomes one dimension, which is reverse to __split_heads. - """ - if len(x.shape) == 3: return x - if len(x.shape) != 4: - raise ValueError("Input(x) should be a 4-D Tensor.") - - trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) - - def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): - """ - Scaled Dot-Product Attention - """ - scaled_q = layers.scale(x=q, scale=d_key**-0.5) - product = layers.matmul(x=scaled_q, y=k, transpose_y=True) - if attn_bias: - product += attn_bias - weights = layers.softmax(product) - if dropout_rate: - weights = layers.dropout( - weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.matmul(weights, v) - return out - - q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) - - if cache is not None: # use cache and concat time steps - # Since the inplace reshape in __split_heads changes the shape of k and - # v, which is the cache input for next time step, reshape the cache - # input from the previous time step first. 
- k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) - v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) - - q = __split_heads(q, n_head) - k = __split_heads(k, n_head) - v = __split_heads(v, n_head) - - ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) - - out = __combine_heads(ctx_multiheads) - - # Project back to the model size. - proj_out = layers.fc( - input=out, - size=d_model, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), - bias_attr=name + '_output_fc.b_0') - return proj_out - - -def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): - """ - Position-wise Feed-Forward Networks. - This module consists of two linear transformations with a ReLU activation - in between, which is applied to each position separately and identically. - """ - hidden = layers.fc( - input=x, - size=d_inner_hid, - num_flatten_dims=2, - act=hidden_act, - param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), - bias_attr=name + '_fc_0.b_0') - if dropout_rate: - hidden = layers.dropout( - hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.fc( - input=hidden, - size=d_hid, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), - bias_attr=name + '_fc_1.b_0') - return out - - -def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): - """ - Add residual connection, layer normalization and droput to the out tensor - optionally according to the value of process_cmd. - This will be used before or after multi-head attention and position-wise - feed-forward networks. - """ - for cmd in process_cmd: - if cmd == "a": # add residual connection - out = out + prev_out if prev_out else out - elif cmd == "n": # add layer normalization - out_dtype = out.dtype - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float32") - out = layers.layer_norm( - out, - begin_norm_axis=len(out.shape) - 1, - param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), - bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.))) - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float16") - elif cmd == "d": # add dropout - if dropout_rate: - out = layers.dropout( - out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - return out - - -pre_process_layer = partial(pre_post_process_layer, None) -post_process_layer = pre_post_process_layer - - -def encoder_layer(enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """The encoder layers that can be stacked to form a deep encoder. - This module consits of a multi-head (self) attention followed by - position-wise feed-forward networks and both the two components companied - with the post_process_layer to add residual connection, layer normalization - and droput. 
- """ - attn_output = multi_head_attention( - pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'), - None, - None, - attn_bias, - d_key, - d_value, - d_model, - n_head, - attention_dropout, - param_initializer=param_initializer, - name=name + '_multi_head_att') - attn_output = post_process_layer( - enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att') - ffd_output = positionwise_feed_forward( - pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'), - d_inner_hid, - d_model, - relu_dropout, - hidden_act, - param_initializer=param_initializer, - name=name + '_ffn') - return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') - - -def encoder(enc_input, - attn_bias, - n_layer, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """ - The encoder is composed of a stack of identical layers returned by calling - encoder_layer. - """ - for i in range(n_layer): - enc_output = encoder_layer( - enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd, - postprocess_cmd, - param_initializer=param_initializer, - name=name + '_layer_' + str(i)) - enc_input = enc_output - enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") - - return enc_output diff --git a/modules/text/semantic_model/chinese_roberta_wwm_ext_large/model/transformer_encoder.py b/modules/text/semantic_model/chinese_roberta_wwm_ext_large/model/transformer_encoder.py deleted file mode 100644 index 53051cde80308a17a30d9b92de11c712b63da406..0000000000000000000000000000000000000000 --- a/modules/text/semantic_model/chinese_roberta_wwm_ext_large/model/transformer_encoder.py +++ /dev/null @@ -1,288 +0,0 @@ -# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -"""Transformer encoder.""" - -from __future__ import absolute_import -from __future__ import division -from __future__ import print_function - -from functools import partial - -import paddle.fluid as fluid -import paddle.fluid.layers as layers - - -def multi_head_attention(queries, - keys, - values, - attn_bias, - d_key, - d_value, - d_model, - n_head=1, - dropout_rate=0., - cache=None, - param_initializer=None, - name='multi_head_att'): - """ - Multi-Head Attention. Note that attn_bias is added to the logit before - computing softmax activiation to mask certain selected positions so that - they will not considered in attention weights. 
- """ - keys = queries if keys is None else keys - values = keys if values is None else values - - if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): - raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") - - def __compute_qkv(queries, keys, values, n_head, d_key, d_value): - """ - Add linear projection to queries, keys, and values. - """ - q = layers.fc( - input=queries, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), - bias_attr=name + '_query_fc.b_0') - k = layers.fc( - input=keys, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), - bias_attr=name + '_key_fc.b_0') - v = layers.fc( - input=values, - size=d_value * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), - bias_attr=name + '_value_fc.b_0') - return q, k, v - - def __split_heads(x, n_head): - """ - Reshape the last dimension of inpunt tensor x so that it becomes two - dimensions and then transpose. Specifically, input a tensor with shape - [bs, max_sequence_length, n_head * hidden_dim] then output a tensor - with shape [bs, n_head, max_sequence_length, hidden_dim]. - """ - hidden_size = x.shape[-1] - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) - - # permuate the dimensions into: - # [batch_size, n_head, max_sequence_len, hidden_size_per_head] - return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) - - def __combine_heads(x): - """ - Transpose and then reshape the last two dimensions of inpunt tensor x - so that it becomes one dimension, which is reverse to __split_heads. - """ - if len(x.shape) == 3: return x - if len(x.shape) != 4: - raise ValueError("Input(x) should be a 4-D Tensor.") - - trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) - - def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): - """ - Scaled Dot-Product Attention - """ - scaled_q = layers.scale(x=q, scale=d_key**-0.5) - product = layers.matmul(x=scaled_q, y=k, transpose_y=True) - if attn_bias: - product += attn_bias - weights = layers.softmax(product) - if dropout_rate: - weights = layers.dropout( - weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.matmul(weights, v) - return out - - q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) - - if cache is not None: # use cache and concat time steps - # Since the inplace reshape in __split_heads changes the shape of k and - # v, which is the cache input for next time step, reshape the cache - # input from the previous time step first. 
- k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) - v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) - - q = __split_heads(q, n_head) - k = __split_heads(k, n_head) - v = __split_heads(v, n_head) - - ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) - - out = __combine_heads(ctx_multiheads) - - # Project back to the model size. - proj_out = layers.fc( - input=out, - size=d_model, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), - bias_attr=name + '_output_fc.b_0') - return proj_out - - -def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): - """ - Position-wise Feed-Forward Networks. - This module consists of two linear transformations with a ReLU activation - in between, which is applied to each position separately and identically. - """ - hidden = layers.fc( - input=x, - size=d_inner_hid, - num_flatten_dims=2, - act=hidden_act, - param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), - bias_attr=name + '_fc_0.b_0') - if dropout_rate: - hidden = layers.dropout( - hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.fc( - input=hidden, - size=d_hid, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), - bias_attr=name + '_fc_1.b_0') - return out - - -def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): - """ - Add residual connection, layer normalization and droput to the out tensor - optionally according to the value of process_cmd. - This will be used before or after multi-head attention and position-wise - feed-forward networks. - """ - for cmd in process_cmd: - if cmd == "a": # add residual connection - out = out + prev_out if prev_out else out - elif cmd == "n": # add layer normalization - out_dtype = out.dtype - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float32") - out = layers.layer_norm( - out, - begin_norm_axis=len(out.shape) - 1, - param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), - bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.))) - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float16") - elif cmd == "d": # add dropout - if dropout_rate: - out = layers.dropout( - out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - return out - - -pre_process_layer = partial(pre_post_process_layer, None) -post_process_layer = pre_post_process_layer - - -def encoder_layer(enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """The encoder layers that can be stacked to form a deep encoder. - This module consits of a multi-head (self) attention followed by - position-wise feed-forward networks and both the two components companied - with the post_process_layer to add residual connection, layer normalization - and droput. 
- """ - attn_output = multi_head_attention( - pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'), - None, - None, - attn_bias, - d_key, - d_value, - d_model, - n_head, - attention_dropout, - param_initializer=param_initializer, - name=name + '_multi_head_att') - attn_output = post_process_layer( - enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att') - ffd_output = positionwise_feed_forward( - pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'), - d_inner_hid, - d_model, - relu_dropout, - hidden_act, - param_initializer=param_initializer, - name=name + '_ffn') - return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') - - -def encoder(enc_input, - attn_bias, - n_layer, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """ - The encoder is composed of a stack of identical layers returned by calling - encoder_layer. - """ - for i in range(n_layer): - enc_output = encoder_layer( - enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd, - postprocess_cmd, - param_initializer=param_initializer, - name=name + '_layer_' + str(i)) - enc_input = enc_output - enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") - - return enc_output diff --git a/modules/text/semantic_model/ernie/model/transformer_encoder.py b/modules/text/semantic_model/ernie/model/transformer_encoder.py deleted file mode 100644 index 53051cde80308a17a30d9b92de11c712b63da406..0000000000000000000000000000000000000000 --- a/modules/text/semantic_model/ernie/model/transformer_encoder.py +++ /dev/null @@ -1,288 +0,0 @@ -# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -"""Transformer encoder.""" - -from __future__ import absolute_import -from __future__ import division -from __future__ import print_function - -from functools import partial - -import paddle.fluid as fluid -import paddle.fluid.layers as layers - - -def multi_head_attention(queries, - keys, - values, - attn_bias, - d_key, - d_value, - d_model, - n_head=1, - dropout_rate=0., - cache=None, - param_initializer=None, - name='multi_head_att'): - """ - Multi-Head Attention. Note that attn_bias is added to the logit before - computing softmax activiation to mask certain selected positions so that - they will not considered in attention weights. 
- """ - keys = queries if keys is None else keys - values = keys if values is None else values - - if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): - raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") - - def __compute_qkv(queries, keys, values, n_head, d_key, d_value): - """ - Add linear projection to queries, keys, and values. - """ - q = layers.fc( - input=queries, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), - bias_attr=name + '_query_fc.b_0') - k = layers.fc( - input=keys, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), - bias_attr=name + '_key_fc.b_0') - v = layers.fc( - input=values, - size=d_value * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), - bias_attr=name + '_value_fc.b_0') - return q, k, v - - def __split_heads(x, n_head): - """ - Reshape the last dimension of inpunt tensor x so that it becomes two - dimensions and then transpose. Specifically, input a tensor with shape - [bs, max_sequence_length, n_head * hidden_dim] then output a tensor - with shape [bs, n_head, max_sequence_length, hidden_dim]. - """ - hidden_size = x.shape[-1] - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) - - # permuate the dimensions into: - # [batch_size, n_head, max_sequence_len, hidden_size_per_head] - return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) - - def __combine_heads(x): - """ - Transpose and then reshape the last two dimensions of inpunt tensor x - so that it becomes one dimension, which is reverse to __split_heads. - """ - if len(x.shape) == 3: return x - if len(x.shape) != 4: - raise ValueError("Input(x) should be a 4-D Tensor.") - - trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) - - def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): - """ - Scaled Dot-Product Attention - """ - scaled_q = layers.scale(x=q, scale=d_key**-0.5) - product = layers.matmul(x=scaled_q, y=k, transpose_y=True) - if attn_bias: - product += attn_bias - weights = layers.softmax(product) - if dropout_rate: - weights = layers.dropout( - weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.matmul(weights, v) - return out - - q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) - - if cache is not None: # use cache and concat time steps - # Since the inplace reshape in __split_heads changes the shape of k and - # v, which is the cache input for next time step, reshape the cache - # input from the previous time step first. 
- k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) - v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) - - q = __split_heads(q, n_head) - k = __split_heads(k, n_head) - v = __split_heads(v, n_head) - - ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) - - out = __combine_heads(ctx_multiheads) - - # Project back to the model size. - proj_out = layers.fc( - input=out, - size=d_model, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), - bias_attr=name + '_output_fc.b_0') - return proj_out - - -def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): - """ - Position-wise Feed-Forward Networks. - This module consists of two linear transformations with a ReLU activation - in between, which is applied to each position separately and identically. - """ - hidden = layers.fc( - input=x, - size=d_inner_hid, - num_flatten_dims=2, - act=hidden_act, - param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), - bias_attr=name + '_fc_0.b_0') - if dropout_rate: - hidden = layers.dropout( - hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.fc( - input=hidden, - size=d_hid, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), - bias_attr=name + '_fc_1.b_0') - return out - - -def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): - """ - Add residual connection, layer normalization and droput to the out tensor - optionally according to the value of process_cmd. - This will be used before or after multi-head attention and position-wise - feed-forward networks. - """ - for cmd in process_cmd: - if cmd == "a": # add residual connection - out = out + prev_out if prev_out else out - elif cmd == "n": # add layer normalization - out_dtype = out.dtype - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float32") - out = layers.layer_norm( - out, - begin_norm_axis=len(out.shape) - 1, - param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), - bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.))) - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float16") - elif cmd == "d": # add dropout - if dropout_rate: - out = layers.dropout( - out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - return out - - -pre_process_layer = partial(pre_post_process_layer, None) -post_process_layer = pre_post_process_layer - - -def encoder_layer(enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """The encoder layers that can be stacked to form a deep encoder. - This module consits of a multi-head (self) attention followed by - position-wise feed-forward networks and both the two components companied - with the post_process_layer to add residual connection, layer normalization - and droput. 
- """ - attn_output = multi_head_attention( - pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'), - None, - None, - attn_bias, - d_key, - d_value, - d_model, - n_head, - attention_dropout, - param_initializer=param_initializer, - name=name + '_multi_head_att') - attn_output = post_process_layer( - enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att') - ffd_output = positionwise_feed_forward( - pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'), - d_inner_hid, - d_model, - relu_dropout, - hidden_act, - param_initializer=param_initializer, - name=name + '_ffn') - return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') - - -def encoder(enc_input, - attn_bias, - n_layer, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """ - The encoder is composed of a stack of identical layers returned by calling - encoder_layer. - """ - for i in range(n_layer): - enc_output = encoder_layer( - enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd, - postprocess_cmd, - param_initializer=param_initializer, - name=name + '_layer_' + str(i)) - enc_input = enc_output - enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") - - return enc_output diff --git a/modules/text/semantic_model/ernie_tiny/model/transformer_encoder.py b/modules/text/semantic_model/ernie_tiny/model/transformer_encoder.py deleted file mode 100644 index 53051cde80308a17a30d9b92de11c712b63da406..0000000000000000000000000000000000000000 --- a/modules/text/semantic_model/ernie_tiny/model/transformer_encoder.py +++ /dev/null @@ -1,288 +0,0 @@ -# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -"""Transformer encoder.""" - -from __future__ import absolute_import -from __future__ import division -from __future__ import print_function - -from functools import partial - -import paddle.fluid as fluid -import paddle.fluid.layers as layers - - -def multi_head_attention(queries, - keys, - values, - attn_bias, - d_key, - d_value, - d_model, - n_head=1, - dropout_rate=0., - cache=None, - param_initializer=None, - name='multi_head_att'): - """ - Multi-Head Attention. Note that attn_bias is added to the logit before - computing softmax activiation to mask certain selected positions so that - they will not considered in attention weights. 
- """ - keys = queries if keys is None else keys - values = keys if values is None else values - - if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): - raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") - - def __compute_qkv(queries, keys, values, n_head, d_key, d_value): - """ - Add linear projection to queries, keys, and values. - """ - q = layers.fc( - input=queries, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), - bias_attr=name + '_query_fc.b_0') - k = layers.fc( - input=keys, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), - bias_attr=name + '_key_fc.b_0') - v = layers.fc( - input=values, - size=d_value * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), - bias_attr=name + '_value_fc.b_0') - return q, k, v - - def __split_heads(x, n_head): - """ - Reshape the last dimension of inpunt tensor x so that it becomes two - dimensions and then transpose. Specifically, input a tensor with shape - [bs, max_sequence_length, n_head * hidden_dim] then output a tensor - with shape [bs, n_head, max_sequence_length, hidden_dim]. - """ - hidden_size = x.shape[-1] - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) - - # permuate the dimensions into: - # [batch_size, n_head, max_sequence_len, hidden_size_per_head] - return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) - - def __combine_heads(x): - """ - Transpose and then reshape the last two dimensions of inpunt tensor x - so that it becomes one dimension, which is reverse to __split_heads. - """ - if len(x.shape) == 3: return x - if len(x.shape) != 4: - raise ValueError("Input(x) should be a 4-D Tensor.") - - trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) - - def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): - """ - Scaled Dot-Product Attention - """ - scaled_q = layers.scale(x=q, scale=d_key**-0.5) - product = layers.matmul(x=scaled_q, y=k, transpose_y=True) - if attn_bias: - product += attn_bias - weights = layers.softmax(product) - if dropout_rate: - weights = layers.dropout( - weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.matmul(weights, v) - return out - - q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) - - if cache is not None: # use cache and concat time steps - # Since the inplace reshape in __split_heads changes the shape of k and - # v, which is the cache input for next time step, reshape the cache - # input from the previous time step first. 
- k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) - v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) - - q = __split_heads(q, n_head) - k = __split_heads(k, n_head) - v = __split_heads(v, n_head) - - ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) - - out = __combine_heads(ctx_multiheads) - - # Project back to the model size. - proj_out = layers.fc( - input=out, - size=d_model, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), - bias_attr=name + '_output_fc.b_0') - return proj_out - - -def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): - """ - Position-wise Feed-Forward Networks. - This module consists of two linear transformations with a ReLU activation - in between, which is applied to each position separately and identically. - """ - hidden = layers.fc( - input=x, - size=d_inner_hid, - num_flatten_dims=2, - act=hidden_act, - param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), - bias_attr=name + '_fc_0.b_0') - if dropout_rate: - hidden = layers.dropout( - hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.fc( - input=hidden, - size=d_hid, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), - bias_attr=name + '_fc_1.b_0') - return out - - -def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): - """ - Add residual connection, layer normalization and droput to the out tensor - optionally according to the value of process_cmd. - This will be used before or after multi-head attention and position-wise - feed-forward networks. - """ - for cmd in process_cmd: - if cmd == "a": # add residual connection - out = out + prev_out if prev_out else out - elif cmd == "n": # add layer normalization - out_dtype = out.dtype - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float32") - out = layers.layer_norm( - out, - begin_norm_axis=len(out.shape) - 1, - param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), - bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.))) - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float16") - elif cmd == "d": # add dropout - if dropout_rate: - out = layers.dropout( - out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - return out - - -pre_process_layer = partial(pre_post_process_layer, None) -post_process_layer = pre_post_process_layer - - -def encoder_layer(enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """The encoder layers that can be stacked to form a deep encoder. - This module consits of a multi-head (self) attention followed by - position-wise feed-forward networks and both the two components companied - with the post_process_layer to add residual connection, layer normalization - and droput. 
- """ - attn_output = multi_head_attention( - pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'), - None, - None, - attn_bias, - d_key, - d_value, - d_model, - n_head, - attention_dropout, - param_initializer=param_initializer, - name=name + '_multi_head_att') - attn_output = post_process_layer( - enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att') - ffd_output = positionwise_feed_forward( - pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'), - d_inner_hid, - d_model, - relu_dropout, - hidden_act, - param_initializer=param_initializer, - name=name + '_ffn') - return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') - - -def encoder(enc_input, - attn_bias, - n_layer, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """ - The encoder is composed of a stack of identical layers returned by calling - encoder_layer. - """ - for i in range(n_layer): - enc_output = encoder_layer( - enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd, - postprocess_cmd, - param_initializer=param_initializer, - name=name + '_layer_' + str(i)) - enc_input = enc_output - enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") - - return enc_output diff --git a/modules/text/semantic_model/ernie_v2_eng_base/model/transformer_encoder.py b/modules/text/semantic_model/ernie_v2_eng_base/model/transformer_encoder.py deleted file mode 100644 index 53051cde80308a17a30d9b92de11c712b63da406..0000000000000000000000000000000000000000 --- a/modules/text/semantic_model/ernie_v2_eng_base/model/transformer_encoder.py +++ /dev/null @@ -1,288 +0,0 @@ -# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -"""Transformer encoder.""" - -from __future__ import absolute_import -from __future__ import division -from __future__ import print_function - -from functools import partial - -import paddle.fluid as fluid -import paddle.fluid.layers as layers - - -def multi_head_attention(queries, - keys, - values, - attn_bias, - d_key, - d_value, - d_model, - n_head=1, - dropout_rate=0., - cache=None, - param_initializer=None, - name='multi_head_att'): - """ - Multi-Head Attention. Note that attn_bias is added to the logit before - computing softmax activiation to mask certain selected positions so that - they will not considered in attention weights. 
- """ - keys = queries if keys is None else keys - values = keys if values is None else values - - if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): - raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") - - def __compute_qkv(queries, keys, values, n_head, d_key, d_value): - """ - Add linear projection to queries, keys, and values. - """ - q = layers.fc( - input=queries, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), - bias_attr=name + '_query_fc.b_0') - k = layers.fc( - input=keys, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), - bias_attr=name + '_key_fc.b_0') - v = layers.fc( - input=values, - size=d_value * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), - bias_attr=name + '_value_fc.b_0') - return q, k, v - - def __split_heads(x, n_head): - """ - Reshape the last dimension of inpunt tensor x so that it becomes two - dimensions and then transpose. Specifically, input a tensor with shape - [bs, max_sequence_length, n_head * hidden_dim] then output a tensor - with shape [bs, n_head, max_sequence_length, hidden_dim]. - """ - hidden_size = x.shape[-1] - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) - - # permuate the dimensions into: - # [batch_size, n_head, max_sequence_len, hidden_size_per_head] - return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) - - def __combine_heads(x): - """ - Transpose and then reshape the last two dimensions of inpunt tensor x - so that it becomes one dimension, which is reverse to __split_heads. - """ - if len(x.shape) == 3: return x - if len(x.shape) != 4: - raise ValueError("Input(x) should be a 4-D Tensor.") - - trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) - - def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): - """ - Scaled Dot-Product Attention - """ - scaled_q = layers.scale(x=q, scale=d_key**-0.5) - product = layers.matmul(x=scaled_q, y=k, transpose_y=True) - if attn_bias: - product += attn_bias - weights = layers.softmax(product) - if dropout_rate: - weights = layers.dropout( - weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.matmul(weights, v) - return out - - q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) - - if cache is not None: # use cache and concat time steps - # Since the inplace reshape in __split_heads changes the shape of k and - # v, which is the cache input for next time step, reshape the cache - # input from the previous time step first. 
- k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) - v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) - - q = __split_heads(q, n_head) - k = __split_heads(k, n_head) - v = __split_heads(v, n_head) - - ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) - - out = __combine_heads(ctx_multiheads) - - # Project back to the model size. - proj_out = layers.fc( - input=out, - size=d_model, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), - bias_attr=name + '_output_fc.b_0') - return proj_out - - -def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): - """ - Position-wise Feed-Forward Networks. - This module consists of two linear transformations with a ReLU activation - in between, which is applied to each position separately and identically. - """ - hidden = layers.fc( - input=x, - size=d_inner_hid, - num_flatten_dims=2, - act=hidden_act, - param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), - bias_attr=name + '_fc_0.b_0') - if dropout_rate: - hidden = layers.dropout( - hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.fc( - input=hidden, - size=d_hid, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), - bias_attr=name + '_fc_1.b_0') - return out - - -def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): - """ - Add residual connection, layer normalization and droput to the out tensor - optionally according to the value of process_cmd. - This will be used before or after multi-head attention and position-wise - feed-forward networks. - """ - for cmd in process_cmd: - if cmd == "a": # add residual connection - out = out + prev_out if prev_out else out - elif cmd == "n": # add layer normalization - out_dtype = out.dtype - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float32") - out = layers.layer_norm( - out, - begin_norm_axis=len(out.shape) - 1, - param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), - bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.))) - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float16") - elif cmd == "d": # add dropout - if dropout_rate: - out = layers.dropout( - out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - return out - - -pre_process_layer = partial(pre_post_process_layer, None) -post_process_layer = pre_post_process_layer - - -def encoder_layer(enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """The encoder layers that can be stacked to form a deep encoder. - This module consits of a multi-head (self) attention followed by - position-wise feed-forward networks and both the two components companied - with the post_process_layer to add residual connection, layer normalization - and droput. 
- """ - attn_output = multi_head_attention( - pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'), - None, - None, - attn_bias, - d_key, - d_value, - d_model, - n_head, - attention_dropout, - param_initializer=param_initializer, - name=name + '_multi_head_att') - attn_output = post_process_layer( - enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att') - ffd_output = positionwise_feed_forward( - pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'), - d_inner_hid, - d_model, - relu_dropout, - hidden_act, - param_initializer=param_initializer, - name=name + '_ffn') - return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') - - -def encoder(enc_input, - attn_bias, - n_layer, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """ - The encoder is composed of a stack of identical layers returned by calling - encoder_layer. - """ - for i in range(n_layer): - enc_output = encoder_layer( - enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd, - postprocess_cmd, - param_initializer=param_initializer, - name=name + '_layer_' + str(i)) - enc_input = enc_output - enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") - - return enc_output diff --git a/modules/text/semantic_model/ernie_v2_eng_large/model/transformer_encoder.py b/modules/text/semantic_model/ernie_v2_eng_large/model/transformer_encoder.py deleted file mode 100644 index 53051cde80308a17a30d9b92de11c712b63da406..0000000000000000000000000000000000000000 --- a/modules/text/semantic_model/ernie_v2_eng_large/model/transformer_encoder.py +++ /dev/null @@ -1,288 +0,0 @@ -# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -"""Transformer encoder.""" - -from __future__ import absolute_import -from __future__ import division -from __future__ import print_function - -from functools import partial - -import paddle.fluid as fluid -import paddle.fluid.layers as layers - - -def multi_head_attention(queries, - keys, - values, - attn_bias, - d_key, - d_value, - d_model, - n_head=1, - dropout_rate=0., - cache=None, - param_initializer=None, - name='multi_head_att'): - """ - Multi-Head Attention. Note that attn_bias is added to the logit before - computing softmax activiation to mask certain selected positions so that - they will not considered in attention weights. 
- """ - keys = queries if keys is None else keys - values = keys if values is None else values - - if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): - raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") - - def __compute_qkv(queries, keys, values, n_head, d_key, d_value): - """ - Add linear projection to queries, keys, and values. - """ - q = layers.fc( - input=queries, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), - bias_attr=name + '_query_fc.b_0') - k = layers.fc( - input=keys, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), - bias_attr=name + '_key_fc.b_0') - v = layers.fc( - input=values, - size=d_value * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), - bias_attr=name + '_value_fc.b_0') - return q, k, v - - def __split_heads(x, n_head): - """ - Reshape the last dimension of inpunt tensor x so that it becomes two - dimensions and then transpose. Specifically, input a tensor with shape - [bs, max_sequence_length, n_head * hidden_dim] then output a tensor - with shape [bs, n_head, max_sequence_length, hidden_dim]. - """ - hidden_size = x.shape[-1] - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) - - # permuate the dimensions into: - # [batch_size, n_head, max_sequence_len, hidden_size_per_head] - return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) - - def __combine_heads(x): - """ - Transpose and then reshape the last two dimensions of inpunt tensor x - so that it becomes one dimension, which is reverse to __split_heads. - """ - if len(x.shape) == 3: return x - if len(x.shape) != 4: - raise ValueError("Input(x) should be a 4-D Tensor.") - - trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) - - def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): - """ - Scaled Dot-Product Attention - """ - scaled_q = layers.scale(x=q, scale=d_key**-0.5) - product = layers.matmul(x=scaled_q, y=k, transpose_y=True) - if attn_bias: - product += attn_bias - weights = layers.softmax(product) - if dropout_rate: - weights = layers.dropout( - weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.matmul(weights, v) - return out - - q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) - - if cache is not None: # use cache and concat time steps - # Since the inplace reshape in __split_heads changes the shape of k and - # v, which is the cache input for next time step, reshape the cache - # input from the previous time step first. 
- k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) - v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) - - q = __split_heads(q, n_head) - k = __split_heads(k, n_head) - v = __split_heads(v, n_head) - - ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) - - out = __combine_heads(ctx_multiheads) - - # Project back to the model size. - proj_out = layers.fc( - input=out, - size=d_model, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), - bias_attr=name + '_output_fc.b_0') - return proj_out - - -def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): - """ - Position-wise Feed-Forward Networks. - This module consists of two linear transformations with a ReLU activation - in between, which is applied to each position separately and identically. - """ - hidden = layers.fc( - input=x, - size=d_inner_hid, - num_flatten_dims=2, - act=hidden_act, - param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), - bias_attr=name + '_fc_0.b_0') - if dropout_rate: - hidden = layers.dropout( - hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.fc( - input=hidden, - size=d_hid, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), - bias_attr=name + '_fc_1.b_0') - return out - - -def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): - """ - Add residual connection, layer normalization and droput to the out tensor - optionally according to the value of process_cmd. - This will be used before or after multi-head attention and position-wise - feed-forward networks. - """ - for cmd in process_cmd: - if cmd == "a": # add residual connection - out = out + prev_out if prev_out else out - elif cmd == "n": # add layer normalization - out_dtype = out.dtype - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float32") - out = layers.layer_norm( - out, - begin_norm_axis=len(out.shape) - 1, - param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), - bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.))) - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float16") - elif cmd == "d": # add dropout - if dropout_rate: - out = layers.dropout( - out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - return out - - -pre_process_layer = partial(pre_post_process_layer, None) -post_process_layer = pre_post_process_layer - - -def encoder_layer(enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """The encoder layers that can be stacked to form a deep encoder. - This module consits of a multi-head (self) attention followed by - position-wise feed-forward networks and both the two components companied - with the post_process_layer to add residual connection, layer normalization - and droput. 
- """ - attn_output = multi_head_attention( - pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'), - None, - None, - attn_bias, - d_key, - d_value, - d_model, - n_head, - attention_dropout, - param_initializer=param_initializer, - name=name + '_multi_head_att') - attn_output = post_process_layer( - enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att') - ffd_output = positionwise_feed_forward( - pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'), - d_inner_hid, - d_model, - relu_dropout, - hidden_act, - param_initializer=param_initializer, - name=name + '_ffn') - return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') - - -def encoder(enc_input, - attn_bias, - n_layer, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """ - The encoder is composed of a stack of identical layers returned by calling - encoder_layer. - """ - for i in range(n_layer): - enc_output = encoder_layer( - enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd, - postprocess_cmd, - param_initializer=param_initializer, - name=name + '_layer_' + str(i)) - enc_input = enc_output - enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") - - return enc_output diff --git a/modules/text/semantic_model/rbt3/model/transformer_encoder.py b/modules/text/semantic_model/rbt3/model/transformer_encoder.py deleted file mode 100644 index 53051cde80308a17a30d9b92de11c712b63da406..0000000000000000000000000000000000000000 --- a/modules/text/semantic_model/rbt3/model/transformer_encoder.py +++ /dev/null @@ -1,288 +0,0 @@ -# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -"""Transformer encoder.""" - -from __future__ import absolute_import -from __future__ import division -from __future__ import print_function - -from functools import partial - -import paddle.fluid as fluid -import paddle.fluid.layers as layers - - -def multi_head_attention(queries, - keys, - values, - attn_bias, - d_key, - d_value, - d_model, - n_head=1, - dropout_rate=0., - cache=None, - param_initializer=None, - name='multi_head_att'): - """ - Multi-Head Attention. Note that attn_bias is added to the logit before - computing softmax activiation to mask certain selected positions so that - they will not considered in attention weights. 
- """ - keys = queries if keys is None else keys - values = keys if values is None else values - - if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): - raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") - - def __compute_qkv(queries, keys, values, n_head, d_key, d_value): - """ - Add linear projection to queries, keys, and values. - """ - q = layers.fc( - input=queries, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), - bias_attr=name + '_query_fc.b_0') - k = layers.fc( - input=keys, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), - bias_attr=name + '_key_fc.b_0') - v = layers.fc( - input=values, - size=d_value * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), - bias_attr=name + '_value_fc.b_0') - return q, k, v - - def __split_heads(x, n_head): - """ - Reshape the last dimension of inpunt tensor x so that it becomes two - dimensions and then transpose. Specifically, input a tensor with shape - [bs, max_sequence_length, n_head * hidden_dim] then output a tensor - with shape [bs, n_head, max_sequence_length, hidden_dim]. - """ - hidden_size = x.shape[-1] - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) - - # permuate the dimensions into: - # [batch_size, n_head, max_sequence_len, hidden_size_per_head] - return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) - - def __combine_heads(x): - """ - Transpose and then reshape the last two dimensions of inpunt tensor x - so that it becomes one dimension, which is reverse to __split_heads. - """ - if len(x.shape) == 3: return x - if len(x.shape) != 4: - raise ValueError("Input(x) should be a 4-D Tensor.") - - trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) - - def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): - """ - Scaled Dot-Product Attention - """ - scaled_q = layers.scale(x=q, scale=d_key**-0.5) - product = layers.matmul(x=scaled_q, y=k, transpose_y=True) - if attn_bias: - product += attn_bias - weights = layers.softmax(product) - if dropout_rate: - weights = layers.dropout( - weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.matmul(weights, v) - return out - - q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) - - if cache is not None: # use cache and concat time steps - # Since the inplace reshape in __split_heads changes the shape of k and - # v, which is the cache input for next time step, reshape the cache - # input from the previous time step first. 
- k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) - v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) - - q = __split_heads(q, n_head) - k = __split_heads(k, n_head) - v = __split_heads(v, n_head) - - ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) - - out = __combine_heads(ctx_multiheads) - - # Project back to the model size. - proj_out = layers.fc( - input=out, - size=d_model, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), - bias_attr=name + '_output_fc.b_0') - return proj_out - - -def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): - """ - Position-wise Feed-Forward Networks. - This module consists of two linear transformations with a ReLU activation - in between, which is applied to each position separately and identically. - """ - hidden = layers.fc( - input=x, - size=d_inner_hid, - num_flatten_dims=2, - act=hidden_act, - param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), - bias_attr=name + '_fc_0.b_0') - if dropout_rate: - hidden = layers.dropout( - hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.fc( - input=hidden, - size=d_hid, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), - bias_attr=name + '_fc_1.b_0') - return out - - -def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): - """ - Add residual connection, layer normalization and droput to the out tensor - optionally according to the value of process_cmd. - This will be used before or after multi-head attention and position-wise - feed-forward networks. - """ - for cmd in process_cmd: - if cmd == "a": # add residual connection - out = out + prev_out if prev_out else out - elif cmd == "n": # add layer normalization - out_dtype = out.dtype - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float32") - out = layers.layer_norm( - out, - begin_norm_axis=len(out.shape) - 1, - param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), - bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.))) - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float16") - elif cmd == "d": # add dropout - if dropout_rate: - out = layers.dropout( - out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - return out - - -pre_process_layer = partial(pre_post_process_layer, None) -post_process_layer = pre_post_process_layer - - -def encoder_layer(enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """The encoder layers that can be stacked to form a deep encoder. - This module consits of a multi-head (self) attention followed by - position-wise feed-forward networks and both the two components companied - with the post_process_layer to add residual connection, layer normalization - and droput. 
- """ - attn_output = multi_head_attention( - pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'), - None, - None, - attn_bias, - d_key, - d_value, - d_model, - n_head, - attention_dropout, - param_initializer=param_initializer, - name=name + '_multi_head_att') - attn_output = post_process_layer( - enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att') - ffd_output = positionwise_feed_forward( - pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'), - d_inner_hid, - d_model, - relu_dropout, - hidden_act, - param_initializer=param_initializer, - name=name + '_ffn') - return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') - - -def encoder(enc_input, - attn_bias, - n_layer, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """ - The encoder is composed of a stack of identical layers returned by calling - encoder_layer. - """ - for i in range(n_layer): - enc_output = encoder_layer( - enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd, - postprocess_cmd, - param_initializer=param_initializer, - name=name + '_layer_' + str(i)) - enc_input = enc_output - enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") - - return enc_output diff --git a/modules/text/semantic_model/rbtl3/model/transformer_encoder.py b/modules/text/semantic_model/rbtl3/model/transformer_encoder.py deleted file mode 100644 index 53051cde80308a17a30d9b92de11c712b63da406..0000000000000000000000000000000000000000 --- a/modules/text/semantic_model/rbtl3/model/transformer_encoder.py +++ /dev/null @@ -1,288 +0,0 @@ -# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -"""Transformer encoder.""" - -from __future__ import absolute_import -from __future__ import division -from __future__ import print_function - -from functools import partial - -import paddle.fluid as fluid -import paddle.fluid.layers as layers - - -def multi_head_attention(queries, - keys, - values, - attn_bias, - d_key, - d_value, - d_model, - n_head=1, - dropout_rate=0., - cache=None, - param_initializer=None, - name='multi_head_att'): - """ - Multi-Head Attention. Note that attn_bias is added to the logit before - computing softmax activiation to mask certain selected positions so that - they will not considered in attention weights. 
- """ - keys = queries if keys is None else keys - values = keys if values is None else values - - if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): - raise ValueError("Inputs: quries, keys and values should all be 3-D tensors.") - - def __compute_qkv(queries, keys, values, n_head, d_key, d_value): - """ - Add linear projection to queries, keys, and values. - """ - q = layers.fc( - input=queries, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer), - bias_attr=name + '_query_fc.b_0') - k = layers.fc( - input=keys, - size=d_key * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer), - bias_attr=name + '_key_fc.b_0') - v = layers.fc( - input=values, - size=d_value * n_head, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer), - bias_attr=name + '_value_fc.b_0') - return q, k, v - - def __split_heads(x, n_head): - """ - Reshape the last dimension of inpunt tensor x so that it becomes two - dimensions and then transpose. Specifically, input a tensor with shape - [bs, max_sequence_length, n_head * hidden_dim] then output a tensor - with shape [bs, n_head, max_sequence_length, hidden_dim]. - """ - hidden_size = x.shape[-1] - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) - - # permuate the dimensions into: - # [batch_size, n_head, max_sequence_len, hidden_size_per_head] - return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) - - def __combine_heads(x): - """ - Transpose and then reshape the last two dimensions of inpunt tensor x - so that it becomes one dimension, which is reverse to __split_heads. - """ - if len(x.shape) == 3: return x - if len(x.shape) != 4: - raise ValueError("Input(x) should be a 4-D Tensor.") - - trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) - # The value 0 in shape attr means copying the corresponding dimension - # size of the input as the output dimension size. - return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) - - def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): - """ - Scaled Dot-Product Attention - """ - scaled_q = layers.scale(x=q, scale=d_key**-0.5) - product = layers.matmul(x=scaled_q, y=k, transpose_y=True) - if attn_bias: - product += attn_bias - weights = layers.softmax(product) - if dropout_rate: - weights = layers.dropout( - weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.matmul(weights, v) - return out - - q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) - - if cache is not None: # use cache and concat time steps - # Since the inplace reshape in __split_heads changes the shape of k and - # v, which is the cache input for next time step, reshape the cache - # input from the previous time step first. 
- k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) - v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) - - q = __split_heads(q, n_head) - k = __split_heads(k, n_head) - v = __split_heads(v, n_head) - - ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) - - out = __combine_heads(ctx_multiheads) - - # Project back to the model size. - proj_out = layers.fc( - input=out, - size=d_model, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer), - bias_attr=name + '_output_fc.b_0') - return proj_out - - -def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): - """ - Position-wise Feed-Forward Networks. - This module consists of two linear transformations with a ReLU activation - in between, which is applied to each position separately and identically. - """ - hidden = layers.fc( - input=x, - size=d_inner_hid, - num_flatten_dims=2, - act=hidden_act, - param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer), - bias_attr=name + '_fc_0.b_0') - if dropout_rate: - hidden = layers.dropout( - hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - out = layers.fc( - input=hidden, - size=d_hid, - num_flatten_dims=2, - param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer), - bias_attr=name + '_fc_1.b_0') - return out - - -def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): - """ - Add residual connection, layer normalization and droput to the out tensor - optionally according to the value of process_cmd. - This will be used before or after multi-head attention and position-wise - feed-forward networks. - """ - for cmd in process_cmd: - if cmd == "a": # add residual connection - out = out + prev_out if prev_out else out - elif cmd == "n": # add layer normalization - out_dtype = out.dtype - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float32") - out = layers.layer_norm( - out, - begin_norm_axis=len(out.shape) - 1, - param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), - bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.))) - if out_dtype == fluid.core.VarDesc.VarType.FP16: - out = layers.cast(x=out, dtype="float16") - elif cmd == "d": # add dropout - if dropout_rate: - out = layers.dropout( - out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) - return out - - -pre_process_layer = partial(pre_post_process_layer, None) -post_process_layer = pre_post_process_layer - - -def encoder_layer(enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """The encoder layers that can be stacked to form a deep encoder. - This module consits of a multi-head (self) attention followed by - position-wise feed-forward networks and both the two components companied - with the post_process_layer to add residual connection, layer normalization - and droput. 
- """ - attn_output = multi_head_attention( - pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'), - None, - None, - attn_bias, - d_key, - d_value, - d_model, - n_head, - attention_dropout, - param_initializer=param_initializer, - name=name + '_multi_head_att') - attn_output = post_process_layer( - enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att') - ffd_output = positionwise_feed_forward( - pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'), - d_inner_hid, - d_model, - relu_dropout, - hidden_act, - param_initializer=param_initializer, - name=name + '_ffn') - return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') - - -def encoder(enc_input, - attn_bias, - n_layer, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd="n", - postprocess_cmd="da", - param_initializer=None, - name=''): - """ - The encoder is composed of a stack of identical layers returned by calling - encoder_layer. - """ - for i in range(n_layer): - enc_output = encoder_layer( - enc_input, - attn_bias, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - hidden_act, - preprocess_cmd, - postprocess_cmd, - param_initializer=param_initializer, - name=name + '_layer_' + str(i)) - enc_input = enc_output - enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") - - return enc_output diff --git a/modules/text/semantic_model/slda_novel/document.py b/modules/text/semantic_model/slda_novel/document.py deleted file mode 100644 index 4476230a5c9bc8d545b52386dbf00a201e59b468..0000000000000000000000000000000000000000 --- a/modules/text/semantic_model/slda_novel/document.py +++ /dev/null @@ -1,176 +0,0 @@ -import numpy as np - - -class Topic(object): - """Basic data structure of topic, contains topic id and - corresponding probability. - """ - - def __init__(self, tid, prob): - self.tid = tid # topic id - self.prob = prob # topic probability - - -class Token(object): - """Basic storage unit of LDA documents, contains word id - and corresponding topic. - """ - - def __init__(self, topic, id): - self.topic = topic - self.id = id - - -class Sentence(object): - """Basic storage unit of SentenceLDA documents, contains word ids - of the sentence and its corresponding topic id. - """ - - def __init__(self, topic, tokens): - self.topic = topic - self.tokens = tokens - - -class LDADoc(object): - """The storage structure of LDA model's inference result. - """ - - def __init__(self): - self._num_topics = None # Number of topics. - self._num_accum = None # Number of accumulated sample rounds. - self._alpha = None # Document prior parameter. - self._tokens = None # Storage structure of inference results. - self._topic_sum = None # Document's topic sum in one round samples. - self._accum_topic_sum = None # Accumulated results of topic sum. - - def init(self, num_topics): - """Initialize the LDADoc according to num_topics. - """ - self._num_topics = num_topics - self._num_accum = 0 - self._tokens = [] - self._topic_sum = np.zeros(self._num_topics) - self._accum_topic_sum = np.zeros(self._num_topics) - - def add_token(self, token): - """Add new word to current LDADoc. - Arg: - token: Token class object. - """ - assert token.topic >= 0, "Topic %d out of range!" 
% token.topic - assert token.topic < self._num_topics, "Topic %d out of range!" % token.topic - self._tokens.append(token) - self._topic_sum[token.topic] += 1 - - def token(self, index): - return self._tokens[index] - - def set_topic(self, index, new_topic): - """Set the index word's topic to new_topic, and update the corresponding - topic distribution. - """ - assert new_topic >= 0, "Topic %d out of range!" % new_topic - assert new_topic < self._num_topics, "Topic %d out of range!" % new_topic - old_topic = self._tokens[index].topic - if new_topic == old_topic: - return - self._tokens[index].topic = new_topic - self._topic_sum[old_topic] -= 1 - self._topic_sum[new_topic] += 1 - - def set_alpha(self, alpha): - self._alpha = alpha - - def size(self): - """Return number of words in LDADoc. - """ - return len(self._tokens) - - def topic_sum(self, topic_id): - return self._topic_sum[topic_id] - - def sparse_topic_dist(self, sort=True): - """Return the topic distribution of documents in sparse format. - By default, it is sorted according to the topic probability - under the descending order. - """ - topic_dist = [] - sum_ = np.sum(self._accum_topic_sum) - if sum_ == 0: - return topic_dist - for i in range(0, self._num_topics): - if self._accum_topic_sum[i] == 0: - continue - topic_dist.append(Topic(i, self._accum_topic_sum[i] * 1.0 / sum_)) - if sort: - - def take_elem(topic): - return topic.prob - - topic_dist.sort(key=take_elem, reverse=True) - if topic_dist is None: - topic_dist = [] - - return topic_dist - - def dense_topic_dist(self): - """Return the distribution of document topics in dense format, - taking into account the prior parameter alpha. - """ - dense_dist = np.zeros(self._num_topics) - if self.size() == 0: - return dense_dist - dense_dist = (self._accum_topic_sum * 1.0 / self._num_accum + self._alpha) / ( - self.size() + self._alpha * self._num_topics) - return dense_dist - - def accumulate_topic_num(self): - self._accum_topic_sum += self._topic_sum - self._num_accum += 1 - - -class SLDADoc(LDADoc): - """Sentence LDA Document, inherited from LDADoc. - Add add_sentence interface. - """ - - def __init__(self): - super().__init__() - self.__sentences = None - - def init(self, num_topics): - """Initialize the SLDADoc according to num_topics. - """ - self._num_topics = num_topics - self.__sentences = [] - self._num_accum = 0 - self._topic_sum = np.zeros(self._num_topics) - self._accum_topic_sum = np.zeros(self._num_topics) - - def add_sentence(self, sent): - """Add new sentence to current SLDADoc. - Arg: - sent: Sentence class object. - """ - assert sent.topic >= 0, "Topic %d out of range!" % (sent.topic) - assert sent.topic < self._num_topics, "Topic %d out of range!" % (sent.topic) - self.__sentences.append(sent) - self._topic_sum[sent.topic] += 1 - - def set_topic(self, index, new_topic): - assert new_topic >= 0, "Topic %d out of range!" % (new_topic) - assert new_topic < self._num_topics, "Topic %d out of range!" % (new_topic) - old_topic = self.__sentences[index].topic - if new_topic == old_topic: - return - self.__sentences[index].topic = new_topic - self._topic_sum[old_topic] -= 1 - self._topic_sum[new_topic] += 1 - - def size(self): - """Return number of sentences in SLDADoc. 
- """ - return len(self.__sentences) - - def sent(self, index): - return self.__sentences[index] diff --git a/modules/text/semantic_model/slda_webpage/document.py b/modules/text/semantic_model/slda_webpage/document.py deleted file mode 100644 index 4476230a5c9bc8d545b52386dbf00a201e59b468..0000000000000000000000000000000000000000 --- a/modules/text/semantic_model/slda_webpage/document.py +++ /dev/null @@ -1,176 +0,0 @@ -import numpy as np - - -class Topic(object): - """Basic data structure of topic, contains topic id and - corresponding probability. - """ - - def __init__(self, tid, prob): - self.tid = tid # topic id - self.prob = prob # topic probability - - -class Token(object): - """Basic storage unit of LDA documents, contains word id - and corresponding topic. - """ - - def __init__(self, topic, id): - self.topic = topic - self.id = id - - -class Sentence(object): - """Basic storage unit of SentenceLDA documents, contains word ids - of the sentence and its corresponding topic id. - """ - - def __init__(self, topic, tokens): - self.topic = topic - self.tokens = tokens - - -class LDADoc(object): - """The storage structure of LDA model's inference result. - """ - - def __init__(self): - self._num_topics = None # Number of topics. - self._num_accum = None # Number of accumulated sample rounds. - self._alpha = None # Document prior parameter. - self._tokens = None # Storage structure of inference results. - self._topic_sum = None # Document's topic sum in one round samples. - self._accum_topic_sum = None # Accumulated results of topic sum. - - def init(self, num_topics): - """Initialize the LDADoc according to num_topics. - """ - self._num_topics = num_topics - self._num_accum = 0 - self._tokens = [] - self._topic_sum = np.zeros(self._num_topics) - self._accum_topic_sum = np.zeros(self._num_topics) - - def add_token(self, token): - """Add new word to current LDADoc. - Arg: - token: Token class object. - """ - assert token.topic >= 0, "Topic %d out of range!" % token.topic - assert token.topic < self._num_topics, "Topic %d out of range!" % token.topic - self._tokens.append(token) - self._topic_sum[token.topic] += 1 - - def token(self, index): - return self._tokens[index] - - def set_topic(self, index, new_topic): - """Set the index word's topic to new_topic, and update the corresponding - topic distribution. - """ - assert new_topic >= 0, "Topic %d out of range!" % new_topic - assert new_topic < self._num_topics, "Topic %d out of range!" % new_topic - old_topic = self._tokens[index].topic - if new_topic == old_topic: - return - self._tokens[index].topic = new_topic - self._topic_sum[old_topic] -= 1 - self._topic_sum[new_topic] += 1 - - def set_alpha(self, alpha): - self._alpha = alpha - - def size(self): - """Return number of words in LDADoc. - """ - return len(self._tokens) - - def topic_sum(self, topic_id): - return self._topic_sum[topic_id] - - def sparse_topic_dist(self, sort=True): - """Return the topic distribution of documents in sparse format. - By default, it is sorted according to the topic probability - under the descending order. 
- """ - topic_dist = [] - sum_ = np.sum(self._accum_topic_sum) - if sum_ == 0: - return topic_dist - for i in range(0, self._num_topics): - if self._accum_topic_sum[i] == 0: - continue - topic_dist.append(Topic(i, self._accum_topic_sum[i] * 1.0 / sum_)) - if sort: - - def take_elem(topic): - return topic.prob - - topic_dist.sort(key=take_elem, reverse=True) - if topic_dist is None: - topic_dist = [] - - return topic_dist - - def dense_topic_dist(self): - """Return the distribution of document topics in dense format, - taking into account the prior parameter alpha. - """ - dense_dist = np.zeros(self._num_topics) - if self.size() == 0: - return dense_dist - dense_dist = (self._accum_topic_sum * 1.0 / self._num_accum + self._alpha) / ( - self.size() + self._alpha * self._num_topics) - return dense_dist - - def accumulate_topic_num(self): - self._accum_topic_sum += self._topic_sum - self._num_accum += 1 - - -class SLDADoc(LDADoc): - """Sentence LDA Document, inherited from LDADoc. - Add add_sentence interface. - """ - - def __init__(self): - super().__init__() - self.__sentences = None - - def init(self, num_topics): - """Initialize the SLDADoc according to num_topics. - """ - self._num_topics = num_topics - self.__sentences = [] - self._num_accum = 0 - self._topic_sum = np.zeros(self._num_topics) - self._accum_topic_sum = np.zeros(self._num_topics) - - def add_sentence(self, sent): - """Add new sentence to current SLDADoc. - Arg: - sent: Sentence class object. - """ - assert sent.topic >= 0, "Topic %d out of range!" % (sent.topic) - assert sent.topic < self._num_topics, "Topic %d out of range!" % (sent.topic) - self.__sentences.append(sent) - self._topic_sum[sent.topic] += 1 - - def set_topic(self, index, new_topic): - assert new_topic >= 0, "Topic %d out of range!" % (new_topic) - assert new_topic < self._num_topics, "Topic %d out of range!" % (new_topic) - old_topic = self.__sentences[index].topic - if new_topic == old_topic: - return - self.__sentences[index].topic = new_topic - self._topic_sum[old_topic] -= 1 - self._topic_sum[new_topic] += 1 - - def size(self): - """Return number of sentences in SLDADoc. - """ - return len(self.__sentences) - - def sent(self, index): - return self.__sentences[index] diff --git a/modules/text/semantic_model/slda_webpage/tokenizer.py b/modules/text/semantic_model/slda_webpage/tokenizer.py deleted file mode 100644 index 585aed885b63b0e2a2d450b77a6d018615c86b04..0000000000000000000000000000000000000000 --- a/modules/text/semantic_model/slda_webpage/tokenizer.py +++ /dev/null @@ -1,127 +0,0 @@ -import os - -import numpy as np -from paddlehub.common.logger import logger - - -class Tokenizer(object): - """Base tokenizer class. - """ - - def __init__(self): - pass - - def tokenize(self, text): - raise NotImplementedError - - -class SimpleTokenizer(Tokenizer): - """Simple version FMM(Forward Maximun Matching) word tokenizer. This tokenizer can only - be used in topic model demo, but not in real business application scenarios. - - Notes: This tokenizer can only recognize the words in the corresponding vocab file. - """ - - def __init__(self, vocab_path): - super().__init__() - self.__max_word_len = 0 - self.__vocab = set() - self.__load_vocab(vocab_path) - - def tokenize(self, text): - """Tokenize the input string `text`, and return the tokenize result. - """ - text_len = len(text) - result = [] - i = 0 - while i < text_len: - word = found_word = "" - # Deal with English characters. 
- if self.__is_eng_char(text[i]): - for j in range(i, text_len + 1): - if j < text_len and self.__is_eng_char(text[j]): - word += self.__tolower(text[j]) - else: - # Forward matching by character granularity. - if word in self.__vocab: - result.append(word) - i = j - 1 - break - else: - for j in range(i, min(i + self.__max_word_len, text_len)): - word += text[j] - if word in self.__vocab: - found_word = word - if len(found_word) > 0: - result.append(found_word) - i += len(found_word) - 1 - i += 1 - return result - - def contains(self, word): - """Check whether the word is in the vocabulary. - """ - return word in self.__vocab - - def __load_vocab(self, vocab_path): - """Load the word dictionary. - """ - with open(vocab_path, 'r', encoding='utf-8') as fin: - vocab_size = 0 - for line in fin.readlines(): - fields = line.strip().split('\t') - assert len(fields) >= 2 - word = fields[1] - self.__max_word_len = max(self.__max_word_len, len(word)) - self.__vocab.add(word) - vocab_size += 1 - - def __is_eng_char(self, c): - """Check whether char c is an English character. - """ - return (c >= 'A' and c <= 'Z') or (c >= 'a' and c <= 'z') - - def __tolower(self, c): - """Return the lowercase character of the corresponding character, or return - the original character if there is no corresponding lowercase character. - """ - return c.lower() - - -class LACTokenizer(Tokenizer): - def __init__(self, vocab_path, lac): - super().__init__() - self.__max_word_len = 0 - self.__vocab = set() - self.__lac = lac - self.__load_vocab(vocab_path) - - def __load_vocab(self, vocab_path): - """Load the word dictionary. - """ - with open(vocab_path, 'r', encoding='utf-8') as fin: - vocab_size = 0 - for line in fin.readlines(): - fields = line.strip().split('\t') - assert len(fields) >= 2 - word = fields[1] - self.__max_word_len = max(self.__max_word_len, len(word)) - self.__vocab.add(word) - vocab_size += 1 - - def tokenize(self, text): - results = self.__lac.lexical_analysis(texts=[text], use_gpu=False, batch_size=1, return_tag=True) - # Change English words to lower case. - # And just preserve the word in vocab. - words = results[0]["word"] - result = [] - for word in words: - word = word.lower() - if word in self.__vocab: - result.append(word) - return result - - def contains(self, word): - """Check whether the word is in the vocabulary. - """ - return word in self.__vocab diff --git a/modules/text/semantic_model/slda_weibo/document.py b/modules/text/semantic_model/slda_weibo/document.py deleted file mode 100644 index 4476230a5c9bc8d545b52386dbf00a201e59b468..0000000000000000000000000000000000000000 --- a/modules/text/semantic_model/slda_weibo/document.py +++ /dev/null @@ -1,176 +0,0 @@ -import numpy as np - - -class Topic(object): - """Basic data structure of topic, contains topic id and - corresponding probability. - """ - - def __init__(self, tid, prob): - self.tid = tid # topic id - self.prob = prob # topic probability - - -class Token(object): - """Basic storage unit of LDA documents, contains word id - and corresponding topic. - """ - - def __init__(self, topic, id): - self.topic = topic - self.id = id - - -class Sentence(object): - """Basic storage unit of SentenceLDA documents, contains word ids - of the sentence and its corresponding topic id. - """ - - def __init__(self, topic, tokens): - self.topic = topic - self.tokens = tokens - - -class LDADoc(object): - """The storage structure of LDA model's inference result. - """ - - def __init__(self): - self._num_topics = None # Number of topics. 
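Both tokenizer deletions (`slda_webpage` above and the identical `slda_weibo` copy below) ship a forward-maximum-matching `SimpleTokenizer` plus a `LACTokenizer` that delegates segmentation to a `lac` lexical-analysis module before filtering against the same vocabulary. Below is a minimal, illustrative sketch of the `SimpleTokenizer` path; the vocab file name and its entries are made up, but the tab-separated two-column layout (word in the second column) matches what `__load_vocab` expects.

```python
# Illustrative only: builds a toy vocab in the format read by __load_vocab
# (the word is taken from the second tab-separated column), then tokenizes with FMM.
from tokenizer import SimpleTokenizer  # import path is illustrative

vocab_path = "toy_vocab.txt"            # made-up path
with open(vocab_path, "w", encoding="utf-8") as f:
    for idx, word in enumerate(["百度", "飞桨", "paddle"], start=1):
        f.write(f"{idx}\t{word}\n")

tok = SimpleTokenizer(vocab_path)
print(tok.tokenize("百度飞桨Paddle"))    # ['百度', '飞桨', 'paddle'] (English runs are lower-cased)
print(tok.contains("飞桨"))              # True
print(tok.contains("深度学习"))           # False, only words in the vocab are recognized
```

`LACTokenizer` is used the same way, except its constructor also takes a `lac` module (for example one loaded with `hub.Module(name="lac")`) whose `lexical_analysis` output is then filtered against the vocabulary.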
- self._num_accum = None # Number of accumulated sample rounds. - self._alpha = None # Document prior parameter. - self._tokens = None # Storage structure of inference results. - self._topic_sum = None # Document's topic sum in one round samples. - self._accum_topic_sum = None # Accumulated results of topic sum. - - def init(self, num_topics): - """Initialize the LDADoc according to num_topics. - """ - self._num_topics = num_topics - self._num_accum = 0 - self._tokens = [] - self._topic_sum = np.zeros(self._num_topics) - self._accum_topic_sum = np.zeros(self._num_topics) - - def add_token(self, token): - """Add new word to current LDADoc. - Arg: - token: Token class object. - """ - assert token.topic >= 0, "Topic %d out of range!" % token.topic - assert token.topic < self._num_topics, "Topic %d out of range!" % token.topic - self._tokens.append(token) - self._topic_sum[token.topic] += 1 - - def token(self, index): - return self._tokens[index] - - def set_topic(self, index, new_topic): - """Set the index word's topic to new_topic, and update the corresponding - topic distribution. - """ - assert new_topic >= 0, "Topic %d out of range!" % new_topic - assert new_topic < self._num_topics, "Topic %d out of range!" % new_topic - old_topic = self._tokens[index].topic - if new_topic == old_topic: - return - self._tokens[index].topic = new_topic - self._topic_sum[old_topic] -= 1 - self._topic_sum[new_topic] += 1 - - def set_alpha(self, alpha): - self._alpha = alpha - - def size(self): - """Return number of words in LDADoc. - """ - return len(self._tokens) - - def topic_sum(self, topic_id): - return self._topic_sum[topic_id] - - def sparse_topic_dist(self, sort=True): - """Return the topic distribution of documents in sparse format. - By default, it is sorted according to the topic probability - under the descending order. - """ - topic_dist = [] - sum_ = np.sum(self._accum_topic_sum) - if sum_ == 0: - return topic_dist - for i in range(0, self._num_topics): - if self._accum_topic_sum[i] == 0: - continue - topic_dist.append(Topic(i, self._accum_topic_sum[i] * 1.0 / sum_)) - if sort: - - def take_elem(topic): - return topic.prob - - topic_dist.sort(key=take_elem, reverse=True) - if topic_dist is None: - topic_dist = [] - - return topic_dist - - def dense_topic_dist(self): - """Return the distribution of document topics in dense format, - taking into account the prior parameter alpha. - """ - dense_dist = np.zeros(self._num_topics) - if self.size() == 0: - return dense_dist - dense_dist = (self._accum_topic_sum * 1.0 / self._num_accum + self._alpha) / ( - self.size() + self._alpha * self._num_topics) - return dense_dist - - def accumulate_topic_num(self): - self._accum_topic_sum += self._topic_sum - self._num_accum += 1 - - -class SLDADoc(LDADoc): - """Sentence LDA Document, inherited from LDADoc. - Add add_sentence interface. - """ - - def __init__(self): - super().__init__() - self.__sentences = None - - def init(self, num_topics): - """Initialize the SLDADoc according to num_topics. - """ - self._num_topics = num_topics - self.__sentences = [] - self._num_accum = 0 - self._topic_sum = np.zeros(self._num_topics) - self._accum_topic_sum = np.zeros(self._num_topics) - - def add_sentence(self, sent): - """Add new sentence to current SLDADoc. - Arg: - sent: Sentence class object. - """ - assert sent.topic >= 0, "Topic %d out of range!" % (sent.topic) - assert sent.topic < self._num_topics, "Topic %d out of range!" 
% (sent.topic) - self.__sentences.append(sent) - self._topic_sum[sent.topic] += 1 - - def set_topic(self, index, new_topic): - assert new_topic >= 0, "Topic %d out of range!" % (new_topic) - assert new_topic < self._num_topics, "Topic %d out of range!" % (new_topic) - old_topic = self.__sentences[index].topic - if new_topic == old_topic: - return - self.__sentences[index].topic = new_topic - self._topic_sum[old_topic] -= 1 - self._topic_sum[new_topic] += 1 - - def size(self): - """Return number of sentences in SLDADoc. - """ - return len(self.__sentences) - - def sent(self, index): - return self.__sentences[index] diff --git a/modules/text/semantic_model/slda_weibo/tokenizer.py b/modules/text/semantic_model/slda_weibo/tokenizer.py deleted file mode 100644 index 585aed885b63b0e2a2d450b77a6d018615c86b04..0000000000000000000000000000000000000000 --- a/modules/text/semantic_model/slda_weibo/tokenizer.py +++ /dev/null @@ -1,127 +0,0 @@ -import os - -import numpy as np -from paddlehub.common.logger import logger - - -class Tokenizer(object): - """Base tokenizer class. - """ - - def __init__(self): - pass - - def tokenize(self, text): - raise NotImplementedError - - -class SimpleTokenizer(Tokenizer): - """Simple version FMM(Forward Maximun Matching) word tokenizer. This tokenizer can only - be used in topic model demo, but not in real business application scenarios. - - Notes: This tokenizer can only recognize the words in the corresponding vocab file. - """ - - def __init__(self, vocab_path): - super().__init__() - self.__max_word_len = 0 - self.__vocab = set() - self.__load_vocab(vocab_path) - - def tokenize(self, text): - """Tokenize the input string `text`, and return the tokenize result. - """ - text_len = len(text) - result = [] - i = 0 - while i < text_len: - word = found_word = "" - # Deal with English characters. - if self.__is_eng_char(text[i]): - for j in range(i, text_len + 1): - if j < text_len and self.__is_eng_char(text[j]): - word += self.__tolower(text[j]) - else: - # Forward matching by character granularity. - if word in self.__vocab: - result.append(word) - i = j - 1 - break - else: - for j in range(i, min(i + self.__max_word_len, text_len)): - word += text[j] - if word in self.__vocab: - found_word = word - if len(found_word) > 0: - result.append(found_word) - i += len(found_word) - 1 - i += 1 - return result - - def contains(self, word): - """Check whether the word is in the vocabulary. - """ - return word in self.__vocab - - def __load_vocab(self, vocab_path): - """Load the word dictionary. - """ - with open(vocab_path, 'r', encoding='utf-8') as fin: - vocab_size = 0 - for line in fin.readlines(): - fields = line.strip().split('\t') - assert len(fields) >= 2 - word = fields[1] - self.__max_word_len = max(self.__max_word_len, len(word)) - self.__vocab.add(word) - vocab_size += 1 - - def __is_eng_char(self, c): - """Check whether char c is an English character. - """ - return (c >= 'A' and c <= 'Z') or (c >= 'a' and c <= 'z') - - def __tolower(self, c): - """Return the lowercase character of the corresponding character, or return - the original character if there is no corresponding lowercase character. - """ - return c.lower() - - -class LACTokenizer(Tokenizer): - def __init__(self, vocab_path, lac): - super().__init__() - self.__max_word_len = 0 - self.__vocab = set() - self.__lac = lac - self.__load_vocab(vocab_path) - - def __load_vocab(self, vocab_path): - """Load the word dictionary. 
- """ - with open(vocab_path, 'r', encoding='utf-8') as fin: - vocab_size = 0 - for line in fin.readlines(): - fields = line.strip().split('\t') - assert len(fields) >= 2 - word = fields[1] - self.__max_word_len = max(self.__max_word_len, len(word)) - self.__vocab.add(word) - vocab_size += 1 - - def tokenize(self, text): - results = self.__lac.lexical_analysis(texts=[text], use_gpu=False, batch_size=1, return_tag=True) - # Change English words to lower case. - # And just preserve the word in vocab. - words = results[0]["word"] - result = [] - for word in words: - word = word.lower() - if word in self.__vocab: - result.append(word) - return result - - def contains(self, word): - """Check whether the word is in the vocabulary. - """ - return word in self.__vocab diff --git a/modules/text/sentiment_analysis/README.md b/modules/text/sentiment_analysis/README.md index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..8109e7d59a55d1e75e854311bc69eebe10f84aea 100644 --- a/modules/text/sentiment_analysis/README.md +++ b/modules/text/sentiment_analysis/README.md @@ -0,0 +1,12 @@ +## **更好用户体验,建议参考WEB端官方文档 -> [【情感分析】](https://www.paddlepaddle.org.cn/hublist)** + +### 情感分析 +情感倾向分析(Sentiment Classification,简称Senta)针对带有主观描述的中文文本,可自动判断该文本的情感极性类别并给出相应的置信度,能够帮助企业理解用户消费习惯、分析热点话题和危机舆情监控,为企业提供有利的决策支持。 + +- 推荐模型 + +| 模型名称 | 模型简介 | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| [情感分析-LSTM](https://www.paddlepaddle.org.cn/hubdetail?name=senta_lstm&en_category=SentimentAnalysis) |情感倾向分析LSTM实现 | +| [情感分析-GRU](https://www.paddlepaddle.org.cn/hubdetail?name=senta_gru&en_category=SentimentAnalysis) |情感倾向分析GRU实现 | +| [对话情绪识别](https://www.paddlepaddle.org.cn/hubdetail?name=emotion_detection_textcnn&en_category=SentimentAnalysis) |针对智能对话场景中的用户文本,自动判断该文本的情绪类别并给出相应的置信度,情绪类型分为积极、消极、中性。 | diff --git a/modules/text/syntactic_analysis/README.md b/modules/text/syntactic_analysis/README.md index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..608262d1d8e8e3fae44523d6a2933edf67989f12 100644 --- a/modules/text/syntactic_analysis/README.md +++ b/modules/text/syntactic_analysis/README.md @@ -0,0 +1,9 @@ +## **更好用户体验,建议参考WEB端官方文档 -> [【句法分析】](https://www.paddlepaddle.org.cn/hublist)** + +### 句法分析 + +- 推荐模型 + +| 模型名称 | 模型简介 | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| [句法分析-DDParser](https://www.paddlepaddle.org.cn/hubdetail?name=ddparser&en_category=SyntacticAnalysis) | DDParser(Baidu Dependency Parser)是百度NLP基于大规模标注数据和深度学习平台飞桨研发的中文依存句法分析工具,可帮助用户直接获取输入文本中的关联词对、长距离依赖词对等。 | diff --git a/modules/text/text_generation/README.md b/modules/text/text_generation/README.md new file mode 100644 index 0000000000000000000000000000000000000000..9443d555208e4ef1a223880bc79ad2f8e83f0904 --- /dev/null +++ b/modules/text/text_generation/README.md @@ -0,0 +1,14 @@ +## **更好用户体验,建议参考WEB端官方文档 -> [【文本生成】](https://www.paddlepaddle.org.cn/hublist)** + +### 文本生成 + + +- 推荐模型 + +| 模型名称 | 模型简介 | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| [AI写诗](https://www.paddlepaddle.org.cn/hubdetail?name=ernie_gen_poetry&en_category=TextGeneration) |根据输入上句内容,自动生成后面的多句诗歌 | +| [AI情话](https://www.paddlepaddle.org.cn/hubdetail?name=ernie_gen_lover_words&en_category=TextGeneration) |根据输入上句内容,自动生成后面的多句情话。 | +| [生成式对话](https://www.paddlepaddle.org.cn/hubdetail?name=plato2_en_large&en_category=TextGeneration) |交互式多轮问答,目前仅支持英文 | +| 
[看图写诗](https://www.paddlepaddle.org.cn/hubdetail?name=reading_pictures_writing_poems&en_category=TextGeneration)|该模型可自动根据图像生成古诗词| +. diff --git a/modules/text/text_review/README.md b/modules/text/text_review/README.md new file mode 100644 index 0000000000000000000000000000000000000000..f4514db0e2dc978472b66e9fa2a2076a97e5b16f --- /dev/null +++ b/modules/text/text_review/README.md @@ -0,0 +1,11 @@ +## **更好用户体验,建议参考WEB端官方文档 -> [【文本审核】](https://www.paddlepaddle.org.cn/hublist)** + +### 文本审核 + +- 推荐模型 + +| 模型名称 | 模型简介 | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| [色情识别-LSTM](https://www.paddlepaddle.org.cn/hubdetail?name=porn_detection_lstm&en_category=TextCensorship) | 色情检测模型可自动判别文本是否涉黄并给出相应的置信度,对文本中的色情描述、低俗交友、污秽文爱进行识别,核心采用LSTM网络结构并按字粒度进行切词。 | +| [色情识别-GRU](https://www.paddlepaddle.org.cn/hubdetail?name=porn_detection_gru&en_category=TextCensorship) | 色情检测模型可自动判别文本是否涉黄并给出相应的置信度,对文本中的色情描述、低俗交友、污秽文爱进行识别,核心采用GRU网络结构并按字粒度进行切词。 | +| [色情识别-CNN](https://www.paddlepaddle.org.cn/hubdetail?name=porn_detection_cnn&en_category=TextCensorship) | 色情检测模型可自动判别文本是否涉黄并给出相应的置信度,对文本中的色情描述、低俗交友、污秽文爱进行识别,核心采用CNN网络结构并按字粒度进行切词。| diff --git a/modules/video/README.md b/modules/video/README.md index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..4787d0a46c85d97b60a54a885e2052def11069ed 100644 --- a/modules/video/README.md +++ b/modules/video/README.md @@ -0,0 +1,9 @@ +## **更好用户体验,建议参考WEB端官方文档 -> [【视频分类】](https://www.paddlepaddle.org.cn/hublist)** + +### 视频分类 +视频数据包含语音、图像等多种信息,因此理解视频任务不仅需要处理语音和图像,还需要提取视频帧时间序列中的上下文信息,视频分类模型适用于各类短视频快速打标签等场景。 +- 推荐模型 + +| 模型名称 | 模型简介 | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| [视频分类](https://www.paddlepaddle.org.cn/hubdetail?name=videotag_tsn_lstm&en_category=VideoClassification) | 基于千万短视频预训练的视频分类模型,支持超过3000个短视频标签,在实际业务中取得89.9%的Top-1精度,同时具有良好的泛化能力,