add sequence classification with ernie/bert/roberta finetuned in dygraph

ca8541bd · Steffy-zxf · 060c13ec
106 changed files
--- a/demo/text_classification/README.md
+++ b/demo/text_classification/README.md
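+# Fine-tuning PaddleHub Transformer Models for Text Classification (Dynamic Graph)
+
+This example shows how to fine-tune a PaddleHub Transformer module (such as ERNIE, BERT, or RoBERTa) in dynamic graph mode and then run prediction with it.
+
+## Getting Started with Fine-tuning
+
+
+Using the public Chinese sentiment classification dataset ChnSentiCorp as an example, the commands below train the model on the training set (train.tsv) and validate it on the development set (dev.tsv).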
+
+```shell
+# Select the GPU card to use
+export CUDA_VISIBLE_DEVICES=0
+python train.py
+```
+
+
+## Code Walkthrough
+
+Fine-tuning with the PaddleHub Fine-tune API can be divided into 4 steps.
+
+### Step1: Choose a model
+```python
+import paddlehub as hub
+
+model = hub.Module(name='ernie_tiny', version='2.0.0', task='sequence_classification')
+```
+
+The parameters are:
+
+* `name`: the model name; options include `ernie`, `ernie_tiny`, `bert-base-chinese`, `chinese-roberta-wwm-ext`, `chinese-roberta-wwm-ext-large`, and others.
+* `version`: the module version number.
+* `task`: the fine-tuning task. Here it is `sequence_classification`, meaning text classification.
+
+### Step2: Download and load the dataset
+
+```python
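+train_dataset = hub.datasets.ChnSentiCorp(
+    tokenizer=model.get_tokenizer(tokenize_chinese_chars=True), max_seq_len=128, mode='train')
+dev_dataset = hub.datasets.ChnSentiCorp(
+    tokenizer=model.get_tokenizer(tokenize_chinese_chars=True), max_seq_len=128, mode='dev')
+# The test split is used by trainer.evaluate in Step3.
+test_dataset = hub.datasets.ChnSentiCorp(
+    tokenizer=model.get_tokenizer(tokenize_chinese_chars=True), max_seq_len=128, mode='test')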
+```
+
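+* `tokenizer`: the tokenizer required by the module; it tokenizes the input text and converts it into the input format the module expects.
+* `mode`: the dataset split to load; options are `train`, `dev`, and `test`; defaults to `train`.
+* `max_seq_len`: the maximum sequence length used by the ERNIE/BERT model; lower it if you run out of GPU memory.
+
+### Step3: Choose an optimization strategy and run configuration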
+
+```python
+import paddle
+
+optimizer = paddle.optimizer.Adam(learning_rate=5e-5, parameters=model.parameters())
+trainer = hub.Trainer(model, optimizer, checkpoint_dir='test_ernie_text_cls')
+
+trainer.train(train_dataset, epochs=3, batch_size=32, eval_dataset=dev_dataset)
+
+# Evaluate the trained model on the test set
+trainer.evaluate(test_dataset, batch_size=32)
+```
+
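+#### Optimization strategy
+
+Paddle 2.0-rc provides a range of optimizers, such as `SGD`, `Adam`, and `Adamax`; see the [optimizer documentation](https://www.paddlepaddle.org.cn/documentation/docs/zh/2.0-rc/api/paddle/optimizer/optimizer/Optimizer_cn.html) for details.
+
+For `Adam`:
+
+* `learning_rate`: the global learning rate; defaults to 1e-3;
+* `parameters`: the model parameters to optimize.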
+
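+#### Run configuration
+
+`Trainer` controls the fine-tuning process and accepts the following parameters:
+
+* `model`: the model to optimize;
+* `optimizer`: the optimizer to use;
+* `use_vdl`: whether to visualize the training process with VisualDL (vdl);
+* `checkpoint_dir`: the directory where model parameters are saved;
+* `compare_metrics`: the metric used to decide which model is saved as the best one;
+
+`trainer.train` controls the training run itself and accepts the following parameters:
+
+* `train_dataset`: the dataset used for training;
+* `epochs`: the number of training epochs;
+* `batch_size`: the training batch size; when using a GPU, adjust batch_size to fit your hardware;
+* `num_workers`: the number of workers; defaults to 0;
+* `eval_dataset`: the validation dataset;
+* `log_interval`: the logging interval, measured in training batches.
+* `save_interval`: the model-saving interval, measured in training epochs.
+
+## Model Prediction
+
+After fine-tuning completes, the model that performed best on the validation set is saved under `${CHECKPOINT_DIR}/best_model`, where `${CHECKPOINT_DIR}` is the checkpoint directory chosen for fine-tuning.
+
+We use the following data as input and run prediction with that model: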
+
+```text
+这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般
+怀着十分激动的心情放映，可是看着看着发现，在放映完毕后，出现一集米老鼠的动画片
+作为老的四星酒店，房间依然很整洁，相当不错。机场接机服务很好，可以在车上办理入住手续，节省时间。
+```
+
+```python
+import paddlehub as hub
+
+data = [
+    ['这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般'],
+    ['怀着十分激动的心情放映，可是看着看着发现，在放映完毕后，出现一集米老鼠的动画片'],
+    ['作为老的四星酒店，房间依然很整洁，相当不错。机场接机服务很好，可以在车上办理入住手续，节省时间。'],
+]
+label_map = {0: 'negative', 1: 'positive'}
+
+model = hub.Module(
+    name='ernie_tiny',
+    version='2.0.0',
+    task='sequence_classification',
+    load_checkpoint='./test_ernie_text_cls/best_model/model.pdparams',
+    label_map=label_map)
+results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False)
+for idx, text in enumerate(data):
+    print('Data: {} \t Label: {}'.format(text[0], results[idx]))
+```
+
+Once the parameters are configured correctly, run the script with `python predict.py`. For details on loading a model, see [load](https://www.paddlepaddle.org.cn/documentation/docs/zh/2.0-rc/api/paddle/framework/io/load_cn.html#load).
+
+### Dependencies
+
+paddlepaddle >= 2.0.0rc
+
+paddlehub >= 2.0.0
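A note on the checkpoint path used in the prediction step of the README above: the BERT modules added in this commit load parameters only when `load_checkpoint` points to an existing file, and otherwise continue without loading anything. Below is a minimal sketch of a guard, assuming the `best_model/model.pdparams` layout shown in the prediction example. `best_model_params_path` is a hypothetical helper, not part of PaddleHub.

```python
import os
import tempfile


def best_model_params_path(checkpoint_dir, must_exist=True):
    """Build <checkpoint_dir>/best_model/model.pdparams (hypothetical helper).

    The file name follows the README's prediction example; it is an
    assumption here, not a documented PaddleHub guarantee.
    """
    path = os.path.join(checkpoint_dir, 'best_model', 'model.pdparams')
    if must_exist and not os.path.isfile(path):
        raise FileNotFoundError('No parameter file found at %s' % path)
    return path


# Demonstrate with a throwaway directory standing in for a real checkpoint dir.
with tempfile.TemporaryDirectory() as checkpoint_dir:
    os.makedirs(os.path.join(checkpoint_dir, 'best_model'))
    open(os.path.join(checkpoint_dir, 'best_model', 'model.pdparams'), 'w').close()
    print(best_model_params_path(checkpoint_dir))
```

Passing the returned path as `load_checkpoint` would turn a mistyped directory into an immediate error rather than a prediction run on weights that were never fine-tuned.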
--- a/modules/text/language_model/ernie/model/__init__.py
+++ b/modules/text/language_model/ernie/model/__init__.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -11,3 +11,23 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+import paddlehub as hub
+
+if __name__ == '__main__':
+
+    data = [
+        ['这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般'],
+        ['怀着十分激动的心情放映，可是看着看着发现，在放映完毕后，出现一集米老鼠的动画片'],
+        ['作为老的四星酒店，房间依然很整洁，相当不错。机场接机服务很好，可以在车上办理入住手续，节省时间。'],
+    ]
+    label_map = {0: 'negative', 1: 'positive'}
+
+    model = hub.Module(
+        name='ernie_tiny',
+        version='2.0.0',
+        task='sequence_classification',
+        load_checkpoint='./test_ernie_text_cls/best_model/model.pdparams',
+        label_map=label_map)
+    results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False)
+    for idx, text in enumerate(data):
+        print('Data: {} \t Label: {}'.format(text[0], results[idx]))
--- a/modules/text/language_model/ernie_tiny/model/__init__.py
+++ b/modules/text/language_model/ernie_tiny/model/__init__.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -11,3 +11,21 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+import paddle
+import paddlehub as hub
+
+if __name__ == '__main__':
+    model = hub.Module(name='ernie_tiny', version='2.0.0', task='sequence_classification')
+
+    train_dataset = hub.datasets.ChnSentiCorp(
+        tokenizer=model.get_tokenizer(tokenize_chinese_chars=True), max_seq_len=128, mode='train')
+    dev_dataset = hub.datasets.ChnSentiCorp(
+        tokenizer=model.get_tokenizer(tokenize_chinese_chars=True), max_seq_len=128, mode='dev')
+    test_dataset = hub.datasets.ChnSentiCorp(
+        tokenizer=model.get_tokenizer(tokenize_chinese_chars=True), max_seq_len=128, mode='test')
+
+    optimizer = paddle.optimizer.AdamW(learning_rate=5e-5, parameters=model.parameters())
+    trainer = hub.Trainer(model, optimizer, checkpoint_dir='test_ernie_text_cls', use_gpu=True)
+
+    trainer.train(train_dataset, epochs=3, batch_size=32, eval_dataset=dev_dataset, save_interval=1)
+    trainer.evaluate(test_dataset, batch_size=32)
--- a/modules/text/language_model/bert-base-cased/README.md
+++ b/modules/text/language_model/bert-base-cased/README.md
+```shell
+$ hub install bert-base-cased==2.0.0
+```
+
+<p align="center">
+<img src="https://bj.bcebos.com/paddlehub/paddlehub-img/bert_network.png"  hspace='10'/> <br />
+</p>
+
+For more details, see the [BERT paper](https://arxiv.org/abs/1810.04805)
+
+## API
+
+```python
+def __init__(
+    task=None,
+    load_checkpoint=None,
+    label_map=None)
+```
+
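+Creates the Module object (dynamic graph version).
+
+**Parameters**
+
+* `task`: the task name; may be `sequence_classification`.
+* `load_checkpoint`: path to a model parameter file saved by training with the PaddleHub Fine-tune API.
+* `label_map`: the mapping from class index to label used at prediction time.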
+
+```python
+def predict(
+    data,
+    max_seq_len=128,
+    batch_size=1,
+    use_gpu=False)
+```
+
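+**Parameters**
+
+* `data`: the data to predict, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample
+    and each sample may contain text\_a and text\_b. The number of texts per sample (1 or 2) must match what was used in training.
+* `max_seq_len`: the maximum text length the model processes.
+* `batch_size`: the batch size used by the model.
+* `use_gpu`: whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.
+
+**Returns**
+
+* `results`: a list of predicted labels, one per sample, looked up in `label_map`.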
+
+```python
+def get_embedding(
+    texts,
+    use_gpu=False
+)
+```
+
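+Returns sentence-level and token-level features for the input texts.
+
+**Parameters**
+
+* `texts`: the list of input texts, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b.
+* `use_gpu`: whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.
+
+**Returns**
+
+* `results`: a list in the format \[\[sample\_a\_seq\_feature, sample\_a\_pooled\_feature\], \[sample\_b\_seq\_feature, sample\_b\_pooled\_feature\],…,\], where each element is the feature output for the corresponding sample: the token-level feature seq\_feature followed by the sentence-level feature pooled\_feature, in the order the module code returns them.
+
+
+**Code example**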
+
+```python
+import paddlehub as hub
+
+# Each sample is a list holding one text (or two, for sentence pairs).
+data = [
+    ['这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般'],
+    ['怀着十分激动的心情放映，可是看着看着发现，在放映完毕后，出现一集米老鼠的动画片'],
+    ['作为老的四星酒店，房间依然很整洁，相当不错。机场接机服务很好，可以在车上办理入住手续，节省时间。'],
+]
+label_map = {0: 'negative', 1: 'positive'}
+
+model = hub.Module(
+    name='bert-base-cased',
+    version='2.0.0',
+    task='sequence_classification',
+    load_checkpoint='/path/to/parameters',
+    label_map=label_map)
+results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False)
+for idx, text in enumerate(data):
+    print('Data: {} \t Label: {}'.format(text[0], results[idx]))
+```
+
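+See the PaddleHub text classification example: https://github.com/PaddlePaddle/PaddleHub/tree/release/v2.0.0-beta/demo/text_classification
+
+## Service Deployment
+
+PaddleHub Serving can deploy an online service for obtaining pretrained word embeddings.
+
+### Step1: Start PaddleHub Serving
+
+Run the start command: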
+
+```shell
+$ hub serving start -m bert-base-cased
+```
+
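+This deploys a service API for obtaining pretrained word embeddings; the default port is 8866.
+
+**NOTE:** To predict on a GPU, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise no setting is needed.
+
+### Step2: Send a prediction request
+
+With the server configured, the few lines of code below send a prediction request and retrieve the result.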
+
+```python
+import requests
+import json
+
+# Specify the texts to predict and build the dict {"texts": [text_1, text_2, ... ]}
+text = [["今天是个好日子", "天气预报说今天要下雨"], ["这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般"]]
+# The key names the prediction-method parameter that receives the texts; here it is "texts"
+# The equivalent local call is module.get_embedding(texts=text)
+data = {"texts": text}
+# Send a POST request to the serving host (replace the address with your own server)
+url = "http://10.12.121.132:8866/predict/bert-base-cased"
+# Set the POST request headers to application/json
+headers = {"Content-Type": "application/json"}
+
+r = requests.post(url=url, headers=headers, data=json.dumps(data))
+print(r.json())
+```
+
+## Source Code
+
+https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/pretrain_langauge_models/BERT
+
+
+## Dependencies
+
+paddlepaddle >= 2.0.0
+
+paddlehub >= 2.0.0
+
+## Release History
+
+* 1.0.0
+
+  Initial release
+
+* 1.1.0
+
+  Added support for get_embedding and get_params_layer
+
+* 2.0.0
+
+  Fully upgraded to dynamic graph; some interfaces have changed.
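The serving request in the README above sends a JSON body of the form `{"texts": [...]}`, where each sample holds one or two texts. The sketch below validates that shape before serializing, so malformed input fails on the client side. `build_embedding_payload` is a hypothetical helper written for illustration, not part of PaddleHub, and it makes no network call.

```python
import json


def build_embedding_payload(samples):
    """Validate samples and serialize them as {"texts": [...]} (hypothetical helper)."""
    texts = []
    for sample in samples:
        # Each sample must be a list/tuple of one or two strings,
        # mirroring the length check done in get_embedding.
        if not isinstance(sample, (list, tuple)) or len(sample) not in (1, 2):
            raise ValueError('Each sample must hold one or two texts, got %r' % (sample,))
        if not all(isinstance(text, str) for text in sample):
            raise ValueError('Every text must be a str, got %r' % (sample,))
        texts.append(list(sample))
    return json.dumps({"texts": texts})


payload = build_embedding_payload([["今天是个好日子", "天气预报说今天要下雨"], ["这个宾馆比较陈旧了"]])
print(payload)
```

The resulting string could be passed as the `data` argument of `requests.post`, as in the README's request example.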
--- a/modules/text/language_model/bert_cased_L_12_H_768_A_12/__init__.py
+++ b/modules/text/language_model/bert_cased_L_12_H_768_A_12/__init__.py
--- a/modules/text/language_model/bert-base-cased/module.py
+++ b/modules/text/language_model/bert-base-cased/module.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import Dict, List, Optional, Union, Tuple
+import os
+
+from paddle.dataset.common import DATA_HOME
+import paddle
+import paddle.nn as nn
+import paddle.nn.functional as F
+
+from paddlehub import BertTokenizer
+from paddlehub.module.modeling_bert import BertForSequenceClassification, BertModel
+from paddlehub.module.module import moduleinfo, serving
+from paddlehub.utils.log import logger
+from paddlehub.utils.utils import download
+
+
+@moduleinfo(
+    name="bert-base-cased",
+    version="2.0.0",
+    summary=
+    "bert_cased_L-12_H-768_A-12, 12-layer, 768-hidden, 12-heads, 110M parameters. The module is executed as paddle.dygraph.",
+    author="paddlepaddle",
+    author_email="",
+    type="nlp/semantic_model")
+class Bert(nn.Layer):
+    """
+    BERT model
+    """
+
+    def __init__(
+            self,
+            task=None,
+            load_checkpoint=None,
+            label_map=None,
+    ):
+        super(Bert, self).__init__()
+        # TODO(zhangxuefei): add token_classification task
+        if task == 'sequence_classification':
+            self.model = BertForSequenceClassification.from_pretrained(pretrained_model_name_or_path='bert-base-cased')
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = paddle.metric.Accuracy(name='acc_accumulation')
+        elif task is None:
+            self.model = BertModel.from_pretrained(pretrained_model_name_or_path='bert-base-cased')
+        else:
+            raise RuntimeError("Unknown task %s, task should be sequence_classification" % task)
+
+        self.task = task
+        self.label_map = label_map
+
+        if load_checkpoint is not None and os.path.isfile(load_checkpoint):
+            state_dict = paddle.load(load_checkpoint)
+            self.set_state_dict(state_dict)
+            logger.info('Loaded parameters from %s' % os.path.abspath(load_checkpoint))
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, labels=None):
+        result = self.model(input_ids, token_type_ids, position_ids, attention_mask)
+        if self.task is not None:
+            logits = result
+            probs = F.softmax(logits, axis=1)
+            if labels is not None:
+                loss = self.criterion(logits, labels)
+                correct = self.metric.compute(probs, labels)
+                acc = self.metric.update(correct)
+                return probs, loss, acc
+            return probs
+        else:
+            sequence_output, pooled_output = result
+            return sequence_output, pooled_output
+
+    def get_vocab_path(self):
+        """
+        Gets the path of the module vocabulary file, downloading it if it is missing.
+        """
+        save_path = os.path.join(DATA_HOME, 'bert-base-cased', 'bert-base-cased-vocab.txt')
+        if not os.path.exists(save_path) or not os.path.isfile(save_path):
+            url = "https://paddle-hapi.bj.bcebos.com/models/bert/bert-base-cased-vocab.txt"
+            download(url, os.path.join(DATA_HOME, 'bert-base-cased'))
+        return save_path
+
+    def get_tokenizer(self, tokenize_chinese_chars=True):
+        """
+        Gets the tokenizer that is customized for this module.
+        Args:
+            tokenize_chinese_chars (:obj: bool , defaults to :obj: True):
+                Whether to tokenize chinese characters or not.
+        Returns:
+            tokenizer (:obj:BertTokenizer) : The tokenizer which was customized for this module.
+        """
+        return BertTokenizer(tokenize_chinese_chars=tokenize_chinese_chars, vocab_file=self.get_vocab_path())
+
+    def training_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        One step for training, which should be called as forward computation.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data containing the inputs the model needs,
+                here input_ids, token_type_ids and labels.
+            batch_idx(int): The index of batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as loss and metrics.
+        """
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'loss': avg_loss, 'metrics': {'acc': acc}}
+
+    def validation_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        One step for validation, which should be called as forward computation.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data containing the inputs the model needs,
+                here input_ids, token_type_ids and labels.
+            batch_idx(int): The index of batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as metrics.
+        """
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'metrics': {'acc': acc}}
+
+    def predict(self, data, max_seq_len=128, batch_size=1, use_gpu=False):
+        """
+        Predicts the data labels.
+
+        Args:
+            data (obj:`List(List(str))`): The samples to predict; each element is a list of one or two raw texts.
+            max_seq_len (:obj:`int`, `optional`, defaults to 128):
+                Limits the total encoded sequence to this maximum length.
+            batch_size(obj:`int`, defaults to 1): The batch size.
+            use_gpu(obj:`bool`, defaults to `False`): Whether to use gpu to run or not.
+
+        Returns:
+            results(obj:`list`): All the predictions labels.
+        """
+        # TODO(zhangxuefei): add task token_classification task predict.
+        if self.task not in ['sequence_classification']:
+            raise RuntimeError("The predict method is for sequence_classification task, but got task %s." % self.task)
+
+        paddle.set_device('gpu' if use_gpu else 'cpu')
+        tokenizer = self.get_tokenizer()
+
+        examples = []
+        for text in data:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, max_seq_len=max_seq_len)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], max_seq_len=max_seq_len)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequences, but got %d. Please check your inputs.' % len(text))
+            examples.append((encoded_inputs['input_ids'], encoded_inputs['segment_ids']))
+
+        def _batchify_fn(batch):
+            input_ids = [entry[0] for entry in batch]
+            segment_ids = [entry[1] for entry in batch]
+            return input_ids, segment_ids
+
+        # Separates data into batches.
+        batches = []
+        one_batch = []
+        for example in examples:
+            one_batch.append(example)
+            if len(one_batch) == batch_size:
+                batches.append(one_batch)
+                one_batch = []
+        if one_batch:
+            # The last batch whose size is less than the config batch_size setting.
+            batches.append(one_batch)
+
+        results = []
+        self.eval()
+        for batch in batches:
+            input_ids, segment_ids = _batchify_fn(batch)
+            input_ids = paddle.to_tensor(input_ids)
+            segment_ids = paddle.to_tensor(segment_ids)
+
+            # TODO(zhangxuefei): add task token_classification postprocess after prediction.
+            if self.task == 'sequence_classification':
+                probs = self(input_ids, segment_ids)
+                idx = paddle.argmax(probs, axis=1).numpy()
+                idx = idx.tolist()
+                labels = [self.label_map[i] for i in idx]
+                results.extend(labels)
+
+        return results
+
+    @serving
+    def get_embedding(self, texts, use_gpu=False):
+        if self.task is not None:
+            raise RuntimeError("The get_embedding method is only valid when task is None, but got task %s" % self.task)
+
+        paddle.set_device('gpu' if use_gpu else 'cpu')
+
+        tokenizer = self.get_tokenizer()
+        results = []
+        for text in texts:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, pad_to_max_seq_len=False)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], pad_to_max_seq_len=False)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequences, but got %d. Please check your inputs.' % len(text))
+
+            input_ids = paddle.to_tensor(encoded_inputs['input_ids']).unsqueeze(0)
+            segment_ids = paddle.to_tensor(encoded_inputs['segment_ids']).unsqueeze(0)
+            sequence_output, pooled_output = self(input_ids, segment_ids)
+
+            sequence_output = sequence_output.squeeze(0)
+            pooled_output = pooled_output.squeeze(0)
+            results.append((sequence_output.numpy().tolist(), pooled_output.numpy().tolist()))
+        return results
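The batching loop in `predict` above can also be written with slicing. The sketch below uses a hypothetical `split_into_batches` helper; it is meant to produce the same grouping as the loop (full batches followed by one shorter final batch), not to replace the module's code.

```python
def split_into_batches(examples, batch_size):
    """Group examples into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1, got %d' % batch_size)
    # Step through the list in strides of batch_size; the last slice may be shorter.
    return [examples[start:start + batch_size] for start in range(0, len(examples), batch_size)]


batches = split_into_batches(['a', 'b', 'c', 'd', 'e'], batch_size=2)
print(batches)
```

With five examples and `batch_size=2`, the final batch holds the single leftover example, matching the "last batch whose size is less than the config batch_size" case handled in the module.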
--- a/modules/text/language_model/bert-base-chinese/README.md
+++ b/modules/text/language_model/bert-base-chinese/README.md
+```shell
+$ hub install bert-base-chinese==2.0.0
+```
+<p align="center">
+<img src="https://bj.bcebos.com/paddlehub/paddlehub-img/bert_network.png"  hspace='10'/> <br />
+</p>
+
+For more details, see the [BERT paper](https://arxiv.org/abs/1810.04805)
+
+## API
+
+```python
+def __init__(
+    task=None,
+    load_checkpoint=None,
+    label_map=None)
+```
+
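+Creates the Module object (dynamic graph version).
+
+**Parameters**
+
+* `task`: the task name; may be `sequence_classification`.
+* `load_checkpoint`: path to a model parameter file saved by training with the PaddleHub Fine-tune API.
+* `label_map`: the mapping from class index to label used at prediction time.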
+
+```python
+def predict(
+    data,
+    max_seq_len=128,
+    batch_size=1,
+    use_gpu=False)
+```
+
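+**Parameters**
+
+* `data`: the data to predict, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample
+    and each sample may contain text\_a and text\_b. The number of texts per sample (1 or 2) must match what was used in training.
+* `max_seq_len`: the maximum text length the model processes.
+* `batch_size`: the batch size used by the model.
+* `use_gpu`: whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.
+
+**Returns**
+
+* `results`: a list of predicted labels, one per sample, looked up in `label_map`.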
+
+```python
+def get_embedding(
+    texts,
+    use_gpu=False
+)
+```
+
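+Returns sentence-level and token-level features for the input texts.
+
+**Parameters**
+
+* `texts`: the list of input texts, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b.
+* `use_gpu`: whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.
+
+**Returns**
+
+* `results`: a list in the format \[\[sample\_a\_seq\_feature, sample\_a\_pooled\_feature\], \[sample\_b\_seq\_feature, sample\_b\_pooled\_feature\],…,\], where each element is the feature output for the corresponding sample: the token-level feature seq\_feature followed by the sentence-level feature pooled\_feature, in the order the module code returns them.
+
+
+**Code example**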
+
+```python
+import paddlehub as hub
+
+# Each sample is a list holding one text (or two, for sentence pairs).
+data = [
+    ['这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般'],
+    ['怀着十分激动的心情放映，可是看着看着发现，在放映完毕后，出现一集米老鼠的动画片'],
+    ['作为老的四星酒店，房间依然很整洁，相当不错。机场接机服务很好，可以在车上办理入住手续，节省时间。'],
+]
+label_map = {0: 'negative', 1: 'positive'}
+
+model = hub.Module(
+    name='bert-base-chinese',
+    version='2.0.0',
+    task='sequence_classification',
+    load_checkpoint='/path/to/parameters',
+    label_map=label_map)
+results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False)
+for idx, text in enumerate(data):
+    print('Data: {} \t Label: {}'.format(text[0], results[idx]))
+```
+
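+See the PaddleHub text classification example: https://github.com/PaddlePaddle/PaddleHub/tree/release/v2.0.0-beta/demo/text_classification
+
+## Service Deployment
+
+PaddleHub Serving can deploy an online service for obtaining pretrained word embeddings.
+
+### Step1: Start PaddleHub Serving
+
+Run the start command: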
+
+```shell
+$ hub serving start -m bert-base-chinese
+```
+
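+This deploys a service API for obtaining pretrained word embeddings; the default port is 8866.
+
+**NOTE:** To predict on a GPU, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise no setting is needed.
+
+### Step2: Send a prediction request
+
+With the server configured, the few lines of code below send a prediction request and retrieve the result.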
+
+```python
+import requests
+import json
+
+# Specify the texts to predict and build the dict {"texts": [text_1, text_2, ... ]}
+text = [["今天是个好日子", "天气预报说今天要下雨"], ["这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般"]]
+# The key names the prediction-method parameter that receives the texts; here it is "texts"
+# The equivalent local call is module.get_embedding(texts=text)
+data = {"texts": text}
+# Send a POST request to the serving host (replace the address with your own server)
+url = "http://10.12.121.132:8866/predict/bert-base-chinese"
+# Set the POST request headers to application/json
+headers = {"Content-Type": "application/json"}
+
+r = requests.post(url=url, headers=headers, data=json.dumps(data))
+print(r.json())
+```
+
+## Source Code
+
+https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/pretrain_langauge_models/BERT
+
+
+## Dependencies
+
+paddlepaddle >= 2.0.0
+
+paddlehub >= 2.0.0
+
+## Release History
+
+* 1.0.0
+
+  Initial release
+
+* 1.1.0
+
+  Added support for get_embedding and get_params_layer
+
+* 2.0.0
+
+  Fully upgraded to dynamic graph; some interfaces have changed.
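The `label_map` passed to the module in the README above turns predicted class indices into label strings. Below is a minimal pure-Python sketch of that mapping, assuming `probs` is a list of per-class probability rows. `probs_to_labels` is a hypothetical name; the module itself performs this step with `paddle.argmax`.

```python
def probs_to_labels(probs, label_map):
    """Map each row of class probabilities to the label of its largest entry."""
    labels = []
    for row in probs:
        # Index of the highest probability in this row.
        best_index = max(range(len(row)), key=row.__getitem__)
        labels.append(label_map[best_index])
    return labels


label_map = {0: 'negative', 1: 'positive'}
print(probs_to_labels([[0.9, 0.1], [0.2, 0.8]], label_map))
```

If `label_map` lacks an entry for a predicted index, the lookup raises `KeyError`, so the map should cover every class the model was fine-tuned on.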
--- a/modules/text/language_model/bert_cased_L_12_H_768_A_12/model/__init__.py
+++ b/modules/text/language_model/bert_cased_L_12_H_768_A_12/model/__init__.py
--- a/modules/text/language_model/bert-base-chinese/module.py
+++ b/modules/text/language_model/bert-base-chinese/module.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import Dict, List, Optional, Union, Tuple
+import os
+
+from paddle.dataset.common import DATA_HOME
+import paddle
+import paddle.nn as nn
+import paddle.nn.functional as F
+
+from paddlehub import BertTokenizer
+from paddlehub.module.modeling_bert import BertForSequenceClassification, BertModel
+from paddlehub.module.module import moduleinfo, serving
+from paddlehub.utils.log import logger
+from paddlehub.utils.utils import download
+
+
+@moduleinfo(
+    name="bert-base-chinese",
+    version="2.0.0",
+    summary=
+    "bert_chinese_L-12_H-768_A-12, 12-layer, 768-hidden, 12-heads, 110M parameters. The module is executed as paddle.dygraph.",
+    author="paddlepaddle",
+    author_email="",
+    type="nlp/semantic_model")
+class Bert(nn.Layer):
+    """
+    Bert model
+    """
+
+    def __init__(
+            self,
+            task=None,
+            load_checkpoint=None,
+            label_map=None,
+    ):
+        super(Bert, self).__init__()
+        # TODO(zhangxuefei): add token_classification task
+        if task == 'sequence_classification':
+            self.model = BertForSequenceClassification.from_pretrained(
+                pretrained_model_name_or_path='bert-base-chinese')
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = paddle.metric.Accuracy(name='acc_accumulation')
+        elif task is None:
+            self.model = BertModel.from_pretrained(pretrained_model_name_or_path='bert-base-chinese')
+        else:
+            raise RuntimeError("Unknown task %s, task should be sequence_classification" % task)
+
+        self.task = task
+        self.label_map = label_map
+
+        if load_checkpoint is not None and os.path.isfile(load_checkpoint):
+            state_dict = paddle.load(load_checkpoint)
+            self.set_state_dict(state_dict)
+            logger.info('Loaded parameters from %s' % os.path.abspath(load_checkpoint))
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, labels=None):
+        result = self.model(input_ids, token_type_ids, position_ids, attention_mask)
+        if self.task is not None:
+            logits = result
+            probs = F.softmax(logits, axis=1)
+            if labels is not None:
+                loss = self.criterion(logits, labels)
+                correct = self.metric.compute(probs, labels)
+                acc = self.metric.update(correct)
+                return probs, loss, acc
+            return probs
+        else:
+            sequence_output, pooled_output = result
+            return sequence_output, pooled_output
+
+    def get_vocab_path(self):
+        """
+        Gets the path of the module vocabulary file, downloading it if it is missing.
+        """
+        save_path = os.path.join(DATA_HOME, 'bert-base-chinese', 'bert-base-chinese-vocab.txt')
+        if not os.path.exists(save_path) or not os.path.isfile(save_path):
+            url = "https://paddle-hapi.bj.bcebos.com/models/bert/bert-base-chinese-vocab.txt"
+            download(url, os.path.join(DATA_HOME, 'bert-base-chinese'))
+        return save_path
+
+    def get_tokenizer(self, tokenize_chinese_chars=True):
+        """
+        Gets the tokenizer that is customized for this module.
+        Args:
+            tokenize_chinese_chars (:obj: bool , defaults to :obj: True):
+                Whether to tokenize chinese characters or not.
+        Returns:
+            tokenizer (:obj:BertTokenizer) : The tokenizer which was customized for this module.
+        """
+        return BertTokenizer(tokenize_chinese_chars=tokenize_chinese_chars, vocab_file=self.get_vocab_path())
+
+    def training_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        One step for training, which should be called as forward computation.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data containing the inputs the model needs,
+                here input_ids, token_type_ids and labels.
+            batch_idx(int): The index of batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as loss and metrics.
+        """
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'loss': avg_loss, 'metrics': {'acc': acc}}
+
+    def validation_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        One step for validation, which should be called as forward computation.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data containing the inputs the model needs,
+                here input_ids, token_type_ids and labels.
+            batch_idx(int): The index of batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as metrics.
+        """
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'metrics': {'acc': acc}}
+
+    def predict(self, data, max_seq_len=128, batch_size=1, use_gpu=False):
+        """
+        Predicts the data labels.
+
+        Args:
+            data (obj:`List(List(str))`): The samples to predict; each element is a list of one or two raw texts.
+            max_seq_len (:obj:`int`, `optional`, defaults to 128):
+                Limits the total encoded sequence to this maximum length.
+            batch_size(obj:`int`, defaults to 1): The batch size.
+            use_gpu(obj:`bool`, defaults to `False`): Whether to use gpu to run or not.
+
+        Returns:
+            results(obj:`list`): All the predictions labels.
+        """
+        # TODO(zhangxuefei): add task token_classification task predict.
+        if self.task not in ['sequence_classification']:
+            raise RuntimeError("The predict method is for sequence_classification task, but got task %s." % self.task)
+
+        paddle.set_device('gpu' if use_gpu else 'cpu')
+        tokenizer = self.get_tokenizer()
+
+        examples = []
+        for text in data:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, max_seq_len=max_seq_len)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], max_seq_len=max_seq_len)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequences, but got %d. Please check your inputs.' % len(text))
+            examples.append((encoded_inputs['input_ids'], encoded_inputs['segment_ids']))
+
+        def _batchify_fn(batch):
+            input_ids = [entry[0] for entry in batch]
+            segment_ids = [entry[1] for entry in batch]
+            return input_ids, segment_ids
+
+        # Separates data into batches.
+        batches = []
+        one_batch = []
+        for example in examples:
+            one_batch.append(example)
+            if len(one_batch) == batch_size:
+                batches.append(one_batch)
+                one_batch = []
+        if one_batch:
+            # The last batch whose size is less than the config batch_size setting.
+            batches.append(one_batch)
+
+        results = []
+        self.eval()
+        for batch in batches:
+            input_ids, segment_ids = _batchify_fn(batch)
+            input_ids = paddle.to_tensor(input_ids)
+            segment_ids = paddle.to_tensor(segment_ids)
+
+            # TODO(zhangxuefei): add task token_classification postprocess after prediction.
+            if self.task == 'sequence_classification':
+                probs = self(input_ids, segment_ids)
+                idx = paddle.argmax(probs, axis=1).numpy()
+                idx = idx.tolist()
+                labels = [self.label_map[i] for i in idx]
+                results.extend(labels)
+
+        return results
+
+    @serving
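+    def get_embedding(self, texts, use_gpu=False):
+        """
+        Gets the token-level and sentence-level features of the input texts.
+
+        Args:
+            texts (obj:`List(List(str))`): The input samples; each element is a list of one or two raw texts.
+            use_gpu(obj:`bool`, defaults to `False`): Whether to use gpu to run or not.
+
+        Returns:
+            results(obj:`list`): One (sequence_output, pooled_output) pair per sample, as nested lists.
+        """
+        if self.task is not None:
+            raise RuntimeError("The get_embedding method is only valid when task is None, but got task %s" % self.task)
+
+        paddle.set_device('gpu' if use_gpu else 'cpu')
+        # Switch to evaluation mode so that dropout does not perturb the features.
+        self.eval()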
+
+        tokenizer = self.get_tokenizer()
+        results = []
+        for text in texts:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, pad_to_max_seq_len=False)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], pad_to_max_seq_len=False)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequences, but got %d. Please check your inputs.' % len(text))
+
+            input_ids = paddle.to_tensor(encoded_inputs['input_ids']).unsqueeze(0)
+            segment_ids = paddle.to_tensor(encoded_inputs['segment_ids']).unsqueeze(0)
+            sequence_output, pooled_output = self(input_ids, segment_ids)
+
+            sequence_output = sequence_output.squeeze(0)
+            pooled_output = pooled_output.squeeze(0)
+            results.append((sequence_output.numpy().tolist(), pooled_output.numpy().tolist()))
+        return results
--- a/modules/text/language_model/bert-base-multilingual-cased/README.md
+++ b/modules/text/language_model/bert-base-multilingual-cased/README.md
+```shell
+$ hub install bert-base-multilingual-cased==2.0.0
+```
+<p align="center">
+<img src="https://bj.bcebos.com/paddlehub/paddlehub-img/bert_network.png"  hspace='10'/> <br />
+</p>
+
+For more details, see the [BERT paper](https://arxiv.org/abs/1810.04805).
+
+## API
+
+```python
+def __init__(
+    task=None,
+    load_checkpoint=None,
+    label_map=None)
+```
+
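+Creates the Module object (dynamic-graph version).
+
+**Parameters**
+
+* `task`: task name; may be `sequence_classification`. Leave it as `None` to use `get_embedding`.
+* `load_checkpoint`: path to the model parameter file saved by training with the PaddleHub Fine-tune API.
+* `label_map`: mapping from class index to label, used at prediction time.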
+
+```python
+def predict(
+    data,
+    max_seq_len=128,
+    batch_size=1,
+    use_gpu=False)
+```
+
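+**Parameters**
+
+* `data`: the data to predict, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\]. Each element is one sample,
+    and each sample may contain text\_a and text\_b. The number of texts per sample (1 or 2) must match what was used in training.
+* `max_seq_len`: maximum sequence length the model processes.
+* `batch_size`: batch size used for prediction.
+* `use_gpu`: whether to use the GPU; defaults to False. GPU users are advised to enable it.
+
+**Returns**
+
+* `results`: a list holding the predicted label of each sample.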
+
+```python
+def get_embedding(
+    texts,
+    use_gpu=False
+)
+```
+
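+Returns the sentence-level and token-level features of the input texts.
+
+**Parameters**
+
+* `texts`: list of input texts, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\]. Each element is one sample, and each sample may contain text\_a and text\_b.
+* `use_gpu`: whether to use the GPU; defaults to False. GPU users are advised to enable it.
+
+**Returns**
+
+* `results`: a list in the format \[\[sample\_a\_seq\_feature, sample\_a\_pooled\_feature\], \[sample\_b\_seq\_feature, sample\_b\_pooled\_feature\],…,\]. Each element is the feature output of the corresponding sample: the token-level feature seq\_feature followed by the sentence-level feature pooled\_feature, in the order `get_embedding` in module.py appends them.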
+
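As a rough illustration of consuming the `get_embedding` output, the sketch below compares two samples by the cosine similarity of their sentence-level vectors. The `results` literal is made-up stand-in data with tiny 3-dimensional vectors (the module summary describes a 768-hidden model, so real vectors are much longer), and `cosine_similarity` is a hypothetical helper, not part of PaddleHub. It assumes each entry is ordered as (token-level features, sentence-level feature), which is the order `get_embedding` in module.py appends.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors given as plain lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Stand-in for the list returned by get_embedding (made-up values).
# Each entry: (token-level features, sentence-level feature).
results = [
    ([[0.1, 0.2, 0.3], [0.0, 0.1, 0.0]], [1.0, 0.0, 0.0]),
    ([[0.3, 0.1, 0.2]], [1.0, 1.0, 0.0]),
]

pooled_a = results[0][1]
pooled_b = results[1][1]
similarity = cosine_similarity(pooled_a, pooled_b)
print('similarity: %.4f' % similarity)
```

With real output, `results` would come from `model.get_embedding(texts)` on a module created with `task=None`.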
+
+**Code example**
+
+```python
+import paddlehub as hub
+
+# Each sample is a list of one or two texts, which is what predict() expects.
+data = [
+    ['这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般'],
+    ['怀着十分激动的心情放映，可是看着看着发现，在放映完毕后，出现一集米老鼠的动画片'],
+    ['作为老的四星酒店，房间依然很整洁，相当不错。机场接机服务很好，可以在车上办理入住手续，节省时间。'],
+]
+label_map = {0: 'negative', 1: 'positive'}
+
+model = hub.Module(
+    name='bert-base-multilingual-cased',
+    version='2.0.0',
+    task='sequence_classification',
+    load_checkpoint='/path/to/parameters',
+    label_map=label_map)
+results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False)
+for idx, text in enumerate(data):
+    print('Data: {} \t Label: {}'.format(text[0], results[idx]))
+```
+
+See the PaddleHub text classification demo: https://github.com/PaddlePaddle/PaddleHub/tree/release/v2.0.0-beta/demo/text_classifcation
+
+## Serving deployment
+
+PaddleHub Serving can deploy an online service for retrieving pretrained embeddings.
+
+### Step1: Start PaddleHub Serving
+
+Run the start command:
+
+```shell
+$ hub serving start -m bert-base-multilingual-cased
+```
+
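+This deploys the pretrained-embedding service API; the default port is 8866.
+
+**NOTE:** To predict on GPU, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise no setting is needed.
+
+### Step2: Send a prediction request
+
+With the server running, the few lines below send a prediction request and fetch the result.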
+
+```python
+import requests
+import json
+
+# Texts to run through the model; they are sent as the dict {"texts": [text_1, text_2, ... ]}
+text = [["今天是个好日子", "天气预报说今天要下雨"], ["这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般"]]
+# The key names the argument that the text is passed to in the prediction method; here it is "texts"
+# The equivalent local call is module.get_embedding(texts=text)
+data = {"texts": text}
+# Endpoint for the POST request; the content type must be JSON
+url = "http://10.12.121.132:8866/predict/bert-base-multilingual-cased"
+# Set the POST request headers to application/json
+headers = {"Content-Type": "application/json"}
+
+r = requests.post(url=url, headers=headers, data=json.dumps(data))
+print(r.json())
+```
+
+## View the code
+
+https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/pretrain_langauge_models/BERT
+
+
+## Dependencies
+
+paddlepaddle >= 2.0.0
+
+paddlehub >= 2.0.0
+
+## Release history
+
+* 1.0.0
+
+  Initial release
+
+* 1.1.0
+
+  Added support for get_embedding and get_params_layer
+
+* 2.0.0
+
+  Fully upgraded to dynamic graph; the interface has changed.
--- a/modules/text/language_model/bert_cased_L_24_H_1024_A_16/__init__.py
+++ b/modules/text/language_model/bert_cased_L_24_H_1024_A_16/__init__.py
--- a/modules/text/language_model/bert-base-multilingual-cased/module.py
+++ b/modules/text/language_model/bert-base-multilingual-cased/module.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import Dict, List, Optional, Union, Tuple
+import os
+
+from paddle.dataset.common import DATA_HOME
+import paddle
+import paddle.nn as nn
+import paddle.nn.functional as F
+
+from paddlehub import BertTokenizer
+from paddlehub.module.modeling_bert import BertForSequenceClassification, BertModel
+from paddlehub.module.module import moduleinfo, serving
+from paddlehub.utils.log import logger
+from paddlehub.utils.utils import download
+
+
+@moduleinfo(
+    name="bert-base-multilingual-cased",
+    version="2.0.0",
+    summary=
+    "bert_multi_cased_L-12_H-768_A-12, 12-layer, 768-hidden, 12-heads, 110M parameters. The module is executed as paddle.dygraph.",
+    author="paddlepaddle",
+    author_email="",
+    type="nlp/semantic_model")
+class Bert(nn.Layer):
+    """
+    BERT model
+    """
+
+    def __init__(
+            self,
+            task=None,
+            load_checkpoint=None,
+            label_map=None,
+    ):
+        super(Bert, self).__init__()
+        # TODO(zhangxuefei): add token_classification task
+        if task == 'sequence_classification':
+            self.model = BertForSequenceClassification.from_pretrained(
+                pretrained_model_name_or_path='bert-base-multilingual-cased')
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = paddle.metric.Accuracy(name='acc_accumulation')
+        elif task is None:
+            self.model = BertModel.from_pretrained(pretrained_model_name_or_path='bert-base-multilingual-cased')
+        else:
+            raise RuntimeError("Unknown task %s, task should be sequence_classification" % task)
+
+        self.task = task
+        self.label_map = label_map
+
+        if load_checkpoint is not None and os.path.isfile(load_checkpoint):
+            state_dict = paddle.load(load_checkpoint)
+            self.set_state_dict(state_dict)
+            logger.info('Loaded parameters from %s' % os.path.abspath(load_checkpoint))
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, labels=None):
+        result = self.model(input_ids, token_type_ids, position_ids, attention_mask)
+        if self.task is not None:
+            logits = result
+            probs = F.softmax(logits, axis=1)
+            if labels is not None:
+                loss = self.criterion(logits, labels)
+                correct = self.metric.compute(probs, labels)
+                acc = self.metric.update(correct)
+                return probs, loss, acc
+            return probs
+        else:
+            sequence_output, pooled_output = result
+            return sequence_output, pooled_output
+
+    def get_vocab_path(self):
+        """
+        Gets the path of the module vocabulary file, downloading the file first if it is missing.
+        """
+        save_path = os.path.join(DATA_HOME, 'bert-base-multilingual-cased', 'bert-base-multilingual-cased-vocab.txt')
+        if not os.path.exists(save_path) or not os.path.isfile(save_path):
+            url = "https://paddle-hapi.bj.bcebos.com/models/bert/bert-base-multilingual-cased-vocab.txt"
+            download(url, os.path.join(DATA_HOME, 'bert-base-multilingual-cased'))
+        return save_path
+
+    def get_tokenizer(self, tokenize_chinese_chars=True):
+        """
+        Gets the tokenizer that is customized for this module.
+        Args:
+            tokenize_chinese_chars (:obj: bool , defaults to :obj: True):
+                Whether to tokenize chinese characters or not.
+        Returns:
+            tokenizer (:obj:BertTokenizer) : The tokenizer which was customized for this module.
+        """
+        return BertTokenizer(tokenize_chinese_chars=tokenize_chinese_chars, vocab_file=self.get_vocab_path())
+
+    def training_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        One step for training, which should be called as forward computation.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data holding the model inputs, in order:
+                input_ids, token_type_ids (segment ids) and labels.
+            batch_idx(int): The index of batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as loss and metrics.
+        """
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'loss': avg_loss, 'metrics': {'acc': acc}}
+
+    def validation_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        One step for validation, which should be called as forward computation.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data holding the model inputs, in order:
+                input_ids, token_type_ids (segment ids) and labels.
+            batch_idx(int): The index of batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as metrics.
+        """
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'metrics': {'acc': acc}}
+
+    def predict(self, data, max_seq_len=128, batch_size=1, use_gpu=False):
+        """
+        Predicts the data labels.
+
+        Args:
+            data (obj:`List(List(str))`): The samples to predict; each element is a list of one or two raw texts.
+            max_seq_len (:obj:`int`, `optional`, defaults to 128):
+                If set to a number, will limit the total sequence returned so that it has a maximum length.
+            batch_size(obj:`int`, defaults to 1): The number of samples per batch.
+            use_gpu(obj:`bool`, defaults to `False`): Whether to use gpu to run or not.
+
+        Returns:
+            results(obj:`list`): The predicted label of each sample.
+        """
+        # TODO(zhangxuefei): add task token_classification task predict.
+        if self.task not in ['sequence_classification']:
+            raise RuntimeError("The predict method is for sequence_classification task, but got task %s." % self.task)
+
+        paddle.set_device('gpu' if use_gpu else 'cpu')
+        tokenizer = self.get_tokenizer()
+
+        examples = []
+        for text in data:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, max_seq_len=max_seq_len)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], max_seq_len=max_seq_len)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequences, but got %d. Please check your inputs.' % len(text))
+            examples.append((encoded_inputs['input_ids'], encoded_inputs['segment_ids']))
+
+        def _batchify_fn(batch):
+            input_ids = [entry[0] for entry in batch]
+            segment_ids = [entry[1] for entry in batch]
+            return input_ids, segment_ids
+
+        # Separates the examples into batches.
+        batches = []
+        one_batch = []
+        for example in examples:
+            one_batch.append(example)
+            if len(one_batch) == batch_size:
+                batches.append(one_batch)
+                one_batch = []
+        if one_batch:
+            # The last batch whose size is less than the config batch_size setting.
+            batches.append(one_batch)
+
+        results = []
+        self.eval()
+        for batch in batches:
+            input_ids, segment_ids = _batchify_fn(batch)
+            input_ids = paddle.to_tensor(input_ids)
+            segment_ids = paddle.to_tensor(segment_ids)
+
+            # TODO(zhangxuefei): add task token_classification postprocess after prediction.
+            if self.task == 'sequence_classification':
+                probs = self(input_ids, segment_ids)
+                idx = paddle.argmax(probs, axis=1).numpy()
+                idx = idx.tolist()
+                labels = [self.label_map[i] for i in idx]
+                results.extend(labels)
+
+        return results
+
+    @serving
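+    def get_embedding(self, texts, use_gpu=False):
+        """
+        Gets the token-level and sentence-level features of the input texts.
+
+        Args:
+            texts (obj:`List(List(str))`): The input samples; each element is a list of one or two raw texts.
+            use_gpu(obj:`bool`, defaults to `False`): Whether to use gpu to run or not.
+
+        Returns:
+            results(obj:`list`): One (sequence_output, pooled_output) pair per sample, as nested lists.
+        """
+        if self.task is not None:
+            raise RuntimeError("The get_embedding method is only valid when task is None, but got task %s" % self.task)
+
+        paddle.set_device('gpu' if use_gpu else 'cpu')
+        # Switch to evaluation mode so that dropout does not perturb the features.
+        self.eval()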
+
+        tokenizer = self.get_tokenizer()
+        results = []
+        for text in texts:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, pad_to_max_seq_len=False)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], pad_to_max_seq_len=False)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequences, but got %d. Please check your inputs.' % len(text))
+
+            input_ids = paddle.to_tensor(encoded_inputs['input_ids']).unsqueeze(0)
+            segment_ids = paddle.to_tensor(encoded_inputs['segment_ids']).unsqueeze(0)
+            sequence_output, pooled_output = self(input_ids, segment_ids)
+
+            sequence_output = sequence_output.squeeze(0)
+            pooled_output = pooled_output.squeeze(0)
+            results.append((sequence_output.numpy().tolist(), pooled_output.numpy().tolist()))
+        return results
--- a/modules/text/language_model/bert-base-multilingual-uncased/README.md
+++ b/modules/text/language_model/bert-base-multilingual-uncased/README.md
+```shell
+$ hub install bert-base-multilingual-uncased==2.0.0
+```
+<p align="center">
+<img src="https://bj.bcebos.com/paddlehub/paddlehub-img/bert_network.png"  hspace='10'/> <br />
+</p>
+
+For more details, see the [BERT paper](https://arxiv.org/abs/1810.04805).
+
+## API
+
+```python
+def __init__(
+    task=None,
+    load_checkpoint=None,
+    label_map=None)
+```
+
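+Creates the Module object (dynamic-graph version).
+
+**Parameters**
+
+* `task`: task name; may be `sequence_classification`. Leave it as `None` to use `get_embedding`.
+* `load_checkpoint`: path to the model parameter file saved by training with the PaddleHub Fine-tune API.
+* `label_map`: mapping from class index to label, used at prediction time.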
+
+```python
+def predict(
+    data,
+    max_seq_len=128,
+    batch_size=1,
+    use_gpu=False)
+```
+
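+**Parameters**
+
+* `data`: the data to predict, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\]. Each element is one sample,
+    and each sample may contain text\_a and text\_b. The number of texts per sample (1 or 2) must match what was used in training.
+* `max_seq_len`: maximum sequence length the model processes.
+* `batch_size`: batch size used for prediction.
+* `use_gpu`: whether to use the GPU; defaults to False. GPU users are advised to enable it.
+
+**Returns**
+
+* `results`: a list holding the predicted label of each sample.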
+
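The `batch_size` parameter controls how `predict` groups the encoded samples before each forward pass. The module.py in this change builds the batches with an explicit loop; the sketch below shows the same grouping with list slicing, where the final batch may hold fewer than `batch_size` samples. `split_into_batches` is a hypothetical name used only for illustration.

```python
def split_into_batches(examples, batch_size):
    """Group examples into consecutive batches; the last one may be smaller."""
    if batch_size < 1:
        raise ValueError('batch_size must be a positive integer')
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]


# Five single-text samples grouped two at a time.
samples = [['text 1'], ['text 2'], ['text 3'], ['text 4'], ['text 5']]
batches = split_into_batches(samples, batch_size=2)
print([len(batch) for batch in batches])
```

A larger `batch_size` means fewer forward passes at the cost of more memory per pass.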
+```python
+def get_embedding(
+    texts,
+    use_gpu=False
+)
+```
+
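+Returns the sentence-level and token-level features of the input texts.
+
+**Parameters**
+
+* `texts`: list of input texts, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\]. Each element is one sample, and each sample may contain text\_a and text\_b.
+* `use_gpu`: whether to use the GPU; defaults to False. GPU users are advised to enable it.
+
+**Returns**
+
+* `results`: a list in the format \[\[sample\_a\_seq\_feature, sample\_a\_pooled\_feature\], \[sample\_b\_seq\_feature, sample\_b\_pooled\_feature\],…,\]. Each element is the feature output of the corresponding sample: the token-level feature seq\_feature followed by the sentence-level feature pooled\_feature, in the order `get_embedding` in module.py appends them.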
+
+
+**Code example**
+
+```python
+import paddlehub as hub
+
+# Each sample is a list of one or two texts, which is what predict() expects.
+data = [
+    ['这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般'],
+    ['怀着十分激动的心情放映，可是看着看着发现，在放映完毕后，出现一集米老鼠的动画片'],
+    ['作为老的四星酒店，房间依然很整洁，相当不错。机场接机服务很好，可以在车上办理入住手续，节省时间。'],
+]
+label_map = {0: 'negative', 1: 'positive'}
+
+model = hub.Module(
+    name='bert-base-multilingual-uncased',
+    version='2.0.0',
+    task='sequence_classification',
+    load_checkpoint='/path/to/parameters',
+    label_map=label_map)
+results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False)
+for idx, text in enumerate(data):
+    print('Data: {} \t Label: {}'.format(text[0], results[idx]))
+```
+
+See the PaddleHub text classification demo: https://github.com/PaddlePaddle/PaddleHub/tree/release/v2.0.0-beta/demo/text_classifcation
+
+## Serving deployment
+
+PaddleHub Serving can deploy an online service for retrieving pretrained embeddings.
+
+### Step1: Start PaddleHub Serving
+
+Run the start command:
+
+```shell
+$ hub serving start -m bert-base-multilingual-uncased
+```
+
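+This deploys the pretrained-embedding service API; the default port is 8866.
+
+**NOTE:** To predict on GPU, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise no setting is needed.
+
+### Step2: Send a prediction request
+
+With the server running, the few lines below send a prediction request and fetch the result.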
+
+```python
+import requests
+import json
+
+# Texts to run through the model; they are sent as the dict {"texts": [text_1, text_2, ... ]}
+text = [["今天是个好日子", "天气预报说今天要下雨"], ["这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般"]]
+# The key names the argument that the text is passed to in the prediction method; here it is "texts"
+# The equivalent local call is module.get_embedding(texts=text)
+data = {"texts": text}
+# Endpoint for the POST request; the content type must be JSON
+url = "http://10.12.121.132:8866/predict/bert-base-multilingual-uncased"
+# Set the POST request headers to application/json
+headers = {"Content-Type": "application/json"}
+
+r = requests.post(url=url, headers=headers, data=json.dumps(data))
+print(r.json())
+```
+
+## View the code
+
+https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/pretrain_langauge_models/BERT
+
+
+## Dependencies
+
+paddlepaddle >= 2.0.0
+
+paddlehub >= 2.0.0
+
+## Release history
+
+* 1.0.0
+
+  Initial release
+
+* 1.1.0
+
+  Added support for get_embedding and get_params_layer
+
+* 2.0.0
+
+  Fully upgraded to dynamic graph; the interface has changed.
--- a/modules/text/language_model/bert_cased_L_24_H_1024_A_16/model/__init__.py
+++ b/modules/text/language_model/bert_cased_L_24_H_1024_A_16/model/__init__.py
--- a/modules/text/language_model/bert-base-multilingual-uncased/module.py
+++ b/modules/text/language_model/bert-base-multilingual-uncased/module.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import Dict, List, Optional, Union, Tuple
+import os
+
+from paddle.dataset.common import DATA_HOME
+import paddle
+import paddle.nn as nn
+import paddle.nn.functional as F
+
+from paddlehub import BertTokenizer
+from paddlehub.module.modeling_bert import BertForSequenceClassification, BertModel
+from paddlehub.module.module import moduleinfo, serving
+from paddlehub.utils.log import logger
+from paddlehub.utils.utils import download
+
+
+@moduleinfo(
+    name="bert-base-multilingual-uncased",
+    version="2.0.0",
+    summary=
+    "bert_multi_uncased_L-12_H-768_A-12, 12-layer, 768-hidden, 12-heads, 110M parameters. The module is executed as paddle.dygraph.",
+    author="paddlepaddle",
+    author_email="",
+    type="nlp/semantic_model")
+class Bert(nn.Layer):
+    """
+    BERT model
+    """
+
+    def __init__(
+            self,
+            task=None,
+            load_checkpoint=None,
+            label_map=None,
+    ):
+        super(Bert, self).__init__()
+        # TODO(zhangxuefei): add token_classification task
+        if task == 'sequence_classification':
+            self.model = BertForSequenceClassification.from_pretrained(
+                pretrained_model_name_or_path='bert-base-multilingual-uncased')
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = paddle.metric.Accuracy(name='acc_accumulation')
+        elif task is None:
+            self.model = BertModel.from_pretrained(pretrained_model_name_or_path='bert-base-multilingual-uncased')
+        else:
+            raise RuntimeError("Unknown task %s, task should be sequence_classification" % task)
+
+        self.task = task
+        self.label_map = label_map
+
+        if load_checkpoint is not None and os.path.isfile(load_checkpoint):
+            state_dict = paddle.load(load_checkpoint)
+            self.set_state_dict(state_dict)
+            logger.info('Loaded parameters from %s' % os.path.abspath(load_checkpoint))
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, labels=None):
+        result = self.model(input_ids, token_type_ids, position_ids, attention_mask)
+        if self.task is not None:
+            logits = result
+            probs = F.softmax(logits, axis=1)
+            if labels is not None:
+                loss = self.criterion(logits, labels)
+                correct = self.metric.compute(probs, labels)
+                acc = self.metric.update(correct)
+                return probs, loss, acc
+            return probs
+        else:
+            sequence_output, pooled_output = result
+            return sequence_output, pooled_output
+
+    def get_vocab_path(self):
+        """
+        Gets the path of the module vocabulary file, downloading the file first if it is missing.
+        """
+        save_path = os.path.join(DATA_HOME, 'bert-base-multilingual-uncased',
+                                 'bert-base-multilingual-uncased-vocab.txt')
+        if not os.path.exists(save_path) or not os.path.isfile(save_path):
+            url = "https://paddle-hapi.bj.bcebos.com/models/bert/bert-base-multilingual-uncased-vocab.txt"
+            download(url, os.path.join(DATA_HOME, 'bert-base-multilingual-uncased'))
+        return save_path
+
+    def get_tokenizer(self, tokenize_chinese_chars=True):
+        """
+        Gets the tokenizer that is customized for this module.
+        Args:
+            tokenize_chinese_chars (:obj: bool , defaults to :obj: True):
+                Whether to tokenize chinese characters or not.
+        Returns:
+            tokenizer (:obj:BertTokenizer) : The tokenizer which was customized for this module.
+        """
+        return BertTokenizer(tokenize_chinese_chars=tokenize_chinese_chars, vocab_file=self.get_vocab_path())
+
+    def training_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        One step for training, which should be called as forward computation.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data holding the model inputs, in order:
+                input_ids, token_type_ids (segment ids) and labels.
+            batch_idx(int): The index of batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as loss and metrics.
+        """
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'loss': avg_loss, 'metrics': {'acc': acc}}
+
+    def validation_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        One step for validation, which should be called as forward computation.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data holding the model inputs, in order:
+                input_ids, token_type_ids (segment ids) and labels.
+            batch_idx(int): The index of batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as metrics.
+        """
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'metrics': {'acc': acc}}
+
+    def predict(self, data, max_seq_len=128, batch_size=1, use_gpu=False):
+        """
+        Predicts the data labels.
+
+        Args:
+            data (obj:`List(List(str))`): The samples to predict; each element is a list of one or two raw texts.
+            max_seq_len (:obj:`int`, `optional`, defaults to 128):
+                If set to a number, will limit the total sequence returned so that it has a maximum length.
+            batch_size(obj:`int`, defaults to 1): The number of samples per batch.
+            use_gpu(obj:`bool`, defaults to `False`): Whether to use gpu to run or not.
+
+        Returns:
+            results(obj:`list`): The predicted label of each sample.
+        """
+        # TODO(zhangxuefei): add task token_classification task predict.
+        if self.task not in ['sequence_classification']:
+            raise RuntimeError("The predict method is for sequence_classification task, but got task %s." % self.task)
+
+        paddle.set_device('gpu' if use_gpu else 'cpu')
+        tokenizer = self.get_tokenizer()
+
+        examples = []
+        for text in data:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, max_seq_len=max_seq_len)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], max_seq_len=max_seq_len)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequences, but got %d. Please check your inputs.' % len(text))
+            examples.append((encoded_inputs['input_ids'], encoded_inputs['segment_ids']))
+
+        def _batchify_fn(batch):
+            input_ids = [entry[0] for entry in batch]
+            segment_ids = [entry[1] for entry in batch]
+            return input_ids, segment_ids
+
+        # Separates the examples into batches.
+        batches = []
+        one_batch = []
+        for example in examples:
+            one_batch.append(example)
+            if len(one_batch) == batch_size:
+                batches.append(one_batch)
+                one_batch = []
+        if one_batch:
+            # The last batch whose size is less than the config batch_size setting.
+            batches.append(one_batch)
+
+        results = []
+        self.eval()
+        for batch in batches:
+            input_ids, segment_ids = _batchify_fn(batch)
+            input_ids = paddle.to_tensor(input_ids)
+            segment_ids = paddle.to_tensor(segment_ids)
+
+            # TODO(zhangxuefei): add task token_classification postprocess after prediction.
+            if self.task == 'sequence_classification':
+                probs = self(input_ids, segment_ids)
+                idx = paddle.argmax(probs, axis=1).numpy()
+                idx = idx.tolist()
+                labels = [self.label_map[i] for i in idx]
+                results.extend(labels)
+
+        return results
+
+    @serving
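+    def get_embedding(self, texts, use_gpu=False):
+        """
+        Gets the token-level and sentence-level features of the input texts.
+
+        Args:
+            texts (obj:`List(List(str))`): The input samples; each element is a list of one or two raw texts.
+            use_gpu(obj:`bool`, defaults to `False`): Whether to use gpu to run or not.
+
+        Returns:
+            results(obj:`list`): One (sequence_output, pooled_output) pair per sample, as nested lists.
+        """
+        if self.task is not None:
+            raise RuntimeError("The get_embedding method is only valid when task is None, but got task %s" % self.task)
+
+        paddle.set_device('gpu' if use_gpu else 'cpu')
+        # Switch to evaluation mode so that dropout does not perturb the features.
+        self.eval()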
+
+        tokenizer = self.get_tokenizer()
+        results = []
+        for text in texts:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, pad_to_max_seq_len=False)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], pad_to_max_seq_len=False)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequences, but got %d. Please check your inputs.' % len(text))
+
+            input_ids = paddle.to_tensor(encoded_inputs['input_ids']).unsqueeze(0)
+            segment_ids = paddle.to_tensor(encoded_inputs['segment_ids']).unsqueeze(0)
+            sequence_output, pooled_output = self(input_ids, segment_ids)
+
+            sequence_output = sequence_output.squeeze(0)
+            pooled_output = pooled_output.squeeze(0)
+            results.append((sequence_output.numpy().tolist(), pooled_output.numpy().tolist()))
+        return results
--- a/modules/text/language_model/bert-base-uncased/README.md
+++ b/modules/text/language_model/bert-base-uncased/README.md
+```shell
+$ hub install bert-base-uncased==2.0.0
+```
+<p align="center">
+<img src="https://bj.bcebos.com/paddlehub/paddlehub-img/bert_network.png"  hspace='10'/> <br />
+</p>
+
+For more details, see the [BERT paper](https://arxiv.org/abs/1810.04805).
+
+## API
+
+```python
+def __init__(
+    task=None,
+    load_checkpoint=None,
+    label_map=None)
+```
+
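+Creates the Module object (dynamic-graph version).
+
+**Parameters**
+
+* `task`: task name; may be `sequence_classification`. Leave it as `None` to use `get_embedding`.
+* `load_checkpoint`: path to the model parameter file saved by training with the PaddleHub Fine-tune API.
+* `label_map`: mapping from class index to label, used at prediction time.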
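To make the role of `label_map` concrete, here is a small pure-Python sketch of the post-processing the `predict` implementations in this change perform: softmax over the logits, argmax over the class axis, then a `label_map` lookup. The logits are made-up values, and `softmax` is a local helper written for this illustration rather than the Paddle function.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


label_map = {0: 'negative', 1: 'positive'}

# Made-up logits for two samples, one row per sample.
batch_logits = [[2.0, -1.0], [0.5, 1.5]]

labels = []
for logits in batch_logits:
    probs = softmax(logits)
    # argmax: index of the most probable class
    idx = max(range(len(probs)), key=probs.__getitem__)
    labels.append(label_map[idx])
print(labels)
```

The keys of `label_map` must therefore be the integer class indices used during training; an index missing from the map would raise a `KeyError` at lookup time.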
+
+```python
+def predict(
+    data,
+    max_seq_len=128,
+    batch_size=1,
+    use_gpu=False)
+```
+
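+**Parameters**
+
+* `data`: the data to predict, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\]. Each element is one sample,
+    and each sample may contain text\_a and text\_b. The number of texts per sample (1 or 2) must match what was used in training.
+* `max_seq_len`: maximum sequence length the model processes.
+* `batch_size`: batch size used for prediction.
+* `use_gpu`: whether to use the GPU; defaults to False. GPU users are advised to enable it.
+
+**Returns**
+
+* `results`: a list holding the predicted label of each sample.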
+
+```python
+def get_embedding(
+    texts,
+    use_gpu=False
+)
+```
+
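+Returns the sentence-level and token-level features of the input texts.
+
+**Parameters**
+
+* `texts`: list of input texts, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\]. Each element is one sample, and each sample may contain text\_a and text\_b.
+* `use_gpu`: whether to use the GPU; defaults to False. GPU users are advised to enable it.
+
+**Returns**
+
+* `results`: a list in the format \[\[sample\_a\_seq\_feature, sample\_a\_pooled\_feature\], \[sample\_b\_seq\_feature, sample\_b\_pooled\_feature\],…,\]. Each element is the feature output of the corresponding sample: the token-level feature seq\_feature followed by the sentence-level feature pooled\_feature, in the order the `get_embedding` implementations in this change append them.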
+
+
+**Code example**
+
+```python
+import paddlehub as hub
+
+# Each sample is a list of one or two texts, which is what predict() expects.
+data = [
+    ['这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般'],
+    ['怀着十分激动的心情放映，可是看着看着发现，在放映完毕后，出现一集米老鼠的动画片'],
+    ['作为老的四星酒店，房间依然很整洁，相当不错。机场接机服务很好，可以在车上办理入住手续，节省时间。'],
+]
+label_map = {0: 'negative', 1: 'positive'}
+
+model = hub.Module(
+    name='bert-base-uncased',
+    version='2.0.0',
+    task='sequence_classification',
+    load_checkpoint='/path/to/parameters',
+    label_map=label_map)
+results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False)
+for idx, text in enumerate(data):
+    print('Data: {} \t Label: {}'.format(text[0], results[idx]))
+```
+
+See the PaddleHub text classification demo: https://github.com/PaddlePaddle/PaddleHub/tree/release/v2.0.0-beta/demo/text_classifcation
+
+## Serving deployment
+
+PaddleHub Serving can deploy an online service for retrieving pretrained embeddings.
+
+### Step1: Start PaddleHub Serving
+
+Run the start command:
+
+```shell
+$ hub serving start -m bert-base-uncased
+```
+
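+This deploys the pretrained-embedding service API; the default port is 8866.
+
+**NOTE:** To predict on GPU, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise no setting is needed.
+
+### Step2: Send a prediction request
+
+With the server running, the few lines below send a prediction request and fetch the result.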
+
+```python
+import requests
+import json
+
+# Texts to embed; each sample is a list of one or two texts
+text = [["今天是个好日子", "天气预报说今天要下雨"], ["这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般"]]
+# The key names the argument of the prediction method, here "texts";
+# the local equivalent is module.get_embedding(texts=text)
+data = {"texts": text}
+# Endpoint of the serving process; replace the host with your server's address
+url = "http://10.12.121.132:8866/predict/bert-base-uncased"
+# Send the POST body as JSON
+headers = {"Content-Type": "application/json"}
+
+r = requests.post(url=url, headers=headers, data=json.dumps(data))
+print(r.json())
+```
+
+## View code
+
+https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/pretrain_langauge_models/BERT
+
+
+## Dependencies
+
+paddlepaddle >= 2.0.0
+
+paddlehub >= 2.0.0
+
+## Release history
+
+* 1.0.0
+
+  Initial release.
+
+* 1.1.0
+
+  Added `get_embedding` and `get_params_layer`.
+
+* 2.0.0
+
+  Fully upgraded to dynamic graph (dygraph); the interface has changed.
--- a/modules/text/language_model/bert_chinese_L_12_H_768_A_12/__init__.py
+++ b/modules/text/language_model/bert_chinese_L_12_H_768_A_12/__init__.py
--- a/modules/text/language_model/bert-base-uncased/module.py
+++ b/modules/text/language_model/bert-base-uncased/module.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import Dict, List, Optional, Union, Tuple
+import os
+
+from paddle.dataset.common import DATA_HOME
+import paddle
+import paddle.nn as nn
+import paddle.nn.functional as F
+
+from paddlehub import BertTokenizer
+from paddlehub.module.modeling_bert import BertForSequenceClassification, BertModel
+from paddlehub.module.module import moduleinfo, serving
+from paddlehub.utils.log import logger
+from paddlehub.utils.utils import download
+
+
+@moduleinfo(
+    name="bert-base-uncased",
+    version="2.0.0",
+    summary=
+    "bert_uncased_L-12_H-768_A-12, 12-layer, 768-hidden, 12-heads, 110M parameters. The module is executed as paddle.dygraph.",
+    author="paddlepaddle",
+    author_email="",
+    type="nlp/semantic_model")
+class Bert(nn.Layer):
+    """
+    BERT model
+    """
+
+    def __init__(
+            self,
+            task=None,
+            load_checkpoint=None,
+            label_map=None,
+    ):
+        super(Bert, self).__init__()
+        # TODO(zhangxuefei): add token_classification task
+        if task == 'sequence_classification':
+            self.model = BertForSequenceClassification.from_pretrained(
+                pretrained_model_name_or_path='bert-base-uncased')
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = paddle.metric.Accuracy(name='acc_accumulation')
+        elif task is None:
+            self.model = BertModel.from_pretrained(pretrained_model_name_or_path='bert-base-uncased')
+        else:
+            raise RuntimeError("Unknown task %s, task should be sequence_classification" % task)
+
+        self.task = task
+        self.label_map = label_map
+
+        if load_checkpoint is not None and os.path.isfile(load_checkpoint):
+            state_dict = paddle.load(load_checkpoint)
+            self.set_state_dict(state_dict)
+            logger.info('Loaded parameters from %s' % os.path.abspath(load_checkpoint))
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, labels=None):
+        result = self.model(input_ids, token_type_ids, position_ids, attention_mask)
+        if self.task is not None:
+            logits = result
+            probs = F.softmax(logits, axis=1)
+            if labels is not None:
+                loss = self.criterion(logits, labels)
+                correct = self.metric.compute(probs, labels)
+                acc = self.metric.update(correct)
+                return probs, loss, acc
+            return probs
+        else:
+            sequence_output, pooled_output = result
+            return sequence_output, pooled_output
+
+    def get_vocab_path(self):
+        """
+        Gets the path of the module vocabulary file, downloading it if it is missing.
+        """
+        save_path = os.path.join(DATA_HOME, 'bert-base-uncased', 'bert-base-uncased-vocab.txt')
+        if not os.path.exists(save_path) or not os.path.isfile(save_path):
+            url = "https://paddle-hapi.bj.bcebos.com/models/bert/bert-base-uncased-vocab.txt"
+            download(url, os.path.join(DATA_HOME, 'bert-base-uncased'))
+        return save_path
+
+    def get_tokenizer(self, tokenize_chinese_chars=True):
+        """
+        Gets the tokenizer that is customized for this module.
+        Args:
+            tokenize_chinese_chars (:obj: bool , defaults to :obj: True):
+                Whether to tokenize chinese characters or not.
+        Returns:
+            tokenizer (:obj:BertTokenizer) : The tokenizer which was customized for this module.
+        """
+        return BertTokenizer(tokenize_chinese_chars=tokenize_chinese_chars, vocab_file=self.get_vocab_path())
+
+    def training_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        One step for training, which should be called as forward computation.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data holding the model inputs:
+                input_ids, segment_ids and labels, in that order.
+            batch_idx(int): The index of batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as loss and metrics.
+        """
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'loss': avg_loss, 'metrics': {'acc': acc}}
+
+    def validation_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        One step for validation, which should be called as forward computation.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data holding the model inputs:
+                input_ids, segment_ids and labels, in that order.
+            batch_idx(int): The index of batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as metrics.
+        """
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'metrics': {'acc': acc}}
+
+    def predict(self, data, max_seq_len=128, batch_size=1, use_gpu=False):
+        """
+        Predicts the data labels.
+
+        Args:
+            data (obj:`List[List[str]]`): The samples to predict; each element is a list of one or two raw texts.
+            max_seq_len (:obj:`int`, `optional`, defaults to 128):
+                The maximum length of the encoded sequence.
+            batch_size(obj:`int`, defaults to 1): The number of samples per batch.
+            use_gpu(obj:`bool`, defaults to `False`): Whether to use gpu to run or not.
+
+        Returns:
+            results(obj:`list`): The predicted labels, one per sample.
+        """
+        # TODO(zhangxuefei): add task token_classification task predict.
+        if self.task not in ['sequence_classification']:
+            raise RuntimeError("The predict method is for sequence_classification task, but got task %s." % self.task)
+
+        paddle.set_device('gpu' if use_gpu else 'cpu')
+        tokenizer = self.get_tokenizer()
+
+        examples = []
+        for text in data:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, max_seq_len=max_seq_len)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], max_seq_len=max_seq_len)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequence, but got %d. Please check your inputs.' % len(text))
+            examples.append((encoded_inputs['input_ids'], encoded_inputs['segment_ids']))
+
+        def _batchify_fn(batch):
+            input_ids = [entry[0] for entry in batch]
+            segment_ids = [entry[1] for entry in batch]
+            return input_ids, segment_ids
+
+        # Separates the examples into batches.
+        batches = []
+        one_batch = []
+        for example in examples:
+            one_batch.append(example)
+            if len(one_batch) == batch_size:
+                batches.append(one_batch)
+                one_batch = []
+        if one_batch:
+            # The last batch whose size is less than the config batch_size setting.
+            batches.append(one_batch)
+
+        results = []
+        self.eval()
+        for batch in batches:
+            input_ids, segment_ids = _batchify_fn(batch)
+            input_ids = paddle.to_tensor(input_ids)
+            segment_ids = paddle.to_tensor(segment_ids)
+
+            # TODO(zhangxuefei): add task token_classification postprocess after prediction.
+            if self.task == 'sequence_classification':
+                probs = self(input_ids, segment_ids)
+                idx = paddle.argmax(probs, axis=1).numpy()
+                idx = idx.tolist()
+                labels = [self.label_map[i] for i in idx]
+                results.extend(labels)
+
+        return results
+
+    @serving
+    def get_embedding(self, texts, use_gpu=False):
+        if self.task is not None:
+            raise RuntimeError("The get_embedding method is only valid when task is None, but got task %s" % self.task)
+
+        paddle.set_device('gpu' if use_gpu else 'cpu')
+        # Switch to evaluation mode so that dropout does not perturb the embeddings.
+        self.eval()
+
+        tokenizer = self.get_tokenizer()
+        results = []
+        for text in texts:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, pad_to_max_seq_len=False)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], pad_to_max_seq_len=False)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequence, but got %d. Please check your inputs.' % len(text))
+
+            input_ids = paddle.to_tensor(encoded_inputs['input_ids']).unsqueeze(0)
+            segment_ids = paddle.to_tensor(encoded_inputs['segment_ids']).unsqueeze(0)
+            sequence_output, pooled_output = self(input_ids, segment_ids)
+
+            sequence_output = sequence_output.squeeze(0)
+            pooled_output = pooled_output.squeeze(0)
+            results.append((sequence_output.numpy().tolist(), pooled_output.numpy().tolist()))
+        return results
--- a/modules/text/language_model/bert-large-cased/README.md
+++ b/modules/text/language_model/bert-large-cased/README.md
+```shell
+$ hub install bert-large-cased==2.0.0
+```
+<p align="center">
+<img src="https://bj.bcebos.com/paddlehub/paddlehub-img/bert_network.png"  hspace='10'/> <br />
+</p>
+
+For more details, see the [BERT paper](https://arxiv.org/abs/1810.04805).
+
+## API
+
+```python
+def __init__(
+    task=None,
+    load_checkpoint=None,
+    label_map=None)
+```
+
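+Creates a Module object (dynamic graph version).
+
+**Parameters**
+
+* `task`: task name. Currently only `sequence_classification` is supported; leave it as `None` (the default) to load the plain pretrained model for `get_embedding`.
+* `load_checkpoint`: path to a parameter file saved by training with the PaddleHub Fine-tune API.
+* `label_map`: mapping from class index to label name, used at prediction time.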
+
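In this commit's module.py, `predict` takes the argmax of each row of class probabilities and looks the index up in `label_map`. A minimal pure-Python sketch of that step, with illustrative probabilities; `probs_to_labels` is a hypothetical name, not a PaddleHub function:

```python
from typing import Dict, List, Sequence


def probs_to_labels(probs: Sequence[Sequence[float]], label_map: Dict[int, str]) -> List[str]:
    """Map each row of class probabilities to the label of its highest-scoring class."""
    labels = []
    for row in probs:
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(label_map[best])
    return labels


label_map = {0: 'negative', 1: 'positive'}
print(probs_to_labels([[0.9, 0.1], [0.2, 0.8]], label_map))  # ['negative', 'positive']
```

The keys of `label_map` therefore need to cover every class index the fine-tuned model can output.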
+```python
+def predict(
+    data,
+    max_seq_len=128,
+    batch_size=1,
+    use_gpu=False)
+```
+
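+**Parameters**
+
+* `data`: data to predict, in the form \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample
+    and each sample contains text\_a and, optionally, text\_b. The number of texts per sample (1 or 2) must match what was used in training.
+* `max_seq_len`: maximum sequence length the model processes.
+* `batch_size`: batch size used for prediction.
+* `use_gpu`: whether to run on GPU; defaults to False. GPU users are advised to enable it.
+
+**Returns**
+
+* `results`: list of predicted labels, one per sample, mapped through `label_map`.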
+
+```python
+def get_embedding(
+    texts,
+    use_gpu=False
+)
+```
+
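+Returns sentence-level and token-level features for the input texts. Only valid when the module is created with `task=None`.
+
+**Parameters**
+
+* `texts`: list of input texts, in the form \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample contains text\_a and, optionally, text\_b.
+* `use_gpu`: whether to run on GPU; defaults to False. GPU users are advised to enable it.
+
+**Returns**
+
+* `results`: a list in the form \[\[sample\_a\_seq\_feature, sample\_a\_pooled\_feature\], \[sample\_b\_seq\_feature, sample\_b\_pooled\_feature\],…,\], where each element holds the features of the corresponding sample: the token-level feature seq\_feature followed by the sentence-level feature pooled\_feature.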
+
+
+**Code example**
+
+```python
+import paddlehub as hub
+
+# Each sample is a list holding one text (or two for a text pair).
+data = [
+    ['这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般'],
+    ['怀着十分激动的心情放映，可是看着看着发现，在放映完毕后，出现一集米老鼠的动画片'],
+    ['作为老的四星酒店，房间依然很整洁，相当不错。机场接机服务很好，可以在车上办理入住手续，节省时间。'],
+]
+label_map = {0: 'negative', 1: 'positive'}
+
+model = hub.Module(
+    name='bert-large-cased',
+    version='2.0.0',
+    task='sequence_classification',
+    load_checkpoint='/path/to/parameters',
+    label_map=label_map)
+results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False)
+for idx, text in enumerate(data):
+    print('Data: {} \t Label: {}'.format(text, results[idx]))
+```
+
+See the PaddleHub text classification demo: https://github.com/PaddlePaddle/PaddleHub/tree/release/v2.0.0-beta/demo/text_classification
+
+## Serving deployment
+
+PaddleHub Serving can deploy an online service for retrieving pretrained embeddings.
+
+### Step1: Start PaddleHub Serving
+
+Run the start command:
+
+```shell
+$ hub serving start -m bert-large-cased
+```
+
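+This deploys the embedding service API; the default port is 8866.
+
+**NOTE:** To predict on GPU, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise no setting is needed.
+
+### Step2: Send a prediction request
+
+With the server running, the few lines below send a prediction request and print the result.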
+
+```python
+import requests
+import json
+
+# Texts to embed; each sample is a list of one or two texts
+text = [["今天是个好日子", "天气预报说今天要下雨"], ["这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般"]]
+# The key names the argument of the prediction method, here "texts";
+# the local equivalent is module.get_embedding(texts=text)
+data = {"texts": text}
+# Endpoint of the serving process; replace the host with your server's address
+url = "http://10.12.121.132:8866/predict/bert-large-cased"
+# Send the POST body as JSON
+headers = {"Content-Type": "application/json"}
+
+r = requests.post(url=url, headers=headers, data=json.dumps(data))
+print(r.json())
+```
+
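The request above hardcodes one server address. A minimal sketch of assembling the same request for any host, assuming the `/predict/<module name>` path and default port 8866 described in this README; `build_embedding_request` is a hypothetical helper, not part of PaddleHub:

```python
import json
from typing import Dict, List, Tuple


def build_embedding_request(module_name: str,
                            texts: List[List[str]],
                            host: str = '127.0.0.1',
                            port: int = 8866) -> Tuple[str, Dict[str, str], str]:
    """Return the URL, headers and JSON body for a get_embedding serving call."""
    url = 'http://%s:%d/predict/%s' % (host, port, module_name)
    headers = {'Content-Type': 'application/json'}
    # The key "texts" matches the keyword argument of get_embedding.
    body = json.dumps({'texts': texts})
    return url, headers, body


url, headers, body = build_embedding_request('bert-large-cased', [['hello world']])
print(url)  # http://127.0.0.1:8866/predict/bert-large-cased
```

The three values can be passed straight to `requests.post(url=url, headers=headers, data=body)`.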
+## View code
+
+https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/pretrain_langauge_models/BERT
+
+
+## Dependencies
+
+paddlepaddle >= 2.0.0
+
+paddlehub >= 2.0.0
+
+## Release history
+
+* 1.0.0
+
+  Initial release.
+
+* 1.1.0
+
+  Added `get_embedding` and `get_params_layer`.
+
+* 2.0.0
+
+  Fully upgraded to dynamic graph (dygraph); the interface has changed.
--- a/modules/text/language_model/bert_chinese_L_12_H_768_A_12/model/__init__.py
+++ b/modules/text/language_model/bert_chinese_L_12_H_768_A_12/model/__init__.py
--- a/modules/text/language_model/bert-large-cased/module.py
+++ b/modules/text/language_model/bert-large-cased/module.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import Dict, List, Optional, Union, Tuple
+import os
+
+from paddle.dataset.common import DATA_HOME
+import paddle
+import paddle.nn as nn
+import paddle.nn.functional as F
+
+from paddlehub import BertTokenizer
+from paddlehub.module.modeling_bert import BertForSequenceClassification, BertModel
+from paddlehub.module.module import moduleinfo, serving
+from paddlehub.utils.log import logger
+from paddlehub.utils.utils import download
+
+
+@moduleinfo(
+    name="bert-large-cased",
+    version="2.0.0",
+    summary=
+    "bert_cased_L-24_H-1024_A-16, 24-layer, 1024-hidden, 16-heads, 340M parameters. The module is executed as paddle.dygraph.",
+    author="paddlepaddle",
+    author_email="",
+    type="nlp/semantic_model")
+class Bert(nn.Layer):
+    """
+    BERT model
+    """
+
+    def __init__(
+            self,
+            task=None,
+            load_checkpoint=None,
+            label_map=None,
+    ):
+        super(Bert, self).__init__()
+        # TODO(zhangxuefei): add token_classification task
+        if task == 'sequence_classification':
+            self.model = BertForSequenceClassification.from_pretrained(pretrained_model_name_or_path='bert-large-cased')
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = paddle.metric.Accuracy(name='acc_accumulation')
+        elif task is None:
+            self.model = BertModel.from_pretrained(pretrained_model_name_or_path='bert-large-cased')
+        else:
+            raise RuntimeError("Unknown task %s, task should be sequence_classification" % task)
+
+        self.task = task
+        self.label_map = label_map
+
+        if load_checkpoint is not None and os.path.isfile(load_checkpoint):
+            state_dict = paddle.load(load_checkpoint)
+            self.set_state_dict(state_dict)
+            logger.info('Loaded parameters from %s' % os.path.abspath(load_checkpoint))
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, labels=None):
+        result = self.model(input_ids, token_type_ids, position_ids, attention_mask)
+        if self.task is not None:
+            logits = result
+            probs = F.softmax(logits, axis=1)
+            if labels is not None:
+                loss = self.criterion(logits, labels)
+                correct = self.metric.compute(probs, labels)
+                acc = self.metric.update(correct)
+                return probs, loss, acc
+            return probs
+        else:
+            sequence_output, pooled_output = result
+            return sequence_output, pooled_output
+
+    def get_vocab_path(self):
+        """
+        Gets the path of the module vocabulary file, downloading it if it is missing.
+        """
+        save_path = os.path.join(DATA_HOME, 'bert-large-cased', 'bert-large-cased-vocab.txt')
+        if not os.path.exists(save_path) or not os.path.isfile(save_path):
+            url = "https://paddle-hapi.bj.bcebos.com/models/bert/bert-large-cased-vocab.txt"
+            download(url, os.path.join(DATA_HOME, 'bert-large-cased'))
+        return save_path
+
+    def get_tokenizer(self, tokenize_chinese_chars=True):
+        """
+        Gets the tokenizer that is customized for this module.
+        Args:
+            tokenize_chinese_chars (:obj: bool , defaults to :obj: True):
+                Whether to tokenize chinese characters or not.
+        Returns:
+            tokenizer (:obj:BertTokenizer) : The tokenizer which was customized for this module.
+        """
+        return BertTokenizer(tokenize_chinese_chars=tokenize_chinese_chars, vocab_file=self.get_vocab_path())
+
+    def training_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        One step for training, which should be called as forward computation.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data holding the model inputs:
+                input_ids, segment_ids and labels, in that order.
+            batch_idx(int): The index of batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as loss and metrics.
+        """
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'loss': avg_loss, 'metrics': {'acc': acc}}
+
+    def validation_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        One step for validation, which should be called as forward computation.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data holding the model inputs:
+                input_ids, segment_ids and labels, in that order.
+            batch_idx(int): The index of batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as metrics.
+        """
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'metrics': {'acc': acc}}
+
+    def predict(self, data, max_seq_len=128, batch_size=1, use_gpu=False):
+        """
+        Predicts the data labels.
+
+        Args:
+            data (obj:`List[List[str]]`): The samples to predict; each element is a list of one or two raw texts.
+            max_seq_len (:obj:`int`, `optional`, defaults to 128):
+                The maximum length of the encoded sequence.
+            batch_size(obj:`int`, defaults to 1): The number of samples per batch.
+            use_gpu(obj:`bool`, defaults to `False`): Whether to use gpu to run or not.
+
+        Returns:
+            results(obj:`list`): The predicted labels, one per sample.
+        """
+        # TODO(zhangxuefei): add task token_classification task predict.
+        if self.task not in ['sequence_classification']:
+            raise RuntimeError("The predict method is for sequence_classification task, but got task %s." % self.task)
+
+        paddle.set_device('gpu' if use_gpu else 'cpu')
+        tokenizer = self.get_tokenizer()
+
+        examples = []
+        for text in data:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, max_seq_len=max_seq_len)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], max_seq_len=max_seq_len)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequence, but got %d. Please check your inputs.' % len(text))
+            examples.append((encoded_inputs['input_ids'], encoded_inputs['segment_ids']))
+
+        def _batchify_fn(batch):
+            input_ids = [entry[0] for entry in batch]
+            segment_ids = [entry[1] for entry in batch]
+            return input_ids, segment_ids
+
+        # Separates the examples into batches.
+        batches = []
+        one_batch = []
+        for example in examples:
+            one_batch.append(example)
+            if len(one_batch) == batch_size:
+                batches.append(one_batch)
+                one_batch = []
+        if one_batch:
+            # The last batch whose size is less than the config batch_size setting.
+            batches.append(one_batch)
+
+        results = []
+        self.eval()
+        for batch in batches:
+            input_ids, segment_ids = _batchify_fn(batch)
+            input_ids = paddle.to_tensor(input_ids)
+            segment_ids = paddle.to_tensor(segment_ids)
+
+            # TODO(zhangxuefei): add task token_classification postprocess after prediction.
+            if self.task == 'sequence_classification':
+                probs = self(input_ids, segment_ids)
+                idx = paddle.argmax(probs, axis=1).numpy()
+                idx = idx.tolist()
+                labels = [self.label_map[i] for i in idx]
+                results.extend(labels)
+
+        return results
+
+    @serving
+    def get_embedding(self, texts, use_gpu=False):
+        if self.task is not None:
+            raise RuntimeError("The get_embedding method is only valid when task is None, but got task %s" % self.task)
+
+        paddle.set_device('gpu' if use_gpu else 'cpu')
+        # Switch to evaluation mode so that dropout does not perturb the embeddings.
+        self.eval()
+
+        tokenizer = self.get_tokenizer()
+        results = []
+        for text in texts:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, pad_to_max_seq_len=False)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], pad_to_max_seq_len=False)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequence, but got %d. Please check your inputs.' % len(text))
+
+            input_ids = paddle.to_tensor(encoded_inputs['input_ids']).unsqueeze(0)
+            segment_ids = paddle.to_tensor(encoded_inputs['segment_ids']).unsqueeze(0)
+            sequence_output, pooled_output = self(input_ids, segment_ids)
+
+            sequence_output = sequence_output.squeeze(0)
+            pooled_output = pooled_output.squeeze(0)
+            results.append((sequence_output.numpy().tolist(), pooled_output.numpy().tolist()))
+        return results
--- a/modules/text/language_model/bert-large-uncased/README.md
+++ b/modules/text/language_model/bert-large-uncased/README.md
+```shell
+$ hub install bert-large-uncased==2.0.0
+```
+<p align="center">
+<img src="https://bj.bcebos.com/paddlehub/paddlehub-img/bert_network.png"  hspace='10'/> <br />
+</p>
+
+For more details, see the [BERT paper](https://arxiv.org/abs/1810.04805).
+
+## API
+
+```python
+def __init__(
+    task=None,
+    load_checkpoint=None,
+    label_map=None)
+```
+
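+Creates a Module object (dynamic graph version).
+
+**Parameters**
+
+* `task`: task name. Currently only `sequence_classification` is supported; leave it as `None` (the default) to load the plain pretrained model for `get_embedding`.
+* `load_checkpoint`: path to a parameter file saved by training with the PaddleHub Fine-tune API.
+* `label_map`: mapping from class index to label name, used at prediction time.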
+
+```python
+def predict(
+    data,
+    max_seq_len=128,
+    batch_size=1,
+    use_gpu=False)
+```
+
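+**Parameters**
+
+* `data`: data to predict, in the form \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample
+    and each sample contains text\_a and, optionally, text\_b. The number of texts per sample (1 or 2) must match what was used in training.
+* `max_seq_len`: maximum sequence length the model processes.
+* `batch_size`: batch size used for prediction.
+* `use_gpu`: whether to run on GPU; defaults to False. GPU users are advised to enable it.
+
+**Returns**
+
+* `results`: list of predicted labels, one per sample, mapped through `label_map`.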
+
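The `batch_size` argument controls how the encoded samples are grouped before each forward pass; the module.py in this commit builds the groups with an explicit loop and keeps a final, smaller batch. A compact sketch of the same grouping, where `split_batches` is a hypothetical name rather than a PaddleHub function:

```python
from typing import List, Sequence, TypeVar

T = TypeVar('T')


def split_batches(examples: Sequence[T], batch_size: int) -> List[List[T]]:
    """Group examples into consecutive batches; the last one may be smaller."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    return [list(examples[i:i + batch_size]) for i in range(0, len(examples), batch_size)]


print(split_batches(list(range(5)), 2))  # [[0, 1], [2, 3], [4]]
```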
+```python
+def get_embedding(
+    texts,
+    use_gpu=False
+)
+```
+
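+Returns sentence-level and token-level features for the input texts. Only valid when the module is created with `task=None`.
+
+**Parameters**
+
+* `texts`: list of input texts, in the form \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample contains text\_a and, optionally, text\_b.
+* `use_gpu`: whether to run on GPU; defaults to False. GPU users are advised to enable it.
+
+**Returns**
+
+* `results`: a list in the form \[\[sample\_a\_seq\_feature, sample\_a\_pooled\_feature\], \[sample\_b\_seq\_feature, sample\_b\_pooled\_feature\],…,\], where each element holds the features of the corresponding sample: the token-level feature seq\_feature followed by the sentence-level feature pooled\_feature.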
+
+
+**Code example**
+
+```python
+import paddlehub as hub
+
+# Each sample is a list holding one text (or two for a text pair).
+data = [
+    ['这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般'],
+    ['怀着十分激动的心情放映，可是看着看着发现，在放映完毕后，出现一集米老鼠的动画片'],
+    ['作为老的四星酒店，房间依然很整洁，相当不错。机场接机服务很好，可以在车上办理入住手续，节省时间。'],
+]
+label_map = {0: 'negative', 1: 'positive'}
+
+model = hub.Module(
+    name='bert-large-uncased',
+    version='2.0.0',
+    task='sequence_classification',
+    load_checkpoint='/path/to/parameters',
+    label_map=label_map)
+results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False)
+for idx, text in enumerate(data):
+    print('Data: {} \t Label: {}'.format(text, results[idx]))
+```
+
+See the PaddleHub text classification demo: https://github.com/PaddlePaddle/PaddleHub/tree/release/v2.0.0-beta/demo/text_classification
+
+## Serving deployment
+
+PaddleHub Serving can deploy an online service for retrieving pretrained embeddings.
+
+### Step1: Start PaddleHub Serving
+
+Run the start command:
+
+```shell
+$ hub serving start -m bert-large-uncased
+```
+
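+This deploys the embedding service API; the default port is 8866.
+
+**NOTE:** To predict on GPU, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise no setting is needed.
+
+### Step2: Send a prediction request
+
+With the server running, the few lines below send a prediction request and print the result.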
+
+```python
+import requests
+import json
+
+# Texts to embed; each sample is a list of one or two texts
+text = [["今天是个好日子", "天气预报说今天要下雨"], ["这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般"]]
+# The key names the argument of the prediction method, here "texts";
+# the local equivalent is module.get_embedding(texts=text)
+data = {"texts": text}
+# Endpoint of the serving process; replace the host with your server's address
+url = "http://10.12.121.132:8866/predict/bert-large-uncased"
+# Send the POST body as JSON
+headers = {"Content-Type": "application/json"}
+
+r = requests.post(url=url, headers=headers, data=json.dumps(data))
+print(r.json())
+```
+
+## View code
+
+https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/pretrain_langauge_models/BERT
+
+
+## Dependencies
+
+paddlepaddle >= 2.0.0
+
+paddlehub >= 2.0.0
+
+## Release history
+
+* 1.0.0
+
+  Initial release.
+
+* 1.1.0
+
+  Added `get_embedding` and `get_params_layer`.
+
+* 2.0.0
+
+  Fully upgraded to dynamic graph (dygraph); the interface has changed.
--- a/modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/__init__.py
+++ b/modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/__init__.py
--- a/modules/text/language_model/bert-large-uncased/module.py
+++ b/modules/text/language_model/bert-large-uncased/module.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import Dict, List, Optional, Union, Tuple
+import os
+
+from paddle.dataset.common import DATA_HOME
+import paddle
+import paddle.nn as nn
+import paddle.nn.functional as F
+
+from paddlehub import BertTokenizer
+from paddlehub.module.modeling_bert import BertForSequenceClassification, BertModel
+from paddlehub.module.module import moduleinfo, serving
+from paddlehub.utils.log import logger
+from paddlehub.utils.utils import download
+
+
+@moduleinfo(
+    name="bert-large-uncased",
+    version="2.0.0",
+    summary=
+    "bert_uncased_L-24_H-1024_A-16, 24-layer, 1024-hidden, 16-heads, 340M parameters. The module is executed as paddle.dygraph.",
+    author="paddlepaddle",
+    author_email="",
+    type="nlp/semantic_model")
+class Bert(nn.Layer):
+    """
+    BERT model
+    """
+
+    def __init__(
+            self,
+            task=None,
+            load_checkpoint=None,
+            label_map=None,
+    ):
+        super(Bert, self).__init__()
+        # TODO(zhangxuefei): add token_classification task
+        if task == 'sequence_classification':
+            self.model = BertForSequenceClassification.from_pretrained(
+                pretrained_model_name_or_path='bert-large-uncased')
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = paddle.metric.Accuracy(name='acc_accumulation')
+        elif task is None:
+            self.model = BertModel.from_pretrained(pretrained_model_name_or_path='bert-large-uncased')
+        else:
+            raise RuntimeError("Unknown task %s, task should be sequence_classification" % task)
+
+        self.task = task
+        self.label_map = label_map
+
+        if load_checkpoint is not None and os.path.isfile(load_checkpoint):
+            state_dict = paddle.load(load_checkpoint)
+            self.set_state_dict(state_dict)
+            logger.info('Loaded parameters from %s' % os.path.abspath(load_checkpoint))
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, labels=None):
+        result = self.model(input_ids, token_type_ids, position_ids, attention_mask)
+        if self.task is not None:
+            logits = result
+            probs = F.softmax(logits, axis=1)
+            if labels is not None:
+                loss = self.criterion(logits, labels)
+                correct = self.metric.compute(probs, labels)
+                acc = self.metric.update(correct)
+                return probs, loss, acc
+            return probs
+        else:
+            sequence_output, pooled_output = result
+            return sequence_output, pooled_output
+
+    def get_vocab_path(self):
+        """
+        Gets the path of the module vocabulary file, downloading it if it is missing.
+        """
+        save_path = os.path.join(DATA_HOME, 'bert-large-uncased', 'bert-large-uncased-vocab.txt')
+        if not os.path.exists(save_path) or not os.path.isfile(save_path):
+            url = "https://paddle-hapi.bj.bcebos.com/models/bert/bert-large-uncased-vocab.txt"
+            download(url, os.path.join(DATA_HOME, 'bert-large-uncased'))
+        return save_path
+
+    def get_tokenizer(self, tokenize_chinese_chars=True):
+        """
+        Gets the tokenizer that is customized for this module.
+        Args:
+            tokenize_chinese_chars (:obj: bool , defaults to :obj: True):
+                Whether to tokenize Chinese characters or not.
+        Returns:
+            tokenizer (:obj:BertTokenizer) : The tokenizer which was customized for this module.
+        """
+        return BertTokenizer(tokenize_chinese_chars=tokenize_chinese_chars, vocab_file=self.get_vocab_path())
+
+    def training_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        Runs one training step, i.e. one forward computation on a batch.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data holding the tensors the model needs,
+                in the order input_ids, token_type_ids and labels.
+            batch_idx(int): The index of batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as loss and metrics.
+        """
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'loss': avg_loss, 'metrics': {'acc': acc}}
+
+    def validation_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        Runs one validation step, i.e. one forward computation on a batch.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data holding the tensors the model needs,
+                in the order input_ids, token_type_ids and labels.
+            batch_idx(int): The index of batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as metrics.
+        """
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'metrics': {'acc': acc}}
+
+    def predict(self, data, max_seq_len=128, batch_size=1, use_gpu=False):
+        """
+        Predicts the data labels.
+
+        Args:
+            data (obj:`List(List(str))`): The input data. Each element is a list holding one raw text
+                or a pair of raw texts.
+            max_seq_len (:obj:`int`, `optional`, defaults to :int:`128`):
+                The maximum length of each encoded sequence.
+            batch_size(obj:`int`, defaults to 1): The number of samples per batch.
+            use_gpu(obj:`bool`, defaults to `False`): Whether to use gpu to run or not.
+
+        Returns:
+            results(obj:`list`): All the predicted labels.
+        """
+        # TODO(zhangxuefei): add predict support for the token_classification task.
+        if self.task not in ['sequence_classification']:
+            raise RuntimeError("The predict method is for sequence_classification task, but got task %s." % self.task)
+
+        paddle.set_device('gpu' if use_gpu else 'cpu')
+        tokenizer = self.get_tokenizer()
+
+        examples = []
+        for text in data:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, max_seq_len=max_seq_len)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], max_seq_len=max_seq_len)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequences, but got %d. Please check your inputs.' % len(text))
+            examples.append((encoded_inputs['input_ids'], encoded_inputs['segment_ids']))
+
+        def _batchify_fn(batch):
+            input_ids = [entry[0] for entry in batch]
+            segment_ids = [entry[1] for entry in batch]
+            return input_ids, segment_ids
+
+        # Separates data into batches.
+        batches = []
+        one_batch = []
+        for example in examples:
+            one_batch.append(example)
+            if len(one_batch) == batch_size:
+                batches.append(one_batch)
+                one_batch = []
+        if one_batch:
+            # The last batch whose size is less than the config batch_size setting.
+            batches.append(one_batch)
+
+        results = []
+        self.eval()
+        for batch in batches:
+            input_ids, segment_ids = _batchify_fn(batch)
+            input_ids = paddle.to_tensor(input_ids)
+            segment_ids = paddle.to_tensor(segment_ids)
+
+            # TODO(zhangxuefei): add task token_classification postprocess after prediction.
+            if self.task == 'sequence_classification':
+                probs = self(input_ids, segment_ids)
+                idx = paddle.argmax(probs, axis=1).numpy()
+                idx = idx.tolist()
+                labels = [self.label_map[i] for i in idx]
+                results.extend(labels)
+
+        return results
+
+    @serving
+    def get_embedding(self, texts, use_gpu=False):
+        if self.task is not None:
+            raise RuntimeError("The get_embedding method is only valid when task is None, but got task %s" % self.task)
+
+        paddle.set_device('gpu' if use_gpu else 'cpu')
+
+        tokenizer = self.get_tokenizer()
+        results = []
+        for text in texts:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, pad_to_max_seq_len=False)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], pad_to_max_seq_len=False)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequences, but got %d. Please check your inputs.' % len(text))
+
+            input_ids = paddle.to_tensor(encoded_inputs['input_ids']).unsqueeze(0)
+            segment_ids = paddle.to_tensor(encoded_inputs['segment_ids']).unsqueeze(0)
+            sequence_output, pooled_output = self(input_ids, segment_ids)
+
+            sequence_output = sequence_output.squeeze(0)
+            pooled_output = pooled_output.squeeze(0)
+            results.append((sequence_output.numpy().tolist(), pooled_output.numpy().tolist()))
+        return results
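The batching loop in `predict` above groups the encoded examples into lists of at most `batch_size` items, leaving a smaller final batch when the count does not divide evenly. A minimal standalone sketch of that grouping follows; `split_into_batches` is a hypothetical helper name used only for illustration and is not part of the module.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(examples: Sequence[T], batch_size: int) -> List[List[T]]:
    """Group examples into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        # Guard added for the sketch; the original loop does not validate batch_size.
        raise ValueError("batch_size must be a positive integer")
    batches = []
    one_batch = []
    for example in examples:
        one_batch.append(example)
        if len(one_batch) == batch_size:
            batches.append(one_batch)
            one_batch = []
    if one_batch:
        # The final batch may hold fewer than batch_size examples.
        batches.append(one_batch)
    return batches


if __name__ == "__main__":
    print(split_into_batches(list(range(5)), batch_size=2))  # [[0, 1], [2, 3], [4]]
```

Because the last batch can be shorter, any code that consumes the batches should not assume a fixed batch dimension.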
--- a/modules/text/language_model/bert_cased_L_12_H_768_A_12/README.md
+++ b/modules/text/language_model/bert_cased_L_12_H_768_A_12/README.md
-```shell
-$ hub install bert_cased_L-12_H-768_A-12==1.1.0
-```
-<p align="center">
-<img src="https://bj.bcebos.com/paddlehub/paddlehub-img/bert_network.png"  hspace='10'/> <br />
-</p>
-
-For more details, see the [BERT paper](https://arxiv.org/abs/1810.04805).
-
-## API
-```python
-def context(
-    trainable=True,
-    max_seq_len=128
-)
-```
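-Gets the Module's context: its inputs, outputs, and a copy of the pretrained Paddle Program.  
-
-**Parameters**  
-
-> trainable: when set to True, the Module's parameters are also trained during fine-tuning; otherwise they stay fixed.  
-> max_seq_len: the maximum sequence length of the BERT model. Shorter sequences are padded to **max_seq_len** and longer ones are truncated to **max_seq_len**. Valid values range from 0 to 512.  
-
-**Returns**  
-> inputs: a dict with the following fields:  
-> >**input_ids**: the word ids in the BERT vocabulary of each token of the tokenized input text; shape \[batch_size, max_seq_len\], dtype int64.  
-> >**position_ids**: the position of each token within the tokenized input text; shape \[batch_size, max_seq_len\], dtype int64.  
-> >**segment_ids**: which text each token belongs to (text 1 or text 2); shape \[batch_size, max_seq_len\], dtype int64.  
-> >**input_mask**: marks whether each token is padding; shape \[batch_size, max_seq_len\], dtype int64.  
->
-> outputs: a dict holding the Module's output features, with the following fields:  
-> >**pooled_output**: sentence-level features, usable for tasks such as text classification; shape \[batch_size, 768\], dtype int64.  
-> >**sequence_output**: token-level features, usable for tasks such as sequence labeling; shape \[batch_size, seq_len, 768\], dtype int64.  
->
-> program: the Program containing the Module's computation graph.  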
-
-
-```python
-def get_embedding(
-    texts,
-    use_gpu=False,
-    batch_size=1
-)
-```
-
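-Gets the sentence-level and token-level features of the input texts.
-
-**Parameters**
-
-> texts: a list of input texts in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b.
-> use_gpu: whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.
-
-**Returns**
-
-> results: a list in the format \[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\], where each element is the feature output of the corresponding sample; each sample has a sentence-level feature pooled\_feature and a token-level feature seq\_feature.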
->
-
-```python
-def get_params_layer()
-```
-
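-Gets the layer index of each parameter. Used together with ULMFiTStrategy, it lets layer-wise learning rates and gradual unfreezing follow the actual layer numbering.
-
-**Parameters**
-
-> None
-
-**Returns**
-
-> params_layer: a dict whose keys are parameter names and whose values are the layer each parameter belongs to
-
-
-**Code example**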
-
-```python
-import paddlehub as hub
-
-# Load the bert_cased_L-12_H-768_A-12 pretrained model (install it with: hub install bert_cased_L-12_H-768_A-12)
-module = hub.Module(name="bert_cased_L-12_H-768_A-12")
-inputs, outputs, program = module.context(trainable=True, max_seq_len=128)
-
-# All of the tensors that the bert_cased_L-12_H-768_A-12 module needs must be fed
-input_ids = inputs["input_ids"]
-position_ids = inputs["position_ids"]
-segment_ids = inputs["segment_ids"]
-input_mask = inputs["input_mask"]
-
-# Use "pooled_output" for sentence-level output.
-pooled_output = outputs["pooled_output"]
-
-# Use "sequence_output" for token-level output.
-sequence_output = outputs["sequence_output"]
-
-# Use "get_embedding" to get embedding result.
-embedding_result = module.get_embedding(texts=[["Sample1_text_a"],["Sample2_text_a","Sample2_text_b"]], use_gpu=True)
-
-# Use "get_params_layer" to get the parameter layers and pass them to ULMFiTStrategy.
-params_layer = module.get_params_layer()
-strategy = hub.finetune.strategy.ULMFiTStrategy(frz_params_layer=params_layer, dis_params_layer=params_layer)
-```
-
-
-## Source code
-
-https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/pretrain_langauge_models/BERT
-
-
-
-## Dependencies
-
-paddlepaddle >= 1.6.2
-
-paddlehub >= 1.6.0
-
-## Release history
-
-* 1.0.0
-
-  Initial release
-
-* 1.1.0
-
-  Added support for get_embedding and get_params_layer
--- a/modules/text/language_model/bert_cased_L_12_H_768_A_12/model/bert.py
+++ b/modules/text/language_model/bert_cased_L_12_H_768_A_12/model/bert.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""BERT model."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import six
-import json
-
-import paddle.fluid as fluid
-
-from bert_cased_L_12_H_768_A_12.model.transformer_encoder import encoder, pre_process_layer
-
-
-class BertConfig(object):
-    def __init__(self, config_path):
-        self._config_dict = self._parse(config_path)
-
-    def _parse(self, config_path):
-        try:
-            with open(config_path) as json_file:
-                config_dict = json.load(json_file)
-        except Exception:
-            raise IOError("Error in parsing bert model config file '%s'" % config_path)
-        else:
-            return config_dict
-
-    def __getitem__(self, key):
-        return self._config_dict[key]
-
-    def print_config(self):
-        for arg, value in sorted(six.iteritems(self._config_dict)):
-            print('%s: %s' % (arg, value))
-        print('------------------------------------------------')
-
-
-class BertModel(object):
-    def __init__(self, src_ids, position_ids, sentence_ids, input_mask, config, weight_sharing=True, use_fp16=False):
-
-        self._emb_size = config['hidden_size']
-        self._n_layer = config['num_hidden_layers']
-        self._n_head = config['num_attention_heads']
-        self._voc_size = config['vocab_size']
-        self._max_position_seq_len = config['max_position_embeddings']
-        self._sent_types = config['type_vocab_size']
-        self._hidden_act = config['hidden_act']
-        self._prepostprocess_dropout = config['hidden_dropout_prob']
-        self._attention_dropout = config['attention_probs_dropout_prob']
-        self._weight_sharing = weight_sharing
-
-        self._word_emb_name = "word_embedding"
-        self._pos_emb_name = "pos_embedding"
-        self._sent_emb_name = "sent_embedding"
-        self._dtype = "float16" if use_fp16 else "float32"
-
-        # Initialize all weights with a truncated normal initializer; all biases
-        # are initialized to constant zero by default.
-        self._param_initializer = fluid.initializer.TruncatedNormal(scale=config['initializer_range'])
-
-        self._build_model(src_ids, position_ids, sentence_ids, input_mask)
-
-    def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
-        # padding id in vocabulary must be set to 0
-        emb_out = fluid.layers.embedding(input=src_ids,
-                                         size=[self._voc_size, self._emb_size],
-                                         dtype=self._dtype,
-                                         param_attr=fluid.ParamAttr(name=self._word_emb_name,
-                                                                    initializer=self._param_initializer),
-                                         is_sparse=False)
-        position_emb_out = fluid.layers.embedding(input=position_ids,
-                                                  size=[self._max_position_seq_len, self._emb_size],
-                                                  dtype=self._dtype,
-                                                  param_attr=fluid.ParamAttr(name=self._pos_emb_name,
-                                                                             initializer=self._param_initializer))
-
-        sent_emb_out = fluid.layers.embedding(sentence_ids,
-                                              size=[self._sent_types, self._emb_size],
-                                              dtype=self._dtype,
-                                              param_attr=fluid.ParamAttr(name=self._sent_emb_name,
-                                                                         initializer=self._param_initializer))
-
-        emb_out = emb_out + position_emb_out
-        emb_out = emb_out + sent_emb_out
-
-        emb_out = pre_process_layer(emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
-
-        if self._dtype == "float16":
-            input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
-
-        self_attn_mask = fluid.layers.matmul(x=input_mask, y=input_mask, transpose_y=True)
-        self_attn_mask = fluid.layers.scale(x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
-        n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
-        n_head_self_attn_mask.stop_gradient = True
-
-        self._enc_out = encoder(enc_input=emb_out,
-                                attn_bias=n_head_self_attn_mask,
-                                n_layer=self._n_layer,
-                                n_head=self._n_head,
-                                d_key=self._emb_size // self._n_head,
-                                d_value=self._emb_size // self._n_head,
-                                d_model=self._emb_size,
-                                d_inner_hid=self._emb_size * 4,
-                                prepostprocess_dropout=self._prepostprocess_dropout,
-                                attention_dropout=self._attention_dropout,
-                                relu_dropout=0,
-                                hidden_act=self._hidden_act,
-                                preprocess_cmd="",
-                                postprocess_cmd="dan",
-                                param_initializer=self._param_initializer,
-                                name='encoder')
-
-    def get_sequence_output(self):
-        return self._enc_out
-
-    def get_pooled_output(self):
-        """Get the first feature of each sequence for classification"""
-
-        next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
-        next_sent_feat = fluid.layers.fc(input=next_sent_feat,
-                                         size=self._emb_size,
-                                         act="tanh",
-                                         param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
-                                                                    initializer=self._param_initializer),
-                                         bias_attr="pooled_fc.b_0")
-        return next_sent_feat
-
-    def get_pretraining_output(self, mask_label, mask_pos, labels):
-        """Get the loss & accuracy for pretraining"""
-
-        mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32')
-
-        # extract the first token feature in each sentence
-        next_sent_feat = self.get_pooled_output()
-        reshaped_emb_out = fluid.layers.reshape(x=self._enc_out, shape=[-1, self._emb_size])
-        # extract masked tokens' feature
-        mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
-
-        # transform: fc
-        mask_trans_feat = fluid.layers.fc(input=mask_feat,
-                                          size=self._emb_size,
-                                          act=self._hidden_act,
-                                          param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
-                                                                     initializer=self._param_initializer),
-                                          bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
-        # transform: layer norm
-        mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
-
-        mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
-                                                initializer=fluid.initializer.Constant(value=0.0))
-        if self._weight_sharing:
-            fc_out = fluid.layers.matmul(x=mask_trans_feat,
-                                         y=fluid.default_main_program().global_block().var(self._word_emb_name),
-                                         transpose_y=True)
-            fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
-                                                    dtype=self._dtype,
-                                                    attr=mask_lm_out_bias_attr,
-                                                    is_bias=True)
-
-        else:
-            fc_out = fluid.layers.fc(input=mask_trans_feat,
-                                     size=self._voc_size,
-                                     param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
-                                                                initializer=self._param_initializer),
-                                     bias_attr=mask_lm_out_bias_attr)
-
-        mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
-        mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
-
-        next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
-                                           size=2,
-                                           param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
-                                                                      initializer=self._param_initializer),
-                                           bias_attr="next_sent_fc.b_0")
-
-        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
-                                                                                    label=labels,
-                                                                                    return_softmax=True)
-
-        next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
-
-        mean_next_sent_loss = fluid.layers.mean(next_sent_loss)
-
-        loss = mean_next_sent_loss + mean_mask_lm_loss
-        return next_sent_acc, mean_mask_lm_loss, loss
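The removed `_build_model` turns the padding mask into an additive attention bias: the product of `input_mask` with its transpose is 1 only where both positions are real tokens, and `scale(..., scale=10000.0, bias=-1.0, bias_after_scale=False)` then computes `10000 * (x - 1)`. Real-token pairs get a bias of 0 and any pair involving padding gets -10000. The sketch below reproduces that arithmetic in plain Python for a single sequence. The function name `attention_bias` and the list-based mask are assumptions made for illustration; the original operates on batched Paddle tensors.

```python
from typing import List


def attention_bias(input_mask: List[float], scale: float = 10000.0, bias: float = -1.0) -> List[List[float]]:
    """Build an additive attention bias from a 0/1 padding mask (1 = real token)."""
    # Outer product: 1.0 only where both the query and the key position are real tokens.
    pairwise = [[a * b for b in input_mask] for a in input_mask]
    # bias_after_scale=False means the bias is added before scaling: scale * (x + bias).
    return [[scale * (value + bias) for value in row] for row in pairwise]


if __name__ == "__main__":
    for row in attention_bias([1.0, 1.0, 0.0]):
        print(row)
```

Adding this bias to the attention logits before the softmax drives the weight of padded positions towards zero.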
--- a/modules/text/language_model/bert_cased_L_12_H_768_A_12/model/transformer_encoder.py
+++ b/modules/text/language_model/bert_cased_L_12_H_768_A_12/model/transformer_encoder.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Transformer encoder."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from functools import partial
-
-import paddle.fluid as fluid
-import paddle.fluid.layers as layers
-
-
-def multi_head_attention(queries,
-                         keys,
-                         values,
-                         attn_bias,
-                         d_key,
-                         d_value,
-                         d_model,
-                         n_head=1,
-                         dropout_rate=0.,
-                         cache=None,
-                         param_initializer=None,
-                         name='multi_head_att'):
-    """
-    Multi-Head Attention. Note that attn_bias is added to the logits before
-    computing the softmax activation to mask selected positions so that
-    they are not considered in the attention weights.
-    """
-    keys = queries if keys is None else keys
-    values = keys if values is None else values
-
-    if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
-        raise ValueError("Inputs: queries, keys and values should all be 3-D tensors.")
-
-    def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
-        """
-        Add linear projection to queries, keys, and values.
-        """
-        q = layers.fc(input=queries,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_query_fc.b_0')
-        k = layers.fc(input=keys,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_key_fc.b_0')
-        v = layers.fc(input=values,
-                      size=d_value * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_value_fc.b_0')
-        return q, k, v
-
-    def __split_heads(x, n_head):
-        """
-        Reshape the last dimension of input tensor x so that it becomes two
-        dimensions and then transpose. Specifically, input a tensor with shape
-        [bs, max_sequence_length, n_head * hidden_dim] then output a tensor
-        with shape [bs, n_head, max_sequence_length, hidden_dim].
-        """
-        hidden_size = x.shape[-1]
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
-
-        # permute the dimensions into:
-        # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
-        return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
-
-    def __combine_heads(x):
-        """
-        Transpose and then reshape the last two dimensions of input tensor x
-        so that it becomes one dimension, which is reverse to __split_heads.
-        """
-        if len(x.shape) == 3: return x
-        if len(x.shape) != 4:
-            raise ValueError("Input(x) should be a 4-D Tensor.")
-
-        trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
-
-    def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
-        """
-        Scaled Dot-Product Attention
-        """
-        scaled_q = layers.scale(x=q, scale=d_key**-0.5)
-        product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
-        if attn_bias:
-            product += attn_bias
-        weights = layers.softmax(product)
-        if dropout_rate:
-            weights = layers.dropout(weights,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-        out = layers.matmul(weights, v)
-        return out
-
-    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
-
-    if cache is not None:  # use cache and concat time steps
-        # Since the inplace reshape in __split_heads changes the shape of k and
-        # v, which is the cache input for next time step, reshape the cache
-        # input from the previous time step first.
-        k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
-        v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
-
-    q = __split_heads(q, n_head)
-    k = __split_heads(k, n_head)
-    v = __split_heads(v, n_head)
-
-    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
-
-    out = __combine_heads(ctx_multiheads)
-
-    # Project back to the model size.
-    proj_out = layers.fc(input=out,
-                         size=d_model,
-                         num_flatten_dims=2,
-                         param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
-                         bias_attr=name + '_output_fc.b_0')
-    return proj_out
-
-
-def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
-    """
-    Position-wise Feed-Forward Networks.
-    This module consists of two linear transformations with a ReLU activation
-    in between, which is applied to each position separately and identically.
-    """
-    hidden = layers.fc(input=x,
-                       size=d_inner_hid,
-                       num_flatten_dims=2,
-                       act=hidden_act,
-                       param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
-                       bias_attr=name + '_fc_0.b_0')
-    if dropout_rate:
-        hidden = layers.dropout(hidden,
-                                dropout_prob=dropout_rate,
-                                dropout_implementation="upscale_in_train",
-                                is_test=False)
-    out = layers.fc(input=hidden,
-                    size=d_hid,
-                    num_flatten_dims=2,
-                    param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
-                    bias_attr=name + '_fc_1.b_0')
-    return out
-
-
-def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
-    """
-    Add residual connection, layer normalization and dropout to the out tensor
-    optionally according to the value of process_cmd.
-    This will be used before or after multi-head attention and position-wise
-    feed-forward networks.
-    """
-    for cmd in process_cmd:
-        if cmd == "a":  # add residual connection
-            out = out + prev_out if prev_out else out
-        elif cmd == "n":  # add layer normalization
-            out_dtype = out.dtype
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float32")
-            out = layers.layer_norm(out,
-                                    begin_norm_axis=len(out.shape) - 1,
-                                    param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
-                                                               initializer=fluid.initializer.Constant(1.)),
-                                    bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
-                                                              initializer=fluid.initializer.Constant(0.)))
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float16")
-        elif cmd == "d":  # add dropout
-            if dropout_rate:
-                out = layers.dropout(out,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-    return out
-
-
-pre_process_layer = partial(pre_post_process_layer, None)
-post_process_layer = pre_post_process_layer
-
-
-def encoder_layer(enc_input,
-                  attn_bias,
-                  n_head,
-                  d_key,
-                  d_value,
-                  d_model,
-                  d_inner_hid,
-                  prepostprocess_dropout,
-                  attention_dropout,
-                  relu_dropout,
-                  hidden_act,
-                  preprocess_cmd="n",
-                  postprocess_cmd="da",
-                  param_initializer=None,
-                  name=''):
-    """The encoder layers that can be stacked to form a deep encoder.
-    This module consists of a multi-head (self) attention followed by
-    position-wise feed-forward networks, with both components accompanied
-    by the post_process_layer to add residual connection, layer normalization
-    and dropout.
-    """
-    attn_output = multi_head_attention(pre_process_layer(enc_input,
-                                                         preprocess_cmd,
-                                                         prepostprocess_dropout,
-                                                         name=name + '_pre_att'),
-                                       None,
-                                       None,
-                                       attn_bias,
-                                       d_key,
-                                       d_value,
-                                       d_model,
-                                       n_head,
-                                       attention_dropout,
-                                       param_initializer=param_initializer,
-                                       name=name + '_multi_head_att')
-    attn_output = post_process_layer(enc_input,
-                                     attn_output,
-                                     postprocess_cmd,
-                                     prepostprocess_dropout,
-                                     name=name + '_post_att')
-    ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
-                                                             preprocess_cmd,
-                                                             prepostprocess_dropout,
-                                                             name=name + '_pre_ffn'),
-                                           d_inner_hid,
-                                           d_model,
-                                           relu_dropout,
-                                           hidden_act,
-                                           param_initializer=param_initializer,
-                                           name=name + '_ffn')
-    return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
-
-
-def encoder(enc_input,
-            attn_bias,
-            n_layer,
-            n_head,
-            d_key,
-            d_value,
-            d_model,
-            d_inner_hid,
-            prepostprocess_dropout,
-            attention_dropout,
-            relu_dropout,
-            hidden_act,
-            preprocess_cmd="n",
-            postprocess_cmd="da",
-            param_initializer=None,
-            name=''):
-    """
-    The encoder is composed of a stack of identical layers returned by calling
-    encoder_layer.
-    """
-    for i in range(n_layer):
-        enc_output = encoder_layer(enc_input,
-                                   attn_bias,
-                                   n_head,
-                                   d_key,
-                                   d_value,
-                                   d_model,
-                                   d_inner_hid,
-                                   prepostprocess_dropout,
-                                   attention_dropout,
-                                   relu_dropout,
-                                   hidden_act,
-                                   preprocess_cmd,
-                                   postprocess_cmd,
-                                   param_initializer=param_initializer,
-                                   name=name + '_layer_' + str(i))
-        enc_input = enc_output
-    enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
-
-    return enc_output
--- a/modules/text/language_model/bert_cased_L_12_H_768_A_12/module.py
+++ b/modules/text/language_model/bert_cased_L_12_H_768_A_12/module.py
-# coding:utf-8
-# Copyright (c) 2019  PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import os
-
-from paddlehub import TransformerModule
-from paddlehub.module.module import moduleinfo
-
-from bert_cased_L_12_H_768_A_12.model.bert import BertConfig, BertModel
-
-
-@moduleinfo(
-    name="bert_cased_L-12_H-768_A-12",
-    version="1.1.0",
-    summary="bert_cased_L-12_H-768_A-12, 12-layer, 768-hidden, 12-heads, 110M parameters",
-    author="paddlepaddle",
-    author_email="paddle-dev@baidu.com",
-    type="nlp/semantic_model",
-)
-class Bert(TransformerModule):
-    def _initialize(self):
-        self.MAX_SEQ_LEN = 512
-        self.params_path = os.path.join(self.directory, "assets", "params")
-        self.vocab_path = os.path.join(self.directory, "assets", "vocab.txt")
-
-        bert_config_path = os.path.join(self.directory, "assets", "bert_config.json")
-        self.bert_config = BertConfig(bert_config_path)
-
-    def net(self, input_ids, position_ids, segment_ids, input_mask):
-        """
-        Create the neural network.
-
-        Args:
-            input_ids (tensor): the word ids.
-            position_ids (tensor): the position ids.
-            segment_ids (tensor): the segment ids.
-            input_mask (tensor): the padding mask.
-
-        Returns:
-            pooled_output (tensor):  sentence-level output for classification task.
-            sequence_output (tensor): token-level output for sequence task.
-        """
-        bert = BertModel(src_ids=input_ids,
-                         position_ids=position_ids,
-                         sentence_ids=segment_ids,
-                         input_mask=input_mask,
-                         config=self.bert_config,
-                         use_fp16=False)
-        pooled_output = bert.get_pooled_output()
-        sequence_output = bert.get_sequence_output()
-        return pooled_output, sequence_output
-
-
-if __name__ == '__main__':
-    test_module = Bert()
--- a/modules/text/language_model/bert_cased_L_24_H_1024_A_16/README.md
+++ b/modules/text/language_model/bert_cased_L_24_H_1024_A_16/README.md
-```shell
-$ hub install bert_cased_L-24_H-1024_A-16==1.1.0
-```
-<p align="center">
-<img src="https://bj.bcebos.com/paddlehub/paddlehub-img/bert_network.png"  hspace='10'/> <br />
-</p>
-
-For more details, see the [BERT paper](https://arxiv.org/abs/1810.04805).
-
-## API
-```python
-def context(
-    trainable=True,
-    max_seq_len=128
-)
-```
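-Retrieves the Module's context: its inputs, its outputs, and a copy of the pretrained Paddle Program.  
-
-**Parameters**  
-
-> trainable: when set to True, the Module's parameters are also updated during fine-tuning; otherwise they stay fixed.  
-> max_seq_len: the maximum sequence length of the BERT model. Shorter sequences are padded to **max_seq_len**, and longer sequences are truncated to **max_seq_len**. Valid values range from 0 to 512.  
-
-**Returns**  
-> inputs: a dict with the following fields:  
-> >**input_ids** holds the word ids, in the BERT vocabulary, of the tokens produced by tokenizing the input text; shape \[batch_size, max_seq_len\], dtype int64.  
-> >**position_ids** holds the position of each token within its text; shape \[batch_size, max_seq_len\], dtype int64.  
-> >**segment_ids** holds the marker of which text each token belongs to (text 1 or text 2); shape \[batch_size, max_seq_len\], dtype int64.  
-> >**input_mask** holds the flag indicating whether each token is padding; shape \[batch_size, max_seq_len\], dtype int64.  
->
-> outputs: a dict containing the Module's output features, with the following fields:  
-> >**pooled_output** holds the sentence-level feature, usable for tasks such as text classification; shape \[batch_size, 1024\], dtype float32.  
-> >**sequence_output** holds the token-level feature, usable for tasks such as sequence labeling; shape \[batch_size, seq_len, 1024\], dtype float32.  
->
-> program: the Program containing the Module's computation graph.  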
-
-
-
-```python
-def get_embedding(
-    texts,
-    use_gpu=False,
-    batch_size=1
-)
-```
-
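-Retrieves the sentence-level and token-level features of the input texts.
-
-**Parameters**
-
-> texts: a list of input texts in the form \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b.
-> use_gpu: whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.
-
-**Returns**
-
-> results: a list in the form \[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\], where each element is the feature output of the corresponding sample and each sample has a sentence-level feature pooled\_feature and a token-level feature seq\_feature.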
->
-
-```python
-def get_params_layer()
-```
-
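-Retrieves the parameter-layer information. Used together with ULMFiTStrategy, it allows discriminative (per-layer) learning rates and gradual unfreezing to be set strictly by layer index.
-
-**Parameters**
-
-> None
-
-**Returns**
-
-> params_layer: a dict whose keys are parameter names and whose values are the index of the layer each parameter belongs to.
-
-**Code example**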
-
-```python
-import paddlehub as hub
-
-# Load the pretrained model (install it first with: hub install bert_cased_L-24_H-1024_A-16)
-module = hub.Module(name="bert_cased_L-24_H-1024_A-16")
-inputs, outputs, program = module.context(trainable=True, max_seq_len=128)
-
-# All input tensors required by the bert_cased_L-24_H-1024_A-16 module must be fed
-input_ids = inputs["input_ids"]
-position_ids = inputs["position_ids"]
-segment_ids = inputs["segment_ids"]
-input_mask = inputs["input_mask"]
-
-# Use "pooled_output" for sentence-level output.
-pooled_output = outputs["pooled_output"]
-
-# Use "sequence_output" for token-level output.
-sequence_output = outputs["sequence_output"]
-
-# Use "get_embedding" to get embedding result.
-embedding_result = module.get_embedding(texts=[["Sample1_text_a"],["Sample2_text_a","Sample2_text_b"]], use_gpu=True)
-
-# Use "get_params_layer" to get the parameter layers and pass them to ULMFiTStrategy.
-params_layer = module.get_params_layer()
-strategy = hub.finetune.strategy.ULMFiTStrategy(frz_params_layer=params_layer, dis_params_layer=params_layer)
-```
-
-## View the code
-
-https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/pretrain_langauge_models/BERT
-
-
-## Dependencies
-
-paddlepaddle >= 1.6.2
-
-paddlehub >= 1.6.0
-
-## Release history
-
-* 1.0.0
-
-  Initial release
-
-* 1.1.0
-
-  Added support for get_embedding and get_params_layer
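
Reviewer note: the removed README describes padding and truncation to `max_seq_len` only in prose. The sketch below illustrates that behaviour in plain Python under stated assumptions; it is not the module's tokenizer. The helper name `pad_or_truncate` is hypothetical, `pad_id=0` follows the "padding id in vocabulary must be set to 0" comment in `bert.py`, and the mask convention (1 for real tokens, 0 for padding) is assumed from how `input_mask` is used to build the attention mask.

```python
def pad_or_truncate(token_ids, max_seq_len, pad_id=0):
    """Return (input_ids, input_mask), each exactly max_seq_len long.

    Hypothetical helper: a real tokenizer would typically also keep the
    trailing separator token when truncating; that detail is omitted here.
    """
    kept = token_ids[:max_seq_len]          # truncate sequences that are too long
    n_pad = max_seq_len - len(kept)         # number of padding slots needed
    input_ids = kept + [pad_id] * n_pad     # pad short sequences with pad_id
    input_mask = [1] * len(kept) + [0] * n_pad  # assumed: 1 = real token, 0 = padding
    return input_ids, input_mask


short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_seq_len=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_seq_len=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Both results have length `max_seq_len`, which is what lets every batch share the `[batch_size, max_seq_len]` shape documented for the input tensors.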
--- a/modules/text/language_model/bert_cased_L_24_H_1024_A_16/model/bert.py
+++ b/modules/text/language_model/bert_cased_L_24_H_1024_A_16/model/bert.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""BERT model."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import six
-import json
-
-import paddle.fluid as fluid
-
-from bert_cased_L_24_H_1024_A_16.model.transformer_encoder import encoder, pre_process_layer
-
-
-class BertConfig(object):
-    def __init__(self, config_path):
-        self._config_dict = self._parse(config_path)
-
-    def _parse(self, config_path):
-        try:
-            with open(config_path) as json_file:
-                config_dict = json.load(json_file)
-        except Exception:
-            raise IOError("Error in parsing bert model config file '%s'" % config_path)
-        else:
-            return config_dict
-
-    def __getitem__(self, key):
-        return self._config_dict[key]
-
-    def print_config(self):
-        for arg, value in sorted(six.iteritems(self._config_dict)):
-            print('%s: %s' % (arg, value))
-        print('------------------------------------------------')
-
-
-class BertModel(object):
-    def __init__(self, src_ids, position_ids, sentence_ids, input_mask, config, weight_sharing=True, use_fp16=False):
-
-        self._emb_size = config['hidden_size']
-        self._n_layer = config['num_hidden_layers']
-        self._n_head = config['num_attention_heads']
-        self._voc_size = config['vocab_size']
-        self._max_position_seq_len = config['max_position_embeddings']
-        self._sent_types = config['type_vocab_size']
-        self._hidden_act = config['hidden_act']
-        self._prepostprocess_dropout = config['hidden_dropout_prob']
-        self._attention_dropout = config['attention_probs_dropout_prob']
-        self._weight_sharing = weight_sharing
-
-        self._word_emb_name = "word_embedding"
-        self._pos_emb_name = "pos_embedding"
-        self._sent_emb_name = "sent_embedding"
-        self._dtype = "float16" if use_fp16 else "float32"
-
-        # Initialize all weights with a truncated normal initializer; all biases
-        # are initialized to constant zero by default.
-        self._param_initializer = fluid.initializer.TruncatedNormal(scale=config['initializer_range'])
-
-        self._build_model(src_ids, position_ids, sentence_ids, input_mask)
-
-    def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
-        # padding id in vocabulary must be set to 0
-        emb_out = fluid.layers.embedding(input=src_ids,
-                                         size=[self._voc_size, self._emb_size],
-                                         dtype=self._dtype,
-                                         param_attr=fluid.ParamAttr(name=self._word_emb_name,
-                                                                    initializer=self._param_initializer),
-                                         is_sparse=False)
-        position_emb_out = fluid.layers.embedding(input=position_ids,
-                                                  size=[self._max_position_seq_len, self._emb_size],
-                                                  dtype=self._dtype,
-                                                  param_attr=fluid.ParamAttr(name=self._pos_emb_name,
-                                                                             initializer=self._param_initializer))
-
-        sent_emb_out = fluid.layers.embedding(sentence_ids,
-                                              size=[self._sent_types, self._emb_size],
-                                              dtype=self._dtype,
-                                              param_attr=fluid.ParamAttr(name=self._sent_emb_name,
-                                                                         initializer=self._param_initializer))
-
-        emb_out = emb_out + position_emb_out
-        emb_out = emb_out + sent_emb_out
-
-        emb_out = pre_process_layer(emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
-
-        if self._dtype == "float16":
-            input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
-
-        self_attn_mask = fluid.layers.matmul(x=input_mask, y=input_mask, transpose_y=True)
-        self_attn_mask = fluid.layers.scale(x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
-        n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
-        n_head_self_attn_mask.stop_gradient = True
-
-        self._enc_out = encoder(enc_input=emb_out,
-                                attn_bias=n_head_self_attn_mask,
-                                n_layer=self._n_layer,
-                                n_head=self._n_head,
-                                d_key=self._emb_size // self._n_head,
-                                d_value=self._emb_size // self._n_head,
-                                d_model=self._emb_size,
-                                d_inner_hid=self._emb_size * 4,
-                                prepostprocess_dropout=self._prepostprocess_dropout,
-                                attention_dropout=self._attention_dropout,
-                                relu_dropout=0,
-                                hidden_act=self._hidden_act,
-                                preprocess_cmd="",
-                                postprocess_cmd="dan",
-                                param_initializer=self._param_initializer,
-                                name='encoder')
-
-    def get_sequence_output(self):
-        return self._enc_out
-
-    def get_pooled_output(self):
-        """Get the first feature of each sequence for classification"""
-
-        next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
-        next_sent_feat = fluid.layers.fc(input=next_sent_feat,
-                                         size=self._emb_size,
-                                         act="tanh",
-                                         param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
-                                                                    initializer=self._param_initializer),
-                                         bias_attr="pooled_fc.b_0")
-        return next_sent_feat
-
-    def get_pretraining_output(self, mask_label, mask_pos, labels):
-        """Get the loss & accuracy for pretraining"""
-
-        mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32')
-
-        # extract the first token feature in each sentence
-        next_sent_feat = self.get_pooled_output()
-        reshaped_emb_out = fluid.layers.reshape(x=self._enc_out, shape=[-1, self._emb_size])
-        # extract masked tokens' feature
-        mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
-
-        # transform: fc
-        mask_trans_feat = fluid.layers.fc(input=mask_feat,
-                                          size=self._emb_size,
-                                          act=self._hidden_act,
-                                          param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
-                                                                     initializer=self._param_initializer),
-                                          bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
-        # transform: layer norm
-        mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
-
-        mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
-                                                initializer=fluid.initializer.Constant(value=0.0))
-        if self._weight_sharing:
-            fc_out = fluid.layers.matmul(x=mask_trans_feat,
-                                         y=fluid.default_main_program().global_block().var(self._word_emb_name),
-                                         transpose_y=True)
-            fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
-                                                    dtype=self._dtype,
-                                                    attr=mask_lm_out_bias_attr,
-                                                    is_bias=True)
-
-        else:
-            fc_out = fluid.layers.fc(input=mask_trans_feat,
-                                     size=self._voc_size,
-                                     param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
-                                                                initializer=self._param_initializer),
-                                     bias_attr=mask_lm_out_bias_attr)
-
-        mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
-        mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
-
-        next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
-                                           size=2,
-                                           param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
-                                                                      initializer=self._param_initializer),
-                                           bias_attr="next_sent_fc.b_0")
-
-        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
-                                                                                    label=labels,
-                                                                                    return_softmax=True)
-
-        next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
-
-        mean_next_sent_loss = fluid.layers.mean(next_sent_loss)
-
-        loss = mean_next_sent_loss + mean_mask_lm_loss
-        return next_sent_acc, mean_mask_lm_loss, loss
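
Reviewer note: `_build_model` in the removed `bert.py` turns the padding mask into an additive attention bias with a `matmul` followed by `scale(scale=10000.0, bias=-1.0, bias_after_scale=False)`, which computes `(x + bias) * scale`. The plain-Python sketch below reproduces that arithmetic for a single sequence; the function name `attention_bias` is hypothetical and the batch and head dimensions are left out for brevity.

```python
def attention_bias(input_mask, scale=10000.0, bias=-1.0):
    """Build the additive attention bias for one sequence.

    input_mask[i] is 1.0 for a real token and 0.0 for padding. The outer
    product is 1.0 only where both positions are real tokens; applying
    (x + bias) * scale then maps 1.0 -> 0.0 and 0.0 -> -10000.0.
    """
    n = len(input_mask)
    return [[(input_mask[i] * input_mask[j] + bias) * scale for j in range(n)]
            for i in range(n)]


# Two real tokens followed by one padding position.
bias_matrix = attention_bias([1.0, 1.0, 0.0])
for row in bias_matrix:
    print(row)
```

Adding this matrix to the attention logits before the softmax leaves pairs of real tokens untouched and pushes any pair involving padding to a large negative value, so padded positions receive close to zero attention weight.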
--- a/modules/text/language_model/bert_cased_L_24_H_1024_A_16/model/transformer_encoder.py
+++ b/modules/text/language_model/bert_cased_L_24_H_1024_A_16/model/transformer_encoder.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Transformer encoder."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from functools import partial
-
-import paddle.fluid as fluid
-import paddle.fluid.layers as layers
-
-
-def multi_head_attention(queries,
-                         keys,
-                         values,
-                         attn_bias,
-                         d_key,
-                         d_value,
-                         d_model,
-                         n_head=1,
-                         dropout_rate=0.,
-                         cache=None,
-                         param_initializer=None,
-                         name='multi_head_att'):
-    """
-    Multi-Head Attention. Note that attn_bias is added to the logits before
-    computing the softmax activation to mask selected positions so that
-    they are not considered in the attention weights.
-    """
-    keys = queries if keys is None else keys
-    values = keys if values is None else values
-
-    if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
-        raise ValueError("Inputs: queries, keys and values should all be 3-D tensors.")
-
-    def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
-        """
-        Add linear projection to queries, keys, and values.
-        """
-        q = layers.fc(input=queries,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_query_fc.b_0')
-        k = layers.fc(input=keys,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_key_fc.b_0')
-        v = layers.fc(input=values,
-                      size=d_value * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_value_fc.b_0')
-        return q, k, v
-
-    def __split_heads(x, n_head):
-        """
-        Reshape the last dimension of input tensor x so that it becomes two
-        dimensions and then transpose. Specifically, input a tensor with shape
-        [bs, max_sequence_length, n_head * hidden_dim] then output a tensor
-        with shape [bs, n_head, max_sequence_length, hidden_dim].
-        """
-        hidden_size = x.shape[-1]
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
-
-        # permute the dimensions into:
-        # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
-        return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
-
-    def __combine_heads(x):
-        """
-        Transpose and then reshape the last two dimensions of input tensor x
-        so that they become one dimension, which is the reverse of __split_heads.
-        """
-        if len(x.shape) == 3: return x
-        if len(x.shape) != 4:
-            raise ValueError("Input(x) should be a 4-D Tensor.")
-
-        trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
-
-    def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
-        """
-        Scaled Dot-Product Attention
-        """
-        scaled_q = layers.scale(x=q, scale=d_key**-0.5)
-        product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
-        if attn_bias:
-            product += attn_bias
-        weights = layers.softmax(product)
-        if dropout_rate:
-            weights = layers.dropout(weights,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-        out = layers.matmul(weights, v)
-        return out
-
-    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
-
-    if cache is not None:  # use cache and concat time steps
-        # Since the inplace reshape in __split_heads changes the shape of k and
-        # v, which is the cache input for next time step, reshape the cache
-        # input from the previous time step first.
-        k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
-        v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
-
-    q = __split_heads(q, n_head)
-    k = __split_heads(k, n_head)
-    v = __split_heads(v, n_head)
-
-    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
-
-    out = __combine_heads(ctx_multiheads)
-
-    # Project back to the model size.
-    proj_out = layers.fc(input=out,
-                         size=d_model,
-                         num_flatten_dims=2,
-                         param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
-                         bias_attr=name + '_output_fc.b_0')
-    return proj_out
-
-
-def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
-    """
-    Position-wise Feed-Forward Networks.
-    This module consists of two linear transformations with an activation
-    (selected by hidden_act) in between, applied to each position separately
-    and identically.
-    """
-    hidden = layers.fc(input=x,
-                       size=d_inner_hid,
-                       num_flatten_dims=2,
-                       act=hidden_act,
-                       param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
-                       bias_attr=name + '_fc_0.b_0')
-    if dropout_rate:
-        hidden = layers.dropout(hidden,
-                                dropout_prob=dropout_rate,
-                                dropout_implementation="upscale_in_train",
-                                is_test=False)
-    out = layers.fc(input=hidden,
-                    size=d_hid,
-                    num_flatten_dims=2,
-                    param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
-                    bias_attr=name + '_fc_1.b_0')
-    return out
-
-
-def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
-    """
-    Add residual connection, layer normalization and dropout to the out tensor
-    optionally according to the value of process_cmd.
-    This will be used before or after multi-head attention and position-wise
-    feed-forward networks.
-    """
-    for cmd in process_cmd:
-        if cmd == "a":  # add residual connection
-            out = out + prev_out if prev_out else out
-        elif cmd == "n":  # add layer normalization
-            out_dtype = out.dtype
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float32")
-            out = layers.layer_norm(out,
-                                    begin_norm_axis=len(out.shape) - 1,
-                                    param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
-                                                               initializer=fluid.initializer.Constant(1.)),
-                                    bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
-                                                              initializer=fluid.initializer.Constant(0.)))
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float16")
-        elif cmd == "d":  # add dropout
-            if dropout_rate:
-                out = layers.dropout(out,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-    return out
-
-
-pre_process_layer = partial(pre_post_process_layer, None)
-post_process_layer = pre_post_process_layer
-
-
-def encoder_layer(enc_input,
-                  attn_bias,
-                  n_head,
-                  d_key,
-                  d_value,
-                  d_model,
-                  d_inner_hid,
-                  prepostprocess_dropout,
-                  attention_dropout,
-                  relu_dropout,
-                  hidden_act,
-                  preprocess_cmd="n",
-                  postprocess_cmd="da",
-                  param_initializer=None,
-                  name=''):
-    """The encoder layers that can be stacked to form a deep encoder.
-    This module consists of a multi-head (self) attention followed by
-    position-wise feed-forward networks, with both components accompanied
-    by the post_process_layer to add residual connection, layer normalization
-    and dropout.
-    """
-    attn_output = multi_head_attention(pre_process_layer(enc_input,
-                                                         preprocess_cmd,
-                                                         prepostprocess_dropout,
-                                                         name=name + '_pre_att'),
-                                       None,
-                                       None,
-                                       attn_bias,
-                                       d_key,
-                                       d_value,
-                                       d_model,
-                                       n_head,
-                                       attention_dropout,
-                                       param_initializer=param_initializer,
-                                       name=name + '_multi_head_att')
-    attn_output = post_process_layer(enc_input,
-                                     attn_output,
-                                     postprocess_cmd,
-                                     prepostprocess_dropout,
-                                     name=name + '_post_att')
-    ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
-                                                             preprocess_cmd,
-                                                             prepostprocess_dropout,
-                                                             name=name + '_pre_ffn'),
-                                           d_inner_hid,
-                                           d_model,
-                                           relu_dropout,
-                                           hidden_act,
-                                           param_initializer=param_initializer,
-                                           name=name + '_ffn')
-    return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
-
-
-def encoder(enc_input,
-            attn_bias,
-            n_layer,
-            n_head,
-            d_key,
-            d_value,
-            d_model,
-            d_inner_hid,
-            prepostprocess_dropout,
-            attention_dropout,
-            relu_dropout,
-            hidden_act,
-            preprocess_cmd="n",
-            postprocess_cmd="da",
-            param_initializer=None,
-            name=''):
-    """
-    The encoder is composed of a stack of identical layers returned by calling
-    encoder_layer.
-    """
-    for i in range(n_layer):
-        enc_output = encoder_layer(enc_input,
-                                   attn_bias,
-                                   n_head,
-                                   d_key,
-                                   d_value,
-                                   d_model,
-                                   d_inner_hid,
-                                   prepostprocess_dropout,
-                                   attention_dropout,
-                                   relu_dropout,
-                                   hidden_act,
-                                   preprocess_cmd,
-                                   postprocess_cmd,
-                                   param_initializer=param_initializer,
-                                   name=name + '_layer_' + str(i))
-        enc_input = enc_output
-    enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
-
-    return enc_output
--- a/modules/text/language_model/bert_cased_L_24_H_1024_A_16/module.py
+++ b/modules/text/language_model/bert_cased_L_24_H_1024_A_16/module.py
-# coding:utf-8
-# Copyright (c) 2019  PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import os
-
-from paddlehub import TransformerModule
-from paddlehub.module.module import moduleinfo
-
-from bert_cased_L_24_H_1024_A_16.model.bert import BertConfig, BertModel
-
-
-@moduleinfo(
-    name="bert_cased_L-24_H-1024_A-16",
-    version="1.1.0",
-    summary="bert_cased_L-24_H-1024_A-16, 24-layer, 1024-hidden, 16-heads, 340M parameters ",
-    author="paddlepaddle",
-    author_email="paddle-dev@baidu.com",
-    type="nlp/semantic_model",
-)
-class Bert(TransformerModule):
-    def _initialize(self):
-        self.MAX_SEQ_LEN = 512
-        self.params_path = os.path.join(self.directory, "assets", "params")
-        self.vocab_path = os.path.join(self.directory, "assets", "vocab.txt")
-
-        bert_config_path = os.path.join(self.directory, "assets", "bert_config.json")
-        self.bert_config = BertConfig(bert_config_path)
-
-    def net(self, input_ids, position_ids, segment_ids, input_mask):
-        """
-        create neural network.
-
-        Args:
-            input_ids (tensor): the word ids.
-            position_ids (tensor): the position ids.
-            segment_ids (tensor): the segment ids.
-            input_mask (tensor): the padding mask.
-
-        Returns:
-            pooled_output (tensor):  sentence-level output for classification task.
-            sequence_output (tensor): token-level output for sequence task.
-        """
-        bert = BertModel(src_ids=input_ids,
-                         position_ids=position_ids,
-                         sentence_ids=segment_ids,
-                         input_mask=input_mask,
-                         config=self.bert_config,
-                         use_fp16=False)
-        pooled_output = bert.get_pooled_output()
-        sequence_output = bert.get_sequence_output()
-        return pooled_output, sequence_output
-
-
-if __name__ == '__main__':
-    test_module = Bert()
--- a/modules/text/language_model/bert_chinese_L_12_H_768_A_12/README.md
+++ b/modules/text/language_model/bert_chinese_L_12_H_768_A_12/README.md
-```shell
-$ hub install bert_chinese_L-12_H-768_A-12==1.1.0
-```
-<p align="center">
-<img src="https://bj.bcebos.com/paddlehub/paddlehub-img/bert_network.png"  hspace='10'/> <br />
-</p>
-
-For more details, see the [BERT paper](https://arxiv.org/abs/1810.04805).
-
-## API
-```python
-def context(
-    trainable=True,
-    max_seq_len=128
-)
-```
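-Returns the Module's context: its inputs, its outputs, and a copy of the pretrained Paddle Program.  
-
-**Parameters**  
-
-> trainable: when set to True, the Module's parameters are also trained during fine-tuning; otherwise they stay fixed.  
-> max_seq_len: the maximum sequence length of the BERT model. Shorter sequences are padded up to **max_seq_len**, and longer sequences are truncated to **max_seq_len**. Valid values range from 0 to 512.  
-
-**Returns**  
-> inputs: a dict with the following fields:  
-> >**input_ids**: the word ids, in the BERT vocabulary, of the tokens obtained by tokenizing the input text; shape \[batch_size, max_seq_len\], dtype int64.  
-> >**position_ids**: the position of each token within its text; shape \[batch_size, max_seq_len\], dtype int64.  
-> >**segment_ids**: the id of the text each token belongs to (text 1 or text 2); shape \[batch_size, max_seq_len\], dtype int64.  
-> >**input_mask**: a flag marking whether each token is padding; shape \[batch_size, max_seq_len\], dtype int64.  
->
-> outputs: a dict holding the Module's output features, with the following fields:  
-> >**pooled_output**: sentence-level features, usable for tasks such as text classification; shape \[batch_size, 768\], dtype float32.  
-> >**sequence_output**: token-level features, usable for tasks such as sequence labeling; shape \[batch_size, seq_len, 768\], dtype float32.  
->
-> program: the Program containing the Module's computation graph.  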
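
The padding and truncation rule described for `max_seq_len` can be sketched in plain Python. This is a minimal illustration, not the Module's actual preprocessing: `pad_or_truncate` is a hypothetical helper, the padding id 0 follows the comment in `bert.py`, the zero position ids on padded slots are an assumption, and the special tokens a real tokenizer adds are left out.

```python
def pad_or_truncate(token_ids, max_seq_len, pad_id=0):
    """Build the four fixed-length input fields for a single-text sample.

    Hypothetical helper for illustration only.
    """
    ids = list(token_ids[:max_seq_len])       # truncate sequences that are too long
    n_real = len(ids)
    n_pad = max_seq_len - n_real              # number of padding slots to append

    input_ids = ids + [pad_id] * n_pad
    position_ids = list(range(n_real)) + [0] * n_pad   # assumption: padded positions use 0
    segment_ids = [0] * max_seq_len                    # single text, so every token is in text 1
    input_mask = [1] * n_real + [0] * n_pad            # 1 = real token, 0 = padding
    return input_ids, position_ids, segment_ids, input_mask
```

Every field comes back with length `max_seq_len`, which is what lets a batch be stacked into the `[batch_size, max_seq_len]` shapes listed above.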
-
-
-
-```python
-def get_embedding(
-    texts,
-    use_gpu=False,
-    batch_size=1
-)
-```
-
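-Returns the sentence-level and token-level features of the input texts.
-
-**Parameters**
-
-> texts: a list of input texts in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b.
-> use_gpu: whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.
-
-**Returns**
-
-> results: a list in the format \[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\], where each element is the feature output of the corresponding sample, consisting of the sentence-level feature pooled\_feature and the token-level feature seq\_feature.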
->
-
-```python
-def get_params_layer()
-```
-
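-Returns the layer index of each parameter. Used together with ULMFiTStrategy, it lets layer-wise learning rates and gradual unfreezing follow the layer numbers exactly.
-
-**Parameters**
-
-> None
-
-**Returns**
-
-> params_layer: a dict whose keys are parameter names and whose values are the layer each parameter belongs to.
-
-**Code example**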
-
-```python
-import paddlehub as hub
-
-# Load the pretrained model (install it first with: hub install bert_chinese_L-12_H-768_A-12)
-module = hub.Module(name="bert_chinese_L-12_H-768_A-12")
-inputs, outputs, program = module.context(trainable=True, max_seq_len=128)
-
-# All input tensors required by the bert_chinese_L-12_H-768_A-12 module must be fed
-input_ids = inputs["input_ids"]
-position_ids = inputs["position_ids"]
-segment_ids = inputs["segment_ids"]
-input_mask = inputs["input_mask"]
-
-# Use "pooled_output" for sentence-level output.
-pooled_output = outputs["pooled_output"]
-
-# Use "sequence_output" for token-level output.
-sequence_output = outputs["sequence_output"]
-
-# Use "get_embedding" to get embedding result.
-embedding_result = module.get_embedding(texts=[["Sample1_text_a"],["Sample2_text_a","Sample2_text_b"]], use_gpu=True)
-
-# Use "get_params_layer" to get the per-parameter layer map and pass it to ULMFiTStrategy.
-params_layer = module.get_params_layer()
-strategy = hub.finetune.strategy.ULMFiTStrategy(frz_params_layer=params_layer, dis_params_layer=params_layer)
-```
-
-## View the code
-
-https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/pretrain_langauge_models/BERT
-
-
-
-## Dependencies
-
-paddlepaddle >= 1.6.2
-
-paddlehub >= 1.6.0
-
-## Release history
-
-* 1.0.0
-
-  Initial release
-
-* 1.1.0
-
-  Added support for get_embedding and get_params_layer
--- a/modules/text/language_model/bert_chinese_L_12_H_768_A_12/model/bert.py
+++ b/modules/text/language_model/bert_chinese_L_12_H_768_A_12/model/bert.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""BERT model."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import six
-import json
-
-import paddle.fluid as fluid
-
-from bert_chinese_L_12_H_768_A_12.model.transformer_encoder import encoder, pre_process_layer
-
-
-class BertConfig(object):
-    def __init__(self, config_path):
-        self._config_dict = self._parse(config_path)
-
-    def _parse(self, config_path):
-        try:
-            with open(config_path) as json_file:
-                config_dict = json.load(json_file)
-        except Exception:
-            raise IOError("Error in parsing bert model config file '%s'" % config_path)
-        else:
-            return config_dict
-
-    def __getitem__(self, key):
-        return self._config_dict[key]
-
-    def print_config(self):
-        for arg, value in sorted(six.iteritems(self._config_dict)):
-            print('%s: %s' % (arg, value))
-        print('------------------------------------------------')
-
-
-class BertModel(object):
-    def __init__(self, src_ids, position_ids, sentence_ids, input_mask, config, weight_sharing=True, use_fp16=False):
-
-        self._emb_size = config['hidden_size']
-        self._n_layer = config['num_hidden_layers']
-        self._n_head = config['num_attention_heads']
-        self._voc_size = config['vocab_size']
-        self._max_position_seq_len = config['max_position_embeddings']
-        self._sent_types = config['type_vocab_size']
-        self._hidden_act = config['hidden_act']
-        self._prepostprocess_dropout = config['hidden_dropout_prob']
-        self._attention_dropout = config['attention_probs_dropout_prob']
-        self._weight_sharing = weight_sharing
-
-        self._word_emb_name = "word_embedding"
-        self._pos_emb_name = "pos_embedding"
-        self._sent_emb_name = "sent_embedding"
-        self._dtype = "float16" if use_fp16 else "float32"
-
-        # Initialize all weights with a truncated normal initializer; all biases
-        # are initialized to constant zero by default.
-        self._param_initializer = fluid.initializer.TruncatedNormal(scale=config['initializer_range'])
-
-        self._build_model(src_ids, position_ids, sentence_ids, input_mask)
-
-    def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
-        # padding id in vocabulary must be set to 0
-        emb_out = fluid.layers.embedding(input=src_ids,
-                                         size=[self._voc_size, self._emb_size],
-                                         dtype=self._dtype,
-                                         param_attr=fluid.ParamAttr(name=self._word_emb_name,
-                                                                    initializer=self._param_initializer),
-                                         is_sparse=False)
-        position_emb_out = fluid.layers.embedding(input=position_ids,
-                                                  size=[self._max_position_seq_len, self._emb_size],
-                                                  dtype=self._dtype,
-                                                  param_attr=fluid.ParamAttr(name=self._pos_emb_name,
-                                                                             initializer=self._param_initializer))
-
-        sent_emb_out = fluid.layers.embedding(sentence_ids,
-                                              size=[self._sent_types, self._emb_size],
-                                              dtype=self._dtype,
-                                              param_attr=fluid.ParamAttr(name=self._sent_emb_name,
-                                                                         initializer=self._param_initializer))
-
-        emb_out = emb_out + position_emb_out
-        emb_out = emb_out + sent_emb_out
-
-        emb_out = pre_process_layer(emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
-
-        if self._dtype == "float16":
-            input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
-
-        self_attn_mask = fluid.layers.matmul(x=input_mask, y=input_mask, transpose_y=True)
-        self_attn_mask = fluid.layers.scale(x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
-        n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
-        n_head_self_attn_mask.stop_gradient = True
-
-        self._enc_out = encoder(enc_input=emb_out,
-                                attn_bias=n_head_self_attn_mask,
-                                n_layer=self._n_layer,
-                                n_head=self._n_head,
-                                d_key=self._emb_size // self._n_head,
-                                d_value=self._emb_size // self._n_head,
-                                d_model=self._emb_size,
-                                d_inner_hid=self._emb_size * 4,
-                                prepostprocess_dropout=self._prepostprocess_dropout,
-                                attention_dropout=self._attention_dropout,
-                                relu_dropout=0,
-                                hidden_act=self._hidden_act,
-                                preprocess_cmd="",
-                                postprocess_cmd="dan",
-                                param_initializer=self._param_initializer,
-                                name='encoder')
-
-    def get_sequence_output(self):
-        return self._enc_out
-
-    def get_pooled_output(self):
-        """Get the first feature of each sequence for classification"""
-
-        next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
-        next_sent_feat = fluid.layers.fc(input=next_sent_feat,
-                                         size=self._emb_size,
-                                         act="tanh",
-                                         param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
-                                                                    initializer=self._param_initializer),
-                                         bias_attr="pooled_fc.b_0")
-        return next_sent_feat
-
-    def get_pretraining_output(self, mask_label, mask_pos, labels):
-        """Get the loss & accuracy for pretraining"""
-
-        mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32')
-
-        # extract the first token feature in each sentence
-        next_sent_feat = self.get_pooled_output()
-        reshaped_emb_out = fluid.layers.reshape(x=self._enc_out, shape=[-1, self._emb_size])
-        # extract masked tokens' feature
-        mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
-
-        # transform: fc
-        mask_trans_feat = fluid.layers.fc(input=mask_feat,
-                                          size=self._emb_size,
-                                          act=self._hidden_act,
-                                          param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
-                                                                     initializer=self._param_initializer),
-                                          bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
-        # transform: layer norm
-        mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
-
-        mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
-                                                initializer=fluid.initializer.Constant(value=0.0))
-        if self._weight_sharing:
-            fc_out = fluid.layers.matmul(x=mask_trans_feat,
-                                         y=fluid.default_main_program().global_block().var(self._word_emb_name),
-                                         transpose_y=True)
-            fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
-                                                    dtype=self._dtype,
-                                                    attr=mask_lm_out_bias_attr,
-                                                    is_bias=True)
-
-        else:
-            fc_out = fluid.layers.fc(input=mask_trans_feat,
-                                     size=self._voc_size,
-                                     param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
-                                                                initializer=self._param_initializer),
-                                     bias_attr=mask_lm_out_bias_attr)
-
-        mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
-        mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
-
-        next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
-                                           size=2,
-                                           param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
-                                                                      initializer=self._param_initializer),
-                                           bias_attr="next_sent_fc.b_0")
-
-        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
-                                                                                    label=labels,
-                                                                                    return_softmax=True)
-
-        next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
-
-        mean_next_sent_loss = fluid.layers.mean(next_sent_loss)
-
-        loss = mean_next_sent_loss + mean_mask_lm_loss
-        return next_sent_acc, mean_mask_lm_loss, loss
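
The self-attention mask built in `_build_model` can be mirrored in NumPy to make the arithmetic visible. This is a sketch under stated assumptions, not the removed implementation: `attention_bias` is a hypothetical name, and `input_mask` is assumed to be a float array of shape `[batch, seq_len, 1]`. With `bias_after_scale=False`, `fluid.layers.scale` computes `scale * (x + bias)`, so valid token pairs get a bias of 0 and any pair involving padding gets -10000.

```python
import numpy as np


def attention_bias(input_mask, n_head):
    """NumPy mirror of the mask arithmetic in BertModel._build_model.

    input_mask: float array [batch, seq_len, 1], 1.0 for real tokens, 0.0 for padding
    (the shape is an assumption made for this sketch).
    """
    # The outer product marks pairs where both tokens are real: [batch, seq_len, seq_len]
    pair_mask = np.matmul(input_mask, input_mask.transpose(0, 2, 1))
    # scale(x, scale=10000.0, bias=-1.0, bias_after_scale=False) == 10000 * (x - 1)
    bias = (pair_mask - 1.0) * 10000.0
    # Repeat the same bias for every attention head: [batch, n_head, seq_len, seq_len]
    return np.stack([bias] * n_head, axis=1)
```

Adding this bias to the attention logits before the softmax drives the weight of padded positions toward zero, which is why the tensor is marked `stop_gradient`.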
--- a/modules/text/language_model/bert_chinese_L_12_H_768_A_12/model/transformer_encoder.py
+++ b/modules/text/language_model/bert_chinese_L_12_H_768_A_12/model/transformer_encoder.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Transformer encoder."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from functools import partial
-
-import paddle.fluid as fluid
-import paddle.fluid.layers as layers
-
-
-def multi_head_attention(queries,
-                         keys,
-                         values,
-                         attn_bias,
-                         d_key,
-                         d_value,
-                         d_model,
-                         n_head=1,
-                         dropout_rate=0.,
-                         cache=None,
-                         param_initializer=None,
-                         name='multi_head_att'):
-    """
-    Multi-Head Attention. Note that attn_bias is added to the logits before
-    computing the softmax activation to mask selected positions so that
-    they are not considered in the attention weights.
-    """
-    keys = queries if keys is None else keys
-    values = keys if values is None else values
-
-    if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
-        raise ValueError("Inputs: queries, keys and values should all be 3-D tensors.")
-
-    def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
-        """
-        Add linear projection to queries, keys, and values.
-        """
-        q = layers.fc(input=queries,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_query_fc.b_0')
-        k = layers.fc(input=keys,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_key_fc.b_0')
-        v = layers.fc(input=values,
-                      size=d_value * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_value_fc.b_0')
-        return q, k, v
-
-    def __split_heads(x, n_head):
-        """
-        Reshape the last dimension of input tensor x so that it becomes two
-        dimensions and then transpose. Specifically, input a tensor with shape
-        [bs, max_sequence_length, n_head * hidden_dim] then output a tensor
-        with shape [bs, n_head, max_sequence_length, hidden_dim].
-        """
-        hidden_size = x.shape[-1]
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
-
-        # permute the dimensions into:
-        # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
-        return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
-
-    def __combine_heads(x):
-        """
-        Transpose and then reshape the last two dimensions of input tensor x
-        so that it becomes one dimension, which is reverse to __split_heads.
-        """
-        if len(x.shape) == 3: return x
-        if len(x.shape) != 4:
-            raise ValueError("Input(x) should be a 4-D Tensor.")
-
-        trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
-
-    def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
-        """
-        Scaled Dot-Product Attention
-        """
-        scaled_q = layers.scale(x=q, scale=d_key**-0.5)
-        product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
-        if attn_bias is not None:
-            product += attn_bias
-        weights = layers.softmax(product)
-        if dropout_rate:
-            weights = layers.dropout(weights,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-        out = layers.matmul(weights, v)
-        return out
-
-    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
-
-    if cache is not None:  # use cache and concat time steps
-        # Since the inplace reshape in __split_heads changes the shape of k and
-        # v, which is the cache input for next time step, reshape the cache
-        # input from the previous time step first.
-        k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
-        v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
-
-    q = __split_heads(q, n_head)
-    k = __split_heads(k, n_head)
-    v = __split_heads(v, n_head)
-
-    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
-
-    out = __combine_heads(ctx_multiheads)
-
-    # Project back to the model size.
-    proj_out = layers.fc(input=out,
-                         size=d_model,
-                         num_flatten_dims=2,
-                         param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
-                         bias_attr=name + '_output_fc.b_0')
-    return proj_out
-
-
-def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
-    """
-    Position-wise Feed-Forward Networks.
-    This module consists of two linear transformations with an activation
-    (hidden_act) in between, applied to each position separately and identically.
-    """
-    hidden = layers.fc(input=x,
-                       size=d_inner_hid,
-                       num_flatten_dims=2,
-                       act=hidden_act,
-                       param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
-                       bias_attr=name + '_fc_0.b_0')
-    if dropout_rate:
-        hidden = layers.dropout(hidden,
-                                dropout_prob=dropout_rate,
-                                dropout_implementation="upscale_in_train",
-                                is_test=False)
-    out = layers.fc(input=hidden,
-                    size=d_hid,
-                    num_flatten_dims=2,
-                    param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
-                    bias_attr=name + '_fc_1.b_0')
-    return out
-
-
-def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
-    """
-    Add residual connection, layer normalization and dropout to the out tensor
-    optionally according to the value of process_cmd.
-    This will be used before or after multi-head attention and position-wise
-    feed-forward networks.
-    """
-    for cmd in process_cmd:
-        if cmd == "a":  # add residual connection
-            out = out + prev_out if prev_out is not None else out
-        elif cmd == "n":  # add layer normalization
-            out_dtype = out.dtype
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float32")
-            out = layers.layer_norm(out,
-                                    begin_norm_axis=len(out.shape) - 1,
-                                    param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
-                                                               initializer=fluid.initializer.Constant(1.)),
-                                    bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
-                                                              initializer=fluid.initializer.Constant(0.)))
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float16")
-        elif cmd == "d":  # add dropout
-            if dropout_rate:
-                out = layers.dropout(out,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-    return out
-
-
-pre_process_layer = partial(pre_post_process_layer, None)
-post_process_layer = pre_post_process_layer
-
-
-def encoder_layer(enc_input,
-                  attn_bias,
-                  n_head,
-                  d_key,
-                  d_value,
-                  d_model,
-                  d_inner_hid,
-                  prepostprocess_dropout,
-                  attention_dropout,
-                  relu_dropout,
-                  hidden_act,
-                  preprocess_cmd="n",
-                  postprocess_cmd="da",
-                  param_initializer=None,
-                  name=''):
-    """The encoder layers that can be stacked to form a deep encoder.
-    This module consists of a multi-head (self) attention followed by
-    position-wise feed-forward networks, with both components accompanied
-    by the post_process_layer to add residual connection, layer normalization
-    and dropout.
-    """
-    attn_output = multi_head_attention(pre_process_layer(enc_input,
-                                                         preprocess_cmd,
-                                                         prepostprocess_dropout,
-                                                         name=name + '_pre_att'),
-                                       None,
-                                       None,
-                                       attn_bias,
-                                       d_key,
-                                       d_value,
-                                       d_model,
-                                       n_head,
-                                       attention_dropout,
-                                       param_initializer=param_initializer,
-                                       name=name + '_multi_head_att')
-    attn_output = post_process_layer(enc_input,
-                                     attn_output,
-                                     postprocess_cmd,
-                                     prepostprocess_dropout,
-                                     name=name + '_post_att')
-    ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
-                                                             preprocess_cmd,
-                                                             prepostprocess_dropout,
-                                                             name=name + '_pre_ffn'),
-                                           d_inner_hid,
-                                           d_model,
-                                           relu_dropout,
-                                           hidden_act,
-                                           param_initializer=param_initializer,
-                                           name=name + '_ffn')
-    return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
-
-
-def encoder(enc_input,
-            attn_bias,
-            n_layer,
-            n_head,
-            d_key,
-            d_value,
-            d_model,
-            d_inner_hid,
-            prepostprocess_dropout,
-            attention_dropout,
-            relu_dropout,
-            hidden_act,
-            preprocess_cmd="n",
-            postprocess_cmd="da",
-            param_initializer=None,
-            name=''):
-    """
-    The encoder is composed of a stack of identical layers returned by calling
-    encoder_layer.
-    """
-    for i in range(n_layer):
-        enc_output = encoder_layer(enc_input,
-                                   attn_bias,
-                                   n_head,
-                                   d_key,
-                                   d_value,
-                                   d_model,
-                                   d_inner_hid,
-                                   prepostprocess_dropout,
-                                   attention_dropout,
-                                   relu_dropout,
-                                   hidden_act,
-                                   preprocess_cmd,
-                                   postprocess_cmd,
-                                   param_initializer=param_initializer,
-                                   name=name + '_layer_' + str(i))
-        enc_input = enc_output
-    enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
-
-    return enc_output
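
The reshape-and-transpose bookkeeping done by `__split_heads` and `__combine_heads` is easier to follow on concrete shapes. The NumPy sketch below is illustrative only: `split_heads` and `combine_heads` are stand-in names, and it uses explicit dimensions instead of the `0` placeholders that `layers.reshape` accepts.

```python
import numpy as np


def split_heads(x, n_head):
    """[batch, seq_len, n_head * d_head] -> [batch, n_head, seq_len, d_head]."""
    batch, seq_len, hidden = x.shape
    reshaped = x.reshape(batch, seq_len, n_head, hidden // n_head)
    return reshaped.transpose(0, 2, 1, 3)


def combine_heads(x):
    """Inverse of split_heads: [batch, n_head, seq_len, d_head] -> [batch, seq_len, n_head * d_head]."""
    batch, n_head, seq_len, d_head = x.shape
    return x.transpose(0, 2, 1, 3).reshape(batch, seq_len, n_head * d_head)
```

Because the two functions are exact inverses, splitting and then combining returns the original tensor, so the attention output can be projected back to `d_model` without losing track of which features came from which head.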
--- a/modules/text/language_model/bert_chinese_L_12_H_768_A_12/module.py
+++ b/modules/text/language_model/bert_chinese_L_12_H_768_A_12/module.py
-# coding:utf-8
-# Copyright (c) 2019  PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import os
-
-from paddlehub import TransformerModule
-from paddlehub.module.module import moduleinfo
-
-from bert_chinese_L_12_H_768_A_12.model.bert import BertConfig, BertModel
-
-
-@moduleinfo(
-    name="bert_chinese_L-12_H-768_A-12",
-    version="1.1.0",
-    summary="bert_chinese_L-12_H-768_A-12, 12-layer, 768-hidden, 12-heads, 110M parameters ",
-    author="paddlepaddle",
-    author_email="paddle-dev@baidu.com",
-    type="nlp/semantic_model",
-)
-class BertChinese(TransformerModule):
-    def _initialize(self):
-        self.MAX_SEQ_LEN = 512
-        self.params_path = os.path.join(self.directory, "assets", "params")
-        self.vocab_path = os.path.join(self.directory, "assets", "vocab.txt")
-
-        bert_config_path = os.path.join(self.directory, "assets", "bert_config.json")
-        self.bert_config = BertConfig(bert_config_path)
-
-    def net(self, input_ids, position_ids, segment_ids, input_mask):
-        """
-        create neural network.
-
-        Args:
-            input_ids (tensor): the word ids.
-            position_ids (tensor): the position ids.
-            segment_ids (tensor): the segment ids.
-            input_mask (tensor): the padding mask.
-
-        Returns:
-            pooled_output (tensor):  sentence-level output for classification task.
-            sequence_output (tensor): token-level output for sequence task.
-        """
-        bert = BertModel(src_ids=input_ids,
-                         position_ids=position_ids,
-                         sentence_ids=segment_ids,
-                         input_mask=input_mask,
-                         config=self.bert_config,
-                         use_fp16=False)
-        pooled_output = bert.get_pooled_output()
-        sequence_output = bert.get_sequence_output()
-        return pooled_output, sequence_output
-
-
-if __name__ == '__main__':
-    test_module = BertChinese()
--- a/modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/README.md
+++ b/modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/README.md
-```shell
-$ hub install bert_multi_cased_L-12_H-768_A-12==1.1.0
-```
-<p align="center">
-<img src="https://bj.bcebos.com/paddlehub/paddlehub-img/bert_network.png"  hspace='10'/> <br />
-</p>
-
-For more details, see the [BERT paper](https://arxiv.org/abs/1810.04805).
-
-## API
-```python
-def context(
-    trainable=True,
-    max_seq_len=128
-)
-```
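-Returns the Module's context: its inputs, its outputs, and a copy of the pretrained Paddle Program.  
-
-**Parameters**  
-
-> trainable: when set to True, the Module's parameters are also updated during fine-tuning; otherwise they stay unchanged.  
-> max_seq_len: the maximum sequence length of the BERT model. Shorter sequences are padded to **max_seq_len** and longer ones are truncated to **max_seq_len**. Valid values range from 0 to 512.  
-
-**Returns**  
-> inputs: a dict with the following fields:  
-> >**input_ids**: the word ids in the BERT vocabulary of each token of the tokenized input text; shape \[batch_size, max_seq_len\], dtype int64.  
-> >**position_ids**: the position of each token within the tokenized input text; shape \[batch_size, max_seq_len\], dtype int64.  
-> >**segment_ids**: which text each token belongs to (text 1 or text 2); shape \[batch_size, max_seq_len\], dtype int64.  
-> >**input_mask**: marks whether each token is padding; shape \[batch_size, max_seq_len\], dtype int64.  
->
-> outputs: a dict holding the Module's output features, with the following fields:  
-> >**pooled_output**: sentence-level features, usable for tasks such as text classification; shape \[batch_size, 768\], dtype float32.  
-> >**sequence_output**: token-level features, usable for tasks such as sequence labeling; shape \[batch_size, seq_len, 768\], dtype float32.  
->
-> program: the Program containing the Module's computation graph.  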
-
-
-
-```python
-def get_embedding(
-    texts,
-    use_gpu=False,
-    batch_size=1
-)
-```
-
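-Returns the sentence-level and token-level features of the input texts.
-
-**Parameters**
-
-> texts: a list of input texts in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b.
-> use_gpu: whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.
-> batch_size: the batch size; defaults to 1.
-
-**Returns**
-
-> results: a list in the format \[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\], where each element is the feature output of the corresponding sample, and each sample has a sentence-level feature pooled\_feature and a token-level feature seq\_feature.
->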
-
-```python
-def get_params_layer()
-```
-
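-Returns the layer of each parameter. Used together with ULMFiTStrategy, it allows per-layer learning rates and layer-by-layer unfreezing to follow the layer numbers exactly.
-
-**Parameters**
-
-> None
-
-**Returns**
-
-> params_layer: a dict mapping each parameter name to the number of the layer that parameter belongs to.
-
-**Code example**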
-
-```python
-import paddlehub as hub
-
-# Load the pretrained model (install it first with: hub install bert_multi_cased_L-12_H-768_A-12)
-module = hub.Module(name="bert_multi_cased_L-12_H-768_A-12")
-inputs, outputs, program = module.context(trainable=True, max_seq_len=128)
-
-# All input tensors required by the bert_multi_cased_L-12_H-768_A-12 module must be fed
-input_ids = inputs["input_ids"]
-position_ids = inputs["position_ids"]
-segment_ids = inputs["segment_ids"]
-input_mask = inputs["input_mask"]
-
-# Use "pooled_output" for sentence-level output.
-pooled_output = outputs["pooled_output"]
-
-# Use "sequence_output" for token-level output.
-sequence_output = outputs["sequence_output"]
-
-# Use "get_embedding" to get embedding result.
-embedding_result = module.get_embedding(texts=[["Sample1_text_a"],["Sample2_text_a","Sample2_text_b"]], use_gpu=True)
-
-# Use "get_params_layer" to get the parameter-to-layer mapping and pass it to ULMFiTStrategy.
-params_layer = module.get_params_layer()
-strategy = hub.finetune.strategy.ULMFiTStrategy(frz_params_layer=params_layer, dis_params_layer=params_layer)
-```
-
-## Source code
-
-https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/pretrain_langauge_models/BERT
-
-
-
-## Dependencies
-
-paddlepaddle >= 1.6.2
-
-paddlehub >= 1.6.0
-
-
-## Release notes
-
-* 1.0.0
-
-  Initial release
-
-* 1.1.0
-
-  Added support for get_embedding and get_params_layer
--- a/modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/model/bert.py
+++ b/modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/model/bert.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""BERT model."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import six
-import json
-
-import paddle.fluid as fluid
-
-from bert_multi_cased_L_12_H_768_A_12.model.transformer_encoder import encoder, pre_process_layer
-
-
-class BertConfig(object):
-    def __init__(self, config_path):
-        self._config_dict = self._parse(config_path)
-
-    def _parse(self, config_path):
-        try:
-            with open(config_path) as json_file:
-                config_dict = json.load(json_file)
-        except Exception:
-            raise IOError("Error in parsing bert model config file '%s'" % config_path)
-        else:
-            return config_dict
-
-    def __getitem__(self, key):
-        return self._config_dict[key]
-
-    def print_config(self):
-        for arg, value in sorted(six.iteritems(self._config_dict)):
-            print('%s: %s' % (arg, value))
-        print('------------------------------------------------')
-
-
-class BertModel(object):
-    def __init__(self, src_ids, position_ids, sentence_ids, input_mask, config, weight_sharing=True, use_fp16=False):
-
-        self._emb_size = config['hidden_size']
-        self._n_layer = config['num_hidden_layers']
-        self._n_head = config['num_attention_heads']
-        self._voc_size = config['vocab_size']
-        self._max_position_seq_len = config['max_position_embeddings']
-        self._sent_types = config['type_vocab_size']
-        self._hidden_act = config['hidden_act']
-        self._prepostprocess_dropout = config['hidden_dropout_prob']
-        self._attention_dropout = config['attention_probs_dropout_prob']
-        self._weight_sharing = weight_sharing
-
-        self._word_emb_name = "word_embedding"
-        self._pos_emb_name = "pos_embedding"
-        self._sent_emb_name = "sent_embedding"
-        self._dtype = "float16" if use_fp16 else "float32"
-
-        # Initialize all weights by truncated normal initializer, and all biases
-        # will be initialized by constant zero by default.
-        self._param_initializer = fluid.initializer.TruncatedNormal(scale=config['initializer_range'])
-
-        self._build_model(src_ids, position_ids, sentence_ids, input_mask)
-
-    def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
-        # padding id in vocabulary must be set to 0
-        emb_out = fluid.layers.embedding(input=src_ids,
-                                         size=[self._voc_size, self._emb_size],
-                                         dtype=self._dtype,
-                                         param_attr=fluid.ParamAttr(name=self._word_emb_name,
-                                                                    initializer=self._param_initializer),
-                                         is_sparse=False)
-        position_emb_out = fluid.layers.embedding(input=position_ids,
-                                                  size=[self._max_position_seq_len, self._emb_size],
-                                                  dtype=self._dtype,
-                                                  param_attr=fluid.ParamAttr(name=self._pos_emb_name,
-                                                                             initializer=self._param_initializer))
-
-        sent_emb_out = fluid.layers.embedding(sentence_ids,
-                                              size=[self._sent_types, self._emb_size],
-                                              dtype=self._dtype,
-                                              param_attr=fluid.ParamAttr(name=self._sent_emb_name,
-                                                                         initializer=self._param_initializer))
-
-        emb_out = emb_out + position_emb_out
-        emb_out = emb_out + sent_emb_out
-
-        emb_out = pre_process_layer(emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
-
-        if self._dtype == "float16":
-            input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
-
-        self_attn_mask = fluid.layers.matmul(x=input_mask, y=input_mask, transpose_y=True)
-        self_attn_mask = fluid.layers.scale(x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
-        n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
-        n_head_self_attn_mask.stop_gradient = True
-
-        self._enc_out = encoder(enc_input=emb_out,
-                                attn_bias=n_head_self_attn_mask,
-                                n_layer=self._n_layer,
-                                n_head=self._n_head,
-                                d_key=self._emb_size // self._n_head,
-                                d_value=self._emb_size // self._n_head,
-                                d_model=self._emb_size,
-                                d_inner_hid=self._emb_size * 4,
-                                prepostprocess_dropout=self._prepostprocess_dropout,
-                                attention_dropout=self._attention_dropout,
-                                relu_dropout=0,
-                                hidden_act=self._hidden_act,
-                                preprocess_cmd="",
-                                postprocess_cmd="dan",
-                                param_initializer=self._param_initializer,
-                                name='encoder')
-
-    def get_sequence_output(self):
-        return self._enc_out
-
-    def get_pooled_output(self):
-        """Get the first feature of each sequence for classification"""
-
-        next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
-        next_sent_feat = fluid.layers.fc(input=next_sent_feat,
-                                         size=self._emb_size,
-                                         act="tanh",
-                                         param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
-                                                                    initializer=self._param_initializer),
-                                         bias_attr="pooled_fc.b_0")
-        return next_sent_feat
-
-    def get_pretraining_output(self, mask_label, mask_pos, labels):
-        """Get the loss & accuracy for pretraining"""
-
-        mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32')
-
-        # extract the first token feature in each sentence
-        next_sent_feat = self.get_pooled_output()
-        reshaped_emb_out = fluid.layers.reshape(x=self._enc_out, shape=[-1, self._emb_size])
-        # extract masked tokens' feature
-        mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
-
-        # transform: fc
-        mask_trans_feat = fluid.layers.fc(input=mask_feat,
-                                          size=self._emb_size,
-                                          act=self._hidden_act,
-                                          param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
-                                                                     initializer=self._param_initializer),
-                                          bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
-        # transform: layer norm
-        mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
-
-        mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
-                                                initializer=fluid.initializer.Constant(value=0.0))
-        if self._weight_sharing:
-            fc_out = fluid.layers.matmul(x=mask_trans_feat,
-                                         y=fluid.default_main_program().global_block().var(self._word_emb_name),
-                                         transpose_y=True)
-            fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
-                                                    dtype=self._dtype,
-                                                    attr=mask_lm_out_bias_attr,
-                                                    is_bias=True)
-
-        else:
-            fc_out = fluid.layers.fc(input=mask_trans_feat,
-                                     size=self._voc_size,
-                                     param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
-                                                                initializer=self._param_initializer),
-                                     bias_attr=mask_lm_out_bias_attr)
-
-        mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
-        mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
-
-        next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
-                                           size=2,
-                                           param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
-                                                                      initializer=self._param_initializer),
-                                           bias_attr="next_sent_fc.b_0")
-
-        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
-                                                                                    label=labels,
-                                                                                    return_softmax=True)
-
-        next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
-
-        mean_next_sent_loss = fluid.layers.mean(next_sent_loss)
-
-        loss = mean_next_sent_loss + mean_mask_lm_loss
-        return next_sent_acc, mean_mask_lm_loss, loss
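Note on the removed `_build_model`: it turns the padding mask into an additive attention bias by taking the outer product of the mask with itself and then applying `scale(..., scale=10000.0, bias=-1.0, bias_after_scale=False)`, which computes `(x + bias) * scale`. The plain-Python sketch below reproduces only that arithmetic for a single sequence. The function name `attention_bias_from_mask` and the list-based layout are illustrative assumptions, not part of the removed module.

```python
def attention_bias_from_mask(input_mask, n_head, scale=10000.0):
    """Build an additive attention bias for one sequence (illustrative helper).

    input_mask: list of 1.0 (real token) / 0.0 (padding), length seq_len.
    Returns a nested list shaped [n_head][seq_len][seq_len].
    """
    seq_len = len(input_mask)
    # Outer product: 1.0 only where both positions are real tokens.
    pair_mask = [[input_mask[i] * input_mask[j] for j in range(seq_len)]
                 for i in range(seq_len)]
    # Same arithmetic as scale(scale=10000.0, bias=-1.0, bias_after_scale=False):
    # (x + bias) * scale, giving 0.0 for valid pairs and -10000.0 otherwise.
    bias = [[(v - 1.0) * scale for v in row] for row in pair_mask]
    # Every head receives its own copy of the same bias values.
    return [[row[:] for row in bias] for _ in range(n_head)]


bias = attention_bias_from_mask([1.0, 1.0, 0.0], n_head=2)
for row in bias[0]:
    print(row)
```

Adding this bias to the attention logits before the softmax pushes the weight of any pair involving a padded position towards zero, while leaving pairs of real tokens untouched.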
--- a/modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/model/transformer_encoder.py
+++ b/modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/model/transformer_encoder.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Transformer encoder."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from functools import partial
-
-import paddle.fluid as fluid
-import paddle.fluid.layers as layers
-
-
-def multi_head_attention(queries,
-                         keys,
-                         values,
-                         attn_bias,
-                         d_key,
-                         d_value,
-                         d_model,
-                         n_head=1,
-                         dropout_rate=0.,
-                         cache=None,
-                         param_initializer=None,
-                         name='multi_head_att'):
-    """
-    Multi-Head Attention. Note that attn_bias is added to the logits before
-    computing the softmax activation to mask certain selected positions so that
-    they will not be considered in attention weights.
-    """
-    keys = queries if keys is None else keys
-    values = keys if values is None else values
-
-    if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
-        raise ValueError("Inputs: queries, keys and values should all be 3-D tensors.")
-
-    def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
-        """
-        Add linear projection to queries, keys, and values.
-        """
-        q = layers.fc(input=queries,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_query_fc.b_0')
-        k = layers.fc(input=keys,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_key_fc.b_0')
-        v = layers.fc(input=values,
-                      size=d_value * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_value_fc.b_0')
-        return q, k, v
-
-    def __split_heads(x, n_head):
-        """
-        Reshape the last dimension of input tensor x so that it becomes two
-        dimensions and then transpose. Specifically, input a tensor with shape
-        [bs, max_sequence_length, n_head * hidden_dim] then output a tensor
-        with shape [bs, n_head, max_sequence_length, hidden_dim].
-        """
-        hidden_size = x.shape[-1]
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
-
-        # permute the dimensions into:
-        # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
-        return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
-
-    def __combine_heads(x):
-        """
-        Transpose and then reshape the last two dimensions of input tensor x
-        so that it becomes one dimension, which is the reverse of __split_heads.
-        """
-        if len(x.shape) == 3: return x
-        if len(x.shape) != 4:
-            raise ValueError("Input(x) should be a 4-D Tensor.")
-
-        trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
-
-    def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
-        """
-        Scaled Dot-Product Attention
-        """
-        scaled_q = layers.scale(x=q, scale=d_key**-0.5)
-        product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
-        if attn_bias is not None:
-            product += attn_bias
-        weights = layers.softmax(product)
-        if dropout_rate:
-            weights = layers.dropout(weights,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-        out = layers.matmul(weights, v)
-        return out
-
-    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
-
-    if cache is not None:  # use cache and concat time steps
-        # Since the inplace reshape in __split_heads changes the shape of k and
-        # v, which is the cache input for next time step, reshape the cache
-        # input from the previous time step first.
-        k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
-        v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
-
-    q = __split_heads(q, n_head)
-    k = __split_heads(k, n_head)
-    v = __split_heads(v, n_head)
-
-    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
-
-    out = __combine_heads(ctx_multiheads)
-
-    # Project back to the model size.
-    proj_out = layers.fc(input=out,
-                         size=d_model,
-                         num_flatten_dims=2,
-                         param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
-                         bias_attr=name + '_output_fc.b_0')
-    return proj_out
-
-
-def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
-    """
-    Position-wise Feed-Forward Networks.
-    This module consists of two linear transformations with a hidden_act
-    activation in between, applied to each position separately and identically.
-    """
-    hidden = layers.fc(input=x,
-                       size=d_inner_hid,
-                       num_flatten_dims=2,
-                       act=hidden_act,
-                       param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
-                       bias_attr=name + '_fc_0.b_0')
-    if dropout_rate:
-        hidden = layers.dropout(hidden,
-                                dropout_prob=dropout_rate,
-                                dropout_implementation="upscale_in_train",
-                                is_test=False)
-    out = layers.fc(input=hidden,
-                    size=d_hid,
-                    num_flatten_dims=2,
-                    param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
-                    bias_attr=name + '_fc_1.b_0')
-    return out
-
-
-def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
-    """
-    Add residual connection, layer normalization and dropout to the out tensor
-    optionally according to the value of process_cmd.
-    This will be used before or after multi-head attention and position-wise
-    feed-forward networks.
-    """
-    for cmd in process_cmd:
-        if cmd == "a":  # add residual connection
-            out = out + prev_out if prev_out is not None else out
-        elif cmd == "n":  # add layer normalization
-            out_dtype = out.dtype
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float32")
-            out = layers.layer_norm(out,
-                                    begin_norm_axis=len(out.shape) - 1,
-                                    param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
-                                                               initializer=fluid.initializer.Constant(1.)),
-                                    bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
-                                                              initializer=fluid.initializer.Constant(0.)))
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float16")
-        elif cmd == "d":  # add dropout
-            if dropout_rate:
-                out = layers.dropout(out,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-    return out
-
-
-pre_process_layer = partial(pre_post_process_layer, None)
-post_process_layer = pre_post_process_layer
-
-
-def encoder_layer(enc_input,
-                  attn_bias,
-                  n_head,
-                  d_key,
-                  d_value,
-                  d_model,
-                  d_inner_hid,
-                  prepostprocess_dropout,
-                  attention_dropout,
-                  relu_dropout,
-                  hidden_act,
-                  preprocess_cmd="n",
-                  postprocess_cmd="da",
-                  param_initializer=None,
-                  name=''):
-    """The encoder layers that can be stacked to form a deep encoder.
-    This module consists of a multi-head (self) attention followed by
-    position-wise feed-forward networks, with both components accompanied
-    by the post_process_layer to add residual connection, layer normalization
-    and dropout.
-    """
-    attn_output = multi_head_attention(pre_process_layer(enc_input,
-                                                         preprocess_cmd,
-                                                         prepostprocess_dropout,
-                                                         name=name + '_pre_att'),
-                                       None,
-                                       None,
-                                       attn_bias,
-                                       d_key,
-                                       d_value,
-                                       d_model,
-                                       n_head,
-                                       attention_dropout,
-                                       param_initializer=param_initializer,
-                                       name=name + '_multi_head_att')
-    attn_output = post_process_layer(enc_input,
-                                     attn_output,
-                                     postprocess_cmd,
-                                     prepostprocess_dropout,
-                                     name=name + '_post_att')
-    ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
-                                                             preprocess_cmd,
-                                                             prepostprocess_dropout,
-                                                             name=name + '_pre_ffn'),
-                                           d_inner_hid,
-                                           d_model,
-                                           relu_dropout,
-                                           hidden_act,
-                                           param_initializer=param_initializer,
-                                           name=name + '_ffn')
-    return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
-
-
-def encoder(enc_input,
-            attn_bias,
-            n_layer,
-            n_head,
-            d_key,
-            d_value,
-            d_model,
-            d_inner_hid,
-            prepostprocess_dropout,
-            attention_dropout,
-            relu_dropout,
-            hidden_act,
-            preprocess_cmd="n",
-            postprocess_cmd="da",
-            param_initializer=None,
-            name=''):
-    """
-    The encoder is composed of a stack of identical layers returned by calling
-    encoder_layer.
-    """
-    for i in range(n_layer):
-        enc_output = encoder_layer(enc_input,
-                                   attn_bias,
-                                   n_head,
-                                   d_key,
-                                   d_value,
-                                   d_model,
-                                   d_inner_hid,
-                                   prepostprocess_dropout,
-                                   attention_dropout,
-                                   relu_dropout,
-                                   hidden_act,
-                                   preprocess_cmd,
-                                   postprocess_cmd,
-                                   param_initializer=param_initializer,
-                                   name=name + '_layer_' + str(i))
-        enc_input = enc_output
-    enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
-
-    return enc_output
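Note on the removed `pre_post_process_layer`: the `process_cmd` string is read one character at a time, so `"dan"` (the `postprocess_cmd` that `BertModel` passes to `encoder`) means dropout, then add the residual, then layer-normalize, and an empty `preprocess_cmd` makes the pre-processing step a no-op. The plain-Python sketch below shows that dispatch on a single feature vector. It assumes inference-style behaviour (dropout skipped) and a layer norm without the learnable scale and bias; `apply_process_cmd` and `layer_norm` are illustrative names, not the removed functions.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def apply_process_cmd(prev_out, out, process_cmd):
    """Apply each command character in order (illustrative helper)."""
    for cmd in process_cmd:
        if cmd == "a":  # add residual connection, if there is one
            if prev_out is not None:
                out = [o + p for o, p in zip(out, prev_out)]
        elif cmd == "n":  # layer normalization (scale=1, bias=0 assumed)
            out = layer_norm(out)
        elif cmd == "d":  # dropout: skipped in this inference-style sketch
            pass
    return out


# Post-processing as configured for BERT: dropout, add residual, normalize.
residual = [1.0, 2.0, 3.0]
sublayer_out = [0.5, 0.5, 0.5]
print(apply_process_cmd(residual, sublayer_out, "dan"))
```

Because the order of the characters decides the order of the operations, `"dan"` yields a post-norm layer, while the defaults `preprocess_cmd="n"` and `postprocess_cmd="da"` in the removed signature would yield a pre-norm layer.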
--- a/modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/module.py
+++ b/modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/module.py
-# coding:utf-8
-# Copyright (c) 2019  PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import os
-
-from paddlehub import TransformerModule
-from paddlehub.module.module import moduleinfo
-
-from bert_multi_cased_L_12_H_768_A_12.model.bert import BertConfig, BertModel
-
-
-@moduleinfo(
-    name="bert_multi_cased_L-12_H-768_A-12",
-    version="1.1.0",
-    summary="bert_multi_cased_L-12_H-768_A-12, 12-layer, 768-hidden, 12-heads, 110M parameters ",
-    author="paddlepaddle",
-    author_email="paddle-dev@baidu.com",
-    type="nlp/semantic_model",
-)
-class Bert(TransformerModule):
-    def _initialize(self):
-        self.MAX_SEQ_LEN = 512
-        self.params_path = os.path.join(self.directory, "assets", "params")
-        self.vocab_path = os.path.join(self.directory, "assets", "vocab.txt")
-
-        bert_config_path = os.path.join(self.directory, "assets", "bert_config.json")
-        self.bert_config = BertConfig(bert_config_path)
-
-    def net(self, input_ids, position_ids, segment_ids, input_mask):
-        """
-        create neural network.
-
-        Args:
-            input_ids (tensor): the word ids.
-            position_ids (tensor): the position ids.
-            segment_ids (tensor): the segment ids.
-            input_mask (tensor): the padding mask.
-
-        Returns:
-            pooled_output (tensor):  sentence-level output for classification task.
-            sequence_output (tensor): token-level output for sequence task.
-        """
-        bert = BertModel(src_ids=input_ids,
-                         position_ids=position_ids,
-                         sentence_ids=segment_ids,
-                         input_mask=input_mask,
-                         config=self.bert_config,
-                         use_fp16=False)
-        pooled_output = bert.get_pooled_output()
-        sequence_output = bert.get_sequence_output()
-        return pooled_output, sequence_output
-
-
-if __name__ == '__main__':
-    test_module = Bert()
--- a/modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/README.md
+++ b/modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/README.md
-```shell
-$ hub install bert_multi_uncased_L-12_H-768_A-12==1.0.0
-```
-<p align="center">
-<img src="https://bj.bcebos.com/paddlehub/paddlehub-img/bert_network.png"  hspace='10'/> <br />
-</p>
-
-For more details, see the [BERT paper](https://arxiv.org/abs/1810.04805).
-
-## API
-```python
-def context(
-    trainable=True,
-    max_seq_len=128
-)
-```
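-Returns the Module's context: its inputs, its outputs, and a copy of the pretrained Paddle Program.  
-
-**Parameters**  
-
-> trainable: when True, the Module's parameters are also updated during fine-tuning; otherwise they stay fixed.  
-> max_seq_len: the maximum sequence length of the BERT model. Shorter sequences are padded to **max_seq_len** and longer ones are truncated to **max_seq_len**. Valid values range from 0 to 512.  
-
-**Returns**  
-> inputs: a dict with the following fields:  
-> >**input_ids** holds the BERT-vocabulary word ids of the tokens in the tokenized input text; shape \[batch_size, max_seq_len\], dtype int64;  
-> >**position_ids** holds the position of each token within the tokenized input text; shape \[batch_size, max_seq_len\], dtype int64;  
-> >**segment_ids** identifies which text each token belongs to (text 1 or text 2); shape \[batch_size, max_seq_len\], dtype int64;  
-> >**input_mask** marks whether each token is padding; shape \[batch_size, max_seq_len\], dtype int64;  
->
-> outputs: a dict holding the Module's output features, with the following fields:  
-> >**pooled_output** holds sentence-level features, suitable for tasks such as text classification; shape \[batch_size, 768\], dtype float32;  
-> >**sequence_output** holds token-level features, suitable for tasks such as sequence labeling; shape \[batch_size, seq_len, 768\], dtype float32;  
->
-> program: the Program that contains the Module's computation graph.  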
-
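The padding and truncation behaviour described for `max_seq_len` can be illustrated with a minimal NumPy sketch. It is not the PaddleHub tokenizer: `pad_or_truncate` is a hypothetical helper, and the truncation is naive (a real tokenizer would normally keep the trailing separator token). The padding id of 0 follows the comment in the removed `bert.py`; the mask convention (1 for real tokens, 0 for padding) is an assumption consistent with how that file builds the attention bias.

```python
import numpy as np


def pad_or_truncate(token_ids, max_seq_len, pad_id=0):
    """Hypothetical helper: fit one token-id sequence to exactly max_seq_len."""
    ids = list(token_ids)[:max_seq_len]  # truncate sequences that are too long
    n_real = len(ids)
    n_pad = max_seq_len - n_real  # number of padding slots to append
    input_ids = np.array(ids + [pad_id] * n_pad, dtype=np.int64)
    # Assumed convention: 1 marks a real token, 0 marks padding.
    input_mask = np.array([1] * n_real + [0] * n_pad, dtype=np.int64)
    return input_ids, input_mask


short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_seq_len=6)
long_ids, long_mask = pad_or_truncate(range(1, 11), max_seq_len=4)
```

Batching several such rows with `np.stack` gives arrays of shape `[batch_size, max_seq_len]`, matching the shapes listed for `inputs`.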
-
-
-```python
-def get_embedding(
-    texts,
-    use_gpu=False,
-    batch_size=1
-)
-```
-
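-Returns the sentence-level and token-level features of the input texts.
-
-**Parameters**
-
-> texts: a list of input texts in the form \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\]. Each element is one sample, and each sample may contain text\_a and text\_b.
-> use_gpu: whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.
-
-**Returns**
-
-> results: a list in the form \[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\]. Each element is the feature output of the corresponding sample, and every sample has a sentence-level feature pooled\_feature and a token-level feature seq\_feature.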
->
-
-```python
-def get_params_layer()
-```
-
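-Returns per-layer parameter information. Used together with ULMFiTStrategy, it lets discriminative (per-layer) learning rates and gradual unfreezing be set strictly by layer index.
-
-**Parameters**
-
-> None
-
-**Returns**
-
-> params_layer: a dict whose keys are parameter names and whose values are the index of the layer each parameter belongs to.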
-
-**Code example**
-
-```python
-import paddlehub as hub
-
-# Load the pretrained model (install it first with: hub install bert_multi_uncased_L-12_H-768_A-12)
-module = hub.Module(name="bert_multi_uncased_L-12_H-768_A-12")
-inputs, outputs, program = module.context(trainable=True, max_seq_len=128)
-
-# All input tensors required by the bert_multi_uncased_L-12_H-768_A-12 module must be fed
-input_ids = inputs["input_ids"]
-position_ids = inputs["position_ids"]
-segment_ids = inputs["segment_ids"]
-input_mask = inputs["input_mask"]
-
-# Use "pooled_output" for sentence-level output.
-pooled_output = outputs["pooled_output"]
-
-# Use "sequence_output" for token-level output.
-sequence_output = outputs["sequence_output"]
-
-# Use "get_embedding" to get embedding result.
-embedding_result = module.get_embedding(texts=[["Sample1_text_a"],["Sample2_text_a","Sample2_text_b"]], use_gpu=True)
-
-# Use "get_params_layer" to get the parameter-to-layer mapping and pass it to ULMFiTStrategy.
-params_layer = module.get_params_layer()
-strategy = hub.finetune.strategy.ULMFiTStrategy(frz_params_layer=params_layer, dis_params_layer=params_layer)
-```
-
-## Source code
-
-https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/pretrain_langauge_models/BERT
-
-
-
-## Dependencies
-
-paddlepaddle >= 1.6.2
-
-paddlehub >= 1.6.0
-
-
-## Release history
-
-* 1.0.0
-
-  Initial release
--- a/modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/model/__init__.py
+++ b/modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/model/__init__.py
--- a/modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/model/bert.py
+++ b/modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/model/bert.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""BERT model."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import six
-import json
-
-import paddle.fluid as fluid
-
-from bert_multi_uncased_L_12_H_768_A_12.model.transformer_encoder import encoder, pre_process_layer
-
-
-class BertConfig(object):
-    def __init__(self, config_path):
-        self._config_dict = self._parse(config_path)
-
-    def _parse(self, config_path):
-        try:
-            with open(config_path) as json_file:
-                config_dict = json.load(json_file)
-        except Exception:
-            raise IOError("Error in parsing bert model config file '%s'" % config_path)
-        else:
-            return config_dict
-
-    def __getitem__(self, key):
-        return self._config_dict[key]
-
-    def print_config(self):
-        for arg, value in sorted(six.iteritems(self._config_dict)):
-            print('%s: %s' % (arg, value))
-        print('------------------------------------------------')
-
-
-class BertModel(object):
-    def __init__(self, src_ids, position_ids, sentence_ids, input_mask, config, weight_sharing=True, use_fp16=False):
-
-        self._emb_size = config['hidden_size']
-        self._n_layer = config['num_hidden_layers']
-        self._n_head = config['num_attention_heads']
-        self._voc_size = config['vocab_size']
-        self._max_position_seq_len = config['max_position_embeddings']
-        self._sent_types = config['type_vocab_size']
-        self._hidden_act = config['hidden_act']
-        self._prepostprocess_dropout = config['hidden_dropout_prob']
-        self._attention_dropout = config['attention_probs_dropout_prob']
-        self._weight_sharing = weight_sharing
-
-        self._word_emb_name = "word_embedding"
-        self._pos_emb_name = "pos_embedding"
-        self._sent_emb_name = "sent_embedding"
-        self._dtype = "float16" if use_fp16 else "float32"
-
-        # Initialize all weights with a truncated normal initializer; all biases
-        # are initialized to constant zero by default.
-        self._param_initializer = fluid.initializer.TruncatedNormal(scale=config['initializer_range'])
-
-        self._build_model(src_ids, position_ids, sentence_ids, input_mask)
-
-    def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
-        # padding id in vocabulary must be set to 0
-        emb_out = fluid.layers.embedding(input=src_ids,
-                                         size=[self._voc_size, self._emb_size],
-                                         dtype=self._dtype,
-                                         param_attr=fluid.ParamAttr(name=self._word_emb_name,
-                                                                    initializer=self._param_initializer),
-                                         is_sparse=False)
-        position_emb_out = fluid.layers.embedding(input=position_ids,
-                                                  size=[self._max_position_seq_len, self._emb_size],
-                                                  dtype=self._dtype,
-                                                  param_attr=fluid.ParamAttr(name=self._pos_emb_name,
-                                                                             initializer=self._param_initializer))
-
-        sent_emb_out = fluid.layers.embedding(sentence_ids,
-                                              size=[self._sent_types, self._emb_size],
-                                              dtype=self._dtype,
-                                              param_attr=fluid.ParamAttr(name=self._sent_emb_name,
-                                                                         initializer=self._param_initializer))
-
-        emb_out = emb_out + position_emb_out
-        emb_out = emb_out + sent_emb_out
-
-        emb_out = pre_process_layer(emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
-
-        if self._dtype == "float16":
-            input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
-
-        self_attn_mask = fluid.layers.matmul(x=input_mask, y=input_mask, transpose_y=True)
-        self_attn_mask = fluid.layers.scale(x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
-        n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
-        n_head_self_attn_mask.stop_gradient = True
-
-        self._enc_out = encoder(enc_input=emb_out,
-                                attn_bias=n_head_self_attn_mask,
-                                n_layer=self._n_layer,
-                                n_head=self._n_head,
-                                d_key=self._emb_size // self._n_head,
-                                d_value=self._emb_size // self._n_head,
-                                d_model=self._emb_size,
-                                d_inner_hid=self._emb_size * 4,
-                                prepostprocess_dropout=self._prepostprocess_dropout,
-                                attention_dropout=self._attention_dropout,
-                                relu_dropout=0,
-                                hidden_act=self._hidden_act,
-                                preprocess_cmd="",
-                                postprocess_cmd="dan",
-                                param_initializer=self._param_initializer,
-                                name='encoder')
-
-    def get_sequence_output(self):
-        return self._enc_out
-
-    def get_pooled_output(self):
-        """Get the first feature of each sequence for classification"""
-
-        next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
-        next_sent_feat = fluid.layers.fc(input=next_sent_feat,
-                                         size=self._emb_size,
-                                         act="tanh",
-                                         param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
-                                                                    initializer=self._param_initializer),
-                                         bias_attr="pooled_fc.b_0")
-        return next_sent_feat
-
-    def get_pretraining_output(self, mask_label, mask_pos, labels):
-        """Get the loss & accuracy for pretraining"""
-
-        mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32')
-
-        # extract the first token feature in each sentence
-        next_sent_feat = self.get_pooled_output()
-        reshaped_emb_out = fluid.layers.reshape(x=self._enc_out, shape=[-1, self._emb_size])
-        # extract masked tokens' feature
-        mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
-
-        # transform: fc
-        mask_trans_feat = fluid.layers.fc(input=mask_feat,
-                                          size=self._emb_size,
-                                          act=self._hidden_act,
-                                          param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
-                                                                     initializer=self._param_initializer),
-                                          bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
-        # transform: layer norm
-        mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
-
-        mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
-                                                initializer=fluid.initializer.Constant(value=0.0))
-        if self._weight_sharing:
-            fc_out = fluid.layers.matmul(x=mask_trans_feat,
-                                         y=fluid.default_main_program().global_block().var(self._word_emb_name),
-                                         transpose_y=True)
-            fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
-                                                    dtype=self._dtype,
-                                                    attr=mask_lm_out_bias_attr,
-                                                    is_bias=True)
-
-        else:
-            fc_out = fluid.layers.fc(input=mask_trans_feat,
-                                     size=self._voc_size,
-                                     param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
-                                                                initializer=self._param_initializer),
-                                     bias_attr=mask_lm_out_bias_attr)
-
-        mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
-        mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
-
-        next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
-                                           size=2,
-                                           param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
-                                                                      initializer=self._param_initializer),
-                                           bias_attr="next_sent_fc.b_0")
-
-        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
-                                                                                    label=labels,
-                                                                                    return_softmax=True)
-
-        next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
-
-        mean_next_sent_loss = fluid.layers.mean(next_sent_loss)
-
-        loss = mean_next_sent_loss + mean_mask_lm_loss
-        return next_sent_acc, mean_mask_lm_loss, loss
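The attention-mask arithmetic in the removed `_build_model` is easy to misread, so here is a minimal NumPy sketch of it. `fluid.layers.scale` with `bias_after_scale=False` computes `scale * (x + bias)`, so a pairwise mask value of 1 becomes a bias of 0 and a value of 0 becomes -10000. The sketch assumes the padding mask has a trailing dimension of size 1 (shape `[batch, seq_len, 1]`) so that the matmul yields a `seq_len × seq_len` matrix; `attention_bias` is a hypothetical name, not part of the removed code.

```python
import numpy as np


def attention_bias(input_mask, n_head):
    """Hypothetical NumPy re-statement of the mask-to-bias step in _build_model."""
    mask = np.asarray(input_mask, dtype=np.float32)  # assumed [batch, seq_len, 1]
    # Pairwise mask: 1 only where both the query and the key are real tokens.
    pair = np.matmul(mask, mask.transpose(0, 2, 1))  # [batch, seq_len, seq_len]
    # scale(x, scale=10000.0, bias=-1.0, bias_after_scale=False) == 10000 * (x - 1)
    bias = (pair - 1.0) * 10000.0
    # The same bias is repeated for every attention head.
    return np.stack([bias] * n_head, axis=1)  # [batch, n_head, seq_len, seq_len]


mask = np.array([[[1], [1], [0]]])  # one sequence, last position is padding
bias = attention_bias(mask, n_head=2)
```

Adding this bias to the attention logits before the softmax drives the weight of any pair that involves a padded position towards zero.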
--- a/modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/model/transformer_encoder.py
+++ b/modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/model/transformer_encoder.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Transformer encoder."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from functools import partial
-
-import paddle.fluid as fluid
-import paddle.fluid.layers as layers
-
-
-def multi_head_attention(queries,
-                         keys,
-                         values,
-                         attn_bias,
-                         d_key,
-                         d_value,
-                         d_model,
-                         n_head=1,
-                         dropout_rate=0.,
-                         cache=None,
-                         param_initializer=None,
-                         name='multi_head_att'):
-    """
-    Multi-Head Attention. Note that attn_bias is added to the logits before
-    computing the softmax activation to mask selected positions so that
-    they are not considered in the attention weights.
-    """
-    keys = queries if keys is None else keys
-    values = keys if values is None else values
-
-    if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
-        raise ValueError("Inputs: queries, keys and values should all be 3-D tensors.")
-
-    def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
-        """
-        Add linear projection to queries, keys, and values.
-        """
-        q = layers.fc(input=queries,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_query_fc.b_0')
-        k = layers.fc(input=keys,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_key_fc.b_0')
-        v = layers.fc(input=values,
-                      size=d_value * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_value_fc.b_0')
-        return q, k, v
-
-    def __split_heads(x, n_head):
-        """
-        Reshape the last dimension of input tensor x so that it becomes two
-        dimensions and then transpose. Specifically, given a tensor with shape
-        [bs, max_sequence_length, n_head * hidden_dim], output a tensor
-        with shape [bs, n_head, max_sequence_length, hidden_dim].
-        """
-        hidden_size = x.shape[-1]
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
-
-        # permute the dimensions into:
-        # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
-        return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
-
-    def __combine_heads(x):
-        """
-        Transpose and then reshape the last two dimensions of input tensor x
-        so that they become one dimension, which is the reverse of __split_heads.
-        """
-        if len(x.shape) == 3: return x
-        if len(x.shape) != 4:
-            raise ValueError("Input(x) should be a 4-D Tensor.")
-
-        trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
-
-    def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
-        """
-        Scaled Dot-Product Attention
-        """
-        scaled_q = layers.scale(x=q, scale=d_key**-0.5)
-        product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
-        if attn_bias:
-            product += attn_bias
-        weights = layers.softmax(product)
-        if dropout_rate:
-            weights = layers.dropout(weights,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-        out = layers.matmul(weights, v)
-        return out
-
-    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
-
-    if cache is not None:  # use cache and concat time steps
-        # Since the inplace reshape in __split_heads changes the shape of k and
-        # v, which is the cache input for next time step, reshape the cache
-        # input from the previous time step first.
-        k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
-        v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
-
-    q = __split_heads(q, n_head)
-    k = __split_heads(k, n_head)
-    v = __split_heads(v, n_head)
-
-    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
-
-    out = __combine_heads(ctx_multiheads)
-
-    # Project back to the model size.
-    proj_out = layers.fc(input=out,
-                         size=d_model,
-                         num_flatten_dims=2,
-                         param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
-                         bias_attr=name + '_output_fc.b_0')
-    return proj_out
-
-
-def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
-    """
-    Position-wise Feed-Forward Networks.
-    This module consists of two linear transformations with an activation
-    (hidden_act) in between, applied to each position separately and identically.
-    """
-    hidden = layers.fc(input=x,
-                       size=d_inner_hid,
-                       num_flatten_dims=2,
-                       act=hidden_act,
-                       param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
-                       bias_attr=name + '_fc_0.b_0')
-    if dropout_rate:
-        hidden = layers.dropout(hidden,
-                                dropout_prob=dropout_rate,
-                                dropout_implementation="upscale_in_train",
-                                is_test=False)
-    out = layers.fc(input=hidden,
-                    size=d_hid,
-                    num_flatten_dims=2,
-                    param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
-                    bias_attr=name + '_fc_1.b_0')
-    return out
-
-
-def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
-    """
-    Add residual connection, layer normalization and dropout to the out tensor
-    optionally according to the value of process_cmd.
-    This will be used before or after multi-head attention and position-wise
-    feed-forward networks.
-    """
-    for cmd in process_cmd:
-        if cmd == "a":  # add residual connection
-            out = out + prev_out if prev_out else out
-        elif cmd == "n":  # add layer normalization
-            out_dtype = out.dtype
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float32")
-            out = layers.layer_norm(out,
-                                    begin_norm_axis=len(out.shape) - 1,
-                                    param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
-                                                               initializer=fluid.initializer.Constant(1.)),
-                                    bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
-                                                              initializer=fluid.initializer.Constant(0.)))
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float16")
-        elif cmd == "d":  # add dropout
-            if dropout_rate:
-                out = layers.dropout(out,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-    return out
-
-
-pre_process_layer = partial(pre_post_process_layer, None)
-post_process_layer = pre_post_process_layer
-
-
-def encoder_layer(enc_input,
-                  attn_bias,
-                  n_head,
-                  d_key,
-                  d_value,
-                  d_model,
-                  d_inner_hid,
-                  prepostprocess_dropout,
-                  attention_dropout,
-                  relu_dropout,
-                  hidden_act,
-                  preprocess_cmd="n",
-                  postprocess_cmd="da",
-                  param_initializer=None,
-                  name=''):
-    """The encoder layers that can be stacked to form a deep encoder.
-    This module consists of a multi-head (self) attention followed by
-    position-wise feed-forward networks, both accompanied by the
-    post_process_layer to add residual connection, layer normalization
-    and dropout.
-    """
-    attn_output = multi_head_attention(pre_process_layer(enc_input,
-                                                         preprocess_cmd,
-                                                         prepostprocess_dropout,
-                                                         name=name + '_pre_att'),
-                                       None,
-                                       None,
-                                       attn_bias,
-                                       d_key,
-                                       d_value,
-                                       d_model,
-                                       n_head,
-                                       attention_dropout,
-                                       param_initializer=param_initializer,
-                                       name=name + '_multi_head_att')
-    attn_output = post_process_layer(enc_input,
-                                     attn_output,
-                                     postprocess_cmd,
-                                     prepostprocess_dropout,
-                                     name=name + '_post_att')
-    ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
-                                                             preprocess_cmd,
-                                                             prepostprocess_dropout,
-                                                             name=name + '_pre_ffn'),
-                                           d_inner_hid,
-                                           d_model,
-                                           relu_dropout,
-                                           hidden_act,
-                                           param_initializer=param_initializer,
-                                           name=name + '_ffn')
-    return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
-
-
-def encoder(enc_input,
-            attn_bias,
-            n_layer,
-            n_head,
-            d_key,
-            d_value,
-            d_model,
-            d_inner_hid,
-            prepostprocess_dropout,
-            attention_dropout,
-            relu_dropout,
-            hidden_act,
-            preprocess_cmd="n",
-            postprocess_cmd="da",
-            param_initializer=None,
-            name=''):
-    """
-    The encoder is composed of a stack of identical layers returned by calling
-    encoder_layer.
-    """
-    for i in range(n_layer):
-        enc_output = encoder_layer(enc_input,
-                                   attn_bias,
-                                   n_head,
-                                   d_key,
-                                   d_value,
-                                   d_model,
-                                   d_inner_hid,
-                                   prepostprocess_dropout,
-                                   attention_dropout,
-                                   relu_dropout,
-                                   hidden_act,
-                                   preprocess_cmd,
-                                   postprocess_cmd,
-                                   param_initializer=param_initializer,
-                                   name=name + '_layer_' + str(i))
-        enc_input = enc_output
-    enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
-
-    return enc_output
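For readers tracing the removed `multi_head_attention`, the head splitting and scaled dot-product step can be sketched in plain NumPy. This is an illustration, not the Paddle implementation: the learned linear projections and dropout are left out, and the softmax subtracts the row maximum for numerical stability, which the original code leaves to `layers.softmax`.

```python
import numpy as np


def split_heads(x, n_head):
    """[batch, seq_len, n_head * d] -> [batch, n_head, seq_len, d]."""
    batch, seq_len, hidden = x.shape
    return x.reshape(batch, seq_len, n_head, hidden // n_head).transpose(0, 2, 1, 3)


def combine_heads(x):
    """Inverse of split_heads: [batch, n_head, seq_len, d] -> [batch, seq_len, n_head * d]."""
    batch, n_head, seq_len, d = x.shape
    return x.transpose(0, 2, 1, 3).reshape(batch, seq_len, n_head * d)


def scaled_dot_product_attention(q, k, v, attn_bias=None):
    """softmax(q k^T / sqrt(d_key) + attn_bias) v, computed per head."""
    d_key = q.shape[-1]
    scores = np.matmul(q * d_key ** -0.5, k.transpose(0, 1, 3, 2))
    if attn_bias is not None:
        scores = scores + attn_bias  # large negative values mask positions out
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return np.matmul(weights, v)


x = np.random.RandomState(0).randn(1, 4, 8)
q = split_heads(x, n_head=2)
out = combine_heads(scaled_dot_product_attention(q, q, q))
```

Passing a bias of -10000 for a key position removes that position's contribution to every output row, which is how the padding mask takes effect.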
--- a/modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/module.py
+++ b/modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/module.py
-# coding:utf-8
-# Copyright (c) 2019  PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import os
-
-from paddlehub import TransformerModule
-from paddlehub.module.module import moduleinfo
-
-from bert_multi_uncased_L_12_H_768_A_12.model.bert import BertConfig, BertModel
-
-
-@moduleinfo(
-    name="bert_multi_uncased_L-12_H-768_A-12",
-    version="1.0.0",
-    summary="bert_multi_uncased_L-12_H-768_A-12, 12-layer, 768-hidden, 12-heads, 110M parameters ",
-    author="paddlepaddle",
-    author_email="paddle-dev@baidu.com",
-    type="nlp/semantic_model",
-)
-class Bert(TransformerModule):
-    def _initialize(self):
-        self.MAX_SEQ_LEN = 512
-        self.params_path = os.path.join(self.directory, "assets", "params")
-        self.vocab_path = os.path.join(self.directory, "assets", "vocab.txt")
-
-        bert_config_path = os.path.join(self.directory, "assets", "bert_config.json")
-        self.bert_config = BertConfig(bert_config_path)
-
-    def net(self, input_ids, position_ids, segment_ids, input_mask):
-        """
-        create neural network.
-
-        Args:
-            input_ids (tensor): the word ids.
-            position_ids (tensor): the position ids.
-            segment_ids (tensor): the segment ids.
-            input_mask (tensor): the padding mask.
-
-        Returns:
-            pooled_output (tensor):  sentence-level output for classification task.
-            sequence_output (tensor): token-level output for sequence task.
-        """
-        bert = BertModel(src_ids=input_ids,
-                         position_ids=position_ids,
-                         sentence_ids=segment_ids,
-                         input_mask=input_mask,
-                         config=self.bert_config,
-                         use_fp16=False)
-        pooled_output = bert.get_pooled_output()
-        sequence_output = bert.get_sequence_output()
-        return pooled_output, sequence_output
-
-
-if __name__ == '__main__':
-    test_module = Bert()
--- a/modules/text/language_model/bert_uncased_L_12_H_768_A_12/README.md
+++ b/modules/text/language_model/bert_uncased_L_12_H_768_A_12/README.md
-```shell
-$ hub install bert_uncased_L_12_H_768_A_12==1.1.0
-```
-<p align="center">
-<img src="https://bj.bcebos.com/paddlehub/paddlehub-img/bert_network.png"  hspace='10'/> <br />
-</p>
-
-For more details, see the [BERT paper](https://arxiv.org/abs/1810.04805).
-
-## API
-```python
-def context(
-    trainable=True,
-    max_seq_len=128
-)
-```
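-Returns the Module's context: its inputs, its outputs, and a copy of the pretrained Paddle Program.  
-
-**Parameters**  
-
-> trainable: when True, the Module's parameters are also updated during fine-tuning; otherwise they stay fixed.  
-> max_seq_len: the maximum sequence length of the BERT model. Shorter sequences are padded to **max_seq_len** and longer ones are truncated to **max_seq_len**. Valid values range from 0 to 512.  
-
-**Returns**  
-> inputs: a dict with the following fields:  
-> >**input_ids** holds the BERT-vocabulary word ids of the tokens in the tokenized input text; shape \[batch_size, max_seq_len\], dtype int64;  
-> >**position_ids** holds the position of each token within the tokenized input text; shape \[batch_size, max_seq_len\], dtype int64;  
-> >**segment_ids** identifies which text each token belongs to (text 1 or text 2); shape \[batch_size, max_seq_len\], dtype int64;  
-> >**input_mask** marks whether each token is padding; shape \[batch_size, max_seq_len\], dtype int64;  
->
-> outputs: a dict holding the Module's output features, with the following fields:  
-> >**pooled_output** holds sentence-level features, suitable for tasks such as text classification; shape \[batch_size, 768\], dtype float32;  
-> >**sequence_output** holds token-level features, suitable for tasks such as sequence labeling; shape \[batch_size, seq_len, 768\], dtype float32;  
->
-> program: the Program that contains the Module's computation graph.  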
-
-
-
-```python
-def get_embedding(
-    texts,
-    use_gpu=False,
-    batch_size=1
-)
-```
-
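-Returns the sentence-level and token-level features of the input texts.
-
-**Parameters**
-
-> texts: a list of input texts in the form \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\]. Each element is one sample, and each sample may contain text\_a and text\_b.
-> use_gpu: whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.
-
-**Returns**
-
-> results: a list in the form \[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\]. Each element is the feature output of the corresponding sample, and every sample has a sentence-level feature pooled\_feature and a token-level feature seq\_feature.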
->
-
-```python
-def get_params_layer()
-```
-
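-Returns parameter-layer information. Combined with ULMFiTStrategy, it lets you set per-layer (discriminative) learning rates and gradual unfreezing strictly by layer index.
-
-**Parameters**
-
-> None
-
-**Returns**
-
-> params_layer: a dict whose keys are parameter names and whose values are the index of the layer each parameter belongs to.
-
-**Code example**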
-
-```python
-import paddlehub as hub
-
-# Load the pretrained model (install it first with: hub install bert_uncased_L-12_H-768_A-12)
-module = hub.Module(name="bert_uncased_L-12_H-768_A-12")
-inputs, outputs, program = module.context(trainable=True, max_seq_len=128)
-
-# All of the module's input tensors must be fed
-input_ids = inputs["input_ids"]
-position_ids = inputs["position_ids"]
-segment_ids = inputs["segment_ids"]
-input_mask = inputs["input_mask"]
-
-# Use "pooled_output" for sentence-level output.
-pooled_output = outputs["pooled_output"]
-
-# Use "sequence_output" for token-level output.
-sequence_output = outputs["sequence_output"]
-
-# Use "get_embedding" to get embedding result.
-embedding_result = module.get_embedding(texts=[["Sample1_text_a"],["Sample2_text_a","Sample2_text_b"]], use_gpu=True)
-
-# Use "get_params_layer" to get the parameter layers for use with ULMFiTStrategy.
-params_layer = module.get_params_layer()
-strategy = hub.finetune.strategy.ULMFiTStrategy(frz_params_layer=params_layer, dis_params_layer=params_layer)
-```
-
-## View the code
-
-https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/pretrain_langauge_models/BERT
-
-
-
-## Dependencies
-
-paddlepaddle >= 1.6.2
-
-paddlehub >= 1.6.0
-
-
-## Release history
-
-* 1.0.0
-
-  Initial release
-
-* 1.1.0
-
-  Added support for get_embedding and get_params_layer
--- a/modules/text/language_model/bert_uncased_L_12_H_768_A_12/__init__.py
+++ b/modules/text/language_model/bert_uncased_L_12_H_768_A_12/__init__.py
--- a/modules/text/language_model/bert_uncased_L_12_H_768_A_12/model/__init__.py
+++ b/modules/text/language_model/bert_uncased_L_12_H_768_A_12/model/__init__.py
--- a/modules/text/language_model/bert_uncased_L_12_H_768_A_12/model/bert.py
+++ b/modules/text/language_model/bert_uncased_L_12_H_768_A_12/model/bert.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""BERT model."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import six
-import json
-
-import paddle.fluid as fluid
-
-from bert_uncased_L_12_H_768_A_12.model.transformer_encoder import encoder, pre_process_layer
-
-
-class BertConfig(object):
-    def __init__(self, config_path):
-        self._config_dict = self._parse(config_path)
-
-    def _parse(self, config_path):
-        try:
-            with open(config_path) as json_file:
-                config_dict = json.load(json_file)
-        except Exception:
-            raise IOError("Error in parsing bert model config file '%s'" % config_path)
-        else:
-            return config_dict
-
-    def __getitem__(self, key):
-        return self._config_dict[key]
-
-    def print_config(self):
-        for arg, value in sorted(six.iteritems(self._config_dict)):
-            print('%s: %s' % (arg, value))
-        print('------------------------------------------------')
-
-
-class BertModel(object):
-    def __init__(self, src_ids, position_ids, sentence_ids, input_mask, config, weight_sharing=True, use_fp16=False):
-
-        self._emb_size = config['hidden_size']
-        self._n_layer = config['num_hidden_layers']
-        self._n_head = config['num_attention_heads']
-        self._voc_size = config['vocab_size']
-        self._max_position_seq_len = config['max_position_embeddings']
-        self._sent_types = config['type_vocab_size']
-        self._hidden_act = config['hidden_act']
-        self._prepostprocess_dropout = config['hidden_dropout_prob']
-        self._attention_dropout = config['attention_probs_dropout_prob']
-        self._weight_sharing = weight_sharing
-
-        self._word_emb_name = "word_embedding"
-        self._pos_emb_name = "pos_embedding"
-        self._sent_emb_name = "sent_embedding"
-        self._dtype = "float16" if use_fp16 else "float32"
-
-        # Initialize all weights with a truncated normal initializer; all biases
-        # are initialized to constant zero by default.
-        self._param_initializer = fluid.initializer.TruncatedNormal(scale=config['initializer_range'])
-
-        self._build_model(src_ids, position_ids, sentence_ids, input_mask)
-
-    def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
-        # padding id in vocabulary must be set to 0
-        emb_out = fluid.layers.embedding(input=src_ids,
-                                         size=[self._voc_size, self._emb_size],
-                                         dtype=self._dtype,
-                                         param_attr=fluid.ParamAttr(name=self._word_emb_name,
-                                                                    initializer=self._param_initializer),
-                                         is_sparse=False)
-        position_emb_out = fluid.layers.embedding(input=position_ids,
-                                                  size=[self._max_position_seq_len, self._emb_size],
-                                                  dtype=self._dtype,
-                                                  param_attr=fluid.ParamAttr(name=self._pos_emb_name,
-                                                                             initializer=self._param_initializer))
-
-        sent_emb_out = fluid.layers.embedding(sentence_ids,
-                                              size=[self._sent_types, self._emb_size],
-                                              dtype=self._dtype,
-                                              param_attr=fluid.ParamAttr(name=self._sent_emb_name,
-                                                                         initializer=self._param_initializer))
-
-        emb_out = emb_out + position_emb_out
-        emb_out = emb_out + sent_emb_out
-
-        emb_out = pre_process_layer(emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
-
-        if self._dtype == "float16":
-            input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
-
-        self_attn_mask = fluid.layers.matmul(x=input_mask, y=input_mask, transpose_y=True)
-        self_attn_mask = fluid.layers.scale(x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
-        n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
-        n_head_self_attn_mask.stop_gradient = True
-
-        self._enc_out = encoder(enc_input=emb_out,
-                                attn_bias=n_head_self_attn_mask,
-                                n_layer=self._n_layer,
-                                n_head=self._n_head,
-                                d_key=self._emb_size // self._n_head,
-                                d_value=self._emb_size // self._n_head,
-                                d_model=self._emb_size,
-                                d_inner_hid=self._emb_size * 4,
-                                prepostprocess_dropout=self._prepostprocess_dropout,
-                                attention_dropout=self._attention_dropout,
-                                relu_dropout=0,
-                                hidden_act=self._hidden_act,
-                                preprocess_cmd="",
-                                postprocess_cmd="dan",
-                                param_initializer=self._param_initializer,
-                                name='encoder')
-
-    def get_sequence_output(self):
-        return self._enc_out
-
-    def get_pooled_output(self):
-        """Get the first feature of each sequence for classification"""
-
-        next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
-        next_sent_feat = fluid.layers.fc(input=next_sent_feat,
-                                         size=self._emb_size,
-                                         act="tanh",
-                                         param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
-                                                                    initializer=self._param_initializer),
-                                         bias_attr="pooled_fc.b_0")
-        return next_sent_feat
-
-    def get_pretraining_output(self, mask_label, mask_pos, labels):
-        """Get the loss & accuracy for pretraining"""
-
-        mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32')
-
-        # extract the first token feature in each sentence
-        next_sent_feat = self.get_pooled_output()
-        reshaped_emb_out = fluid.layers.reshape(x=self._enc_out, shape=[-1, self._emb_size])
-        # extract masked tokens' feature
-        mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
-
-        # transform: fc
-        mask_trans_feat = fluid.layers.fc(input=mask_feat,
-                                          size=self._emb_size,
-                                          act=self._hidden_act,
-                                          param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
-                                                                     initializer=self._param_initializer),
-                                          bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
-        # transform: layer norm
-        mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
-
-        mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
-                                                initializer=fluid.initializer.Constant(value=0.0))
-        if self._weight_sharing:
-            fc_out = fluid.layers.matmul(x=mask_trans_feat,
-                                         y=fluid.default_main_program().global_block().var(self._word_emb_name),
-                                         transpose_y=True)
-            fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
-                                                    dtype=self._dtype,
-                                                    attr=mask_lm_out_bias_attr,
-                                                    is_bias=True)
-
-        else:
-            fc_out = fluid.layers.fc(input=mask_trans_feat,
-                                     size=self._voc_size,
-                                     param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
-                                                                initializer=self._param_initializer),
-                                     bias_attr=mask_lm_out_bias_attr)
-
-        mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
-        mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
-
-        next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
-                                           size=2,
-                                           param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
-                                                                      initializer=self._param_initializer),
-                                           bias_attr="next_sent_fc.b_0")
-
-        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
-                                                                                    label=labels,
-                                                                                    return_softmax=True)
-
-        next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
-
-        mean_next_sent_loss = fluid.layers.mean(next_sent_loss)
-
-        loss = mean_next_sent_loss + mean_mask_lm_loss
-        return next_sent_acc, mean_mask_lm_loss, loss
--- a/modules/text/language_model/bert_uncased_L_12_H_768_A_12/model/transformer_encoder.py
+++ b/modules/text/language_model/bert_uncased_L_12_H_768_A_12/model/transformer_encoder.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Transformer encoder."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from functools import partial
-
-import paddle.fluid as fluid
-import paddle.fluid.layers as layers
-
-
-def multi_head_attention(queries,
-                         keys,
-                         values,
-                         attn_bias,
-                         d_key,
-                         d_value,
-                         d_model,
-                         n_head=1,
-                         dropout_rate=0.,
-                         cache=None,
-                         param_initializer=None,
-                         name='multi_head_att'):
-    """
-    Multi-Head Attention. Note that attn_bias is added to the logits before
-    computing the softmax activation to mask selected positions so that
-    they are not considered in the attention weights.
-    """
-    keys = queries if keys is None else keys
-    values = keys if values is None else values
-
-    if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
-        raise ValueError("Inputs: queries, keys and values should all be 3-D tensors.")
-
-    def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
-        """
-        Add linear projection to queries, keys, and values.
-        """
-        q = layers.fc(input=queries,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_query_fc.b_0')
-        k = layers.fc(input=keys,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_key_fc.b_0')
-        v = layers.fc(input=values,
-                      size=d_value * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_value_fc.b_0')
-        return q, k, v
-
-    def __split_heads(x, n_head):
-        """
-        Reshape the last dimension of input tensor x so that it becomes two
-        dimensions and then transpose. Specifically, input a tensor with shape
-        [bs, max_sequence_length, n_head * hidden_dim] then output a tensor
-        with shape [bs, n_head, max_sequence_length, hidden_dim].
-        """
-        hidden_size = x.shape[-1]
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
-
-        # permute the dimensions into:
-        # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
-        return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
-
-    def __combine_heads(x):
-        """
-        Transpose and then reshape the last two dimensions of input tensor x
-        so that they become one dimension, which is the reverse of __split_heads.
-        """
-        if len(x.shape) == 3: return x
-        if len(x.shape) != 4:
-            raise ValueError("Input(x) should be a 4-D Tensor.")
-
-        trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
-
-    def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
-        """
-        Scaled Dot-Product Attention
-        """
-        scaled_q = layers.scale(x=q, scale=d_key**-0.5)
-        product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
-        if attn_bias:
-            product += attn_bias
-        weights = layers.softmax(product)
-        if dropout_rate:
-            weights = layers.dropout(weights,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-        out = layers.matmul(weights, v)
-        return out
-
-    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
-
-    if cache is not None:  # use cache and concat time steps
-        # Since the inplace reshape in __split_heads changes the shape of k and
-        # v, which is the cache input for next time step, reshape the cache
-        # input from the previous time step first.
-        k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
-        v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
-
-    q = __split_heads(q, n_head)
-    k = __split_heads(k, n_head)
-    v = __split_heads(v, n_head)
-
-    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
-
-    out = __combine_heads(ctx_multiheads)
-
-    # Project back to the model size.
-    proj_out = layers.fc(input=out,
-                         size=d_model,
-                         num_flatten_dims=2,
-                         param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
-                         bias_attr=name + '_output_fc.b_0')
-    return proj_out
-
-
-def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
-    """
-    Position-wise Feed-Forward Networks.
-    This module consists of two linear transformations with an activation
-    (hidden_act) in between, applied to each position separately and identically.
-    """
-    hidden = layers.fc(input=x,
-                       size=d_inner_hid,
-                       num_flatten_dims=2,
-                       act=hidden_act,
-                       param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
-                       bias_attr=name + '_fc_0.b_0')
-    if dropout_rate:
-        hidden = layers.dropout(hidden,
-                                dropout_prob=dropout_rate,
-                                dropout_implementation="upscale_in_train",
-                                is_test=False)
-    out = layers.fc(input=hidden,
-                    size=d_hid,
-                    num_flatten_dims=2,
-                    param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
-                    bias_attr=name + '_fc_1.b_0')
-    return out
-
-
-def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
-    """
-    Add residual connection, layer normalization and dropout to the out tensor
-    optionally according to the value of process_cmd.
-    This will be used before or after multi-head attention and position-wise
-    feed-forward networks.
-    """
-    for cmd in process_cmd:
-        if cmd == "a":  # add residual connection
-            out = out + prev_out if prev_out else out
-        elif cmd == "n":  # add layer normalization
-            out_dtype = out.dtype
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float32")
-            out = layers.layer_norm(out,
-                                    begin_norm_axis=len(out.shape) - 1,
-                                    param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
-                                                               initializer=fluid.initializer.Constant(1.)),
-                                    bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
-                                                              initializer=fluid.initializer.Constant(0.)))
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float16")
-        elif cmd == "d":  # add dropout
-            if dropout_rate:
-                out = layers.dropout(out,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-    return out
-
-
-pre_process_layer = partial(pre_post_process_layer, None)
-post_process_layer = pre_post_process_layer
-
-
-def encoder_layer(enc_input,
-                  attn_bias,
-                  n_head,
-                  d_key,
-                  d_value,
-                  d_model,
-                  d_inner_hid,
-                  prepostprocess_dropout,
-                  attention_dropout,
-                  relu_dropout,
-                  hidden_act,
-                  preprocess_cmd="n",
-                  postprocess_cmd="da",
-                  param_initializer=None,
-                  name=''):
-    """The encoder layers that can be stacked to form a deep encoder.
-    This module consists of multi-head (self) attention followed by a
-    position-wise feed-forward network; both components are accompanied by
-    post_process_layer, which adds the residual connection, layer
-    normalization and dropout.
-    """
-    attn_output = multi_head_attention(pre_process_layer(enc_input,
-                                                         preprocess_cmd,
-                                                         prepostprocess_dropout,
-                                                         name=name + '_pre_att'),
-                                       None,
-                                       None,
-                                       attn_bias,
-                                       d_key,
-                                       d_value,
-                                       d_model,
-                                       n_head,
-                                       attention_dropout,
-                                       param_initializer=param_initializer,
-                                       name=name + '_multi_head_att')
-    attn_output = post_process_layer(enc_input,
-                                     attn_output,
-                                     postprocess_cmd,
-                                     prepostprocess_dropout,
-                                     name=name + '_post_att')
-    ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
-                                                             preprocess_cmd,
-                                                             prepostprocess_dropout,
-                                                             name=name + '_pre_ffn'),
-                                           d_inner_hid,
-                                           d_model,
-                                           relu_dropout,
-                                           hidden_act,
-                                           param_initializer=param_initializer,
-                                           name=name + '_ffn')
-    return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
-
-
-def encoder(enc_input,
-            attn_bias,
-            n_layer,
-            n_head,
-            d_key,
-            d_value,
-            d_model,
-            d_inner_hid,
-            prepostprocess_dropout,
-            attention_dropout,
-            relu_dropout,
-            hidden_act,
-            preprocess_cmd="n",
-            postprocess_cmd="da",
-            param_initializer=None,
-            name=''):
-    """
-    The encoder is composed of a stack of identical layers returned by calling
-    encoder_layer.
-    """
-    for i in range(n_layer):
-        enc_output = encoder_layer(enc_input,
-                                   attn_bias,
-                                   n_head,
-                                   d_key,
-                                   d_value,
-                                   d_model,
-                                   d_inner_hid,
-                                   prepostprocess_dropout,
-                                   attention_dropout,
-                                   relu_dropout,
-                                   hidden_act,
-                                   preprocess_cmd,
-                                   postprocess_cmd,
-                                   param_initializer=param_initializer,
-                                   name=name + '_layer_' + str(i))
-        enc_input = enc_output
-    enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
-
-    return enc_output
--- a/modules/text/language_model/bert_uncased_L_12_H_768_A_12/module.py
+++ b/modules/text/language_model/bert_uncased_L_12_H_768_A_12/module.py
-# coding:utf-8
-# Copyright (c) 2019  PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import os
-
-from paddlehub import TransformerModule
-from paddlehub.module.module import moduleinfo
-
-from bert_uncased_L_12_H_768_A_12.model.bert import BertConfig, BertModel
-
-
-@moduleinfo(
-    name="bert_uncased_L-12_H-768_A-12",
-    version="1.1.0",
-    summary="bert_uncased_L-12_H-768_A-12, 12-layer, 768-hidden, 12-heads, 110M parameters",
-    author="paddlepaddle",
-    author_email="paddle-dev@baidu.com",
-    type="nlp/semantic_model",
-)
-class Bert(TransformerModule):
-    def _initialize(self):
-        self.MAX_SEQ_LEN = 512
-        self.params_path = os.path.join(self.directory, "assets", "params")
-        self.vocab_path = os.path.join(self.directory, "assets", "vocab.txt")
-
-        bert_config_path = os.path.join(self.directory, "assets", "bert_config.json")
-        self.bert_config = BertConfig(bert_config_path)
-
-    def net(self, input_ids, position_ids, segment_ids, input_mask):
-        """
-        create neural network.
-
-        Args:
-            input_ids (tensor): the word ids.
-            position_ids (tensor): the position ids.
-            segment_ids (tensor): the segment ids.
-            input_mask (tensor): the padding mask.
-
-        Returns:
-            pooled_output (tensor):  sentence-level output for classification task.
-            sequence_output (tensor): token-level output for sequence task.
-        """
-        bert = BertModel(src_ids=input_ids,
-                         position_ids=position_ids,
-                         sentence_ids=segment_ids,
-                         input_mask=input_mask,
-                         config=self.bert_config,
-                         use_fp16=False)
-        pooled_output = bert.get_pooled_output()
-        sequence_output = bert.get_sequence_output()
-        return pooled_output, sequence_output
-
-
-if __name__ == '__main__':
-    test_module = Bert()
--- a/modules/text/language_model/bert_uncased_L_24_H_1024_A_16/README.md
+++ b/modules/text/language_model/bert_uncased_L_24_H_1024_A_16/README.md
-```shell
-$ hub install bert_uncased_L-24_H-1024_A-16==1.1.0
-```
-<p align="center">
-<img src="https://bj.bcebos.com/paddlehub/paddlehub-img/bert_network.png"  hspace='10'/> <br />
-</p>
-
-For more details, see the [BERT paper](https://arxiv.org/abs/1810.04805).
-
-## API
-```python
-def context(
-    trainable=True,
-    max_seq_len=128
-)
-```
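-Returns the Module's context: its inputs, its outputs, and a copy of the pretrained Paddle Program.  
-
-**Parameters**  
-
-> trainable: when True, the Module's parameters are also updated during fine-tuning; otherwise they stay frozen.  
-> max_seq_len: the maximum sequence length of the BERT model. Shorter sequences are padded to **max_seq_len** and longer ones are truncated to **max_seq_len**. Valid values range from 0 to 512.  
-
-**Returns**  
-> inputs: a dict with the following fields:  
-> >**input_ids** holds the BERT-vocabulary word id of each token in the tokenized input text; shape \[batch_size, max_seq_len\], dtype int64.  
-> >**position_ids** holds the position of each token within the tokenized input text; shape \[batch_size, max_seq_len\], dtype int64.  
-> >**segment_ids** identifies which text each token belongs to (text 1 or text 2); shape \[batch_size, max_seq_len\], dtype int64.  
-> >**input_mask** marks whether each token is padding; shape \[batch_size, max_seq_len\], dtype int64.  
->
-> outputs: a dict of the Module's output features, with the following fields:  
-> >**pooled_output** holds sentence-level features, usable for tasks such as text classification; shape \[batch_size, 1024\], dtype float32.  
-> >**sequence_output** holds token-level features, usable for tasks such as sequence labeling; shape \[batch_size, seq_len, 1024\], dtype float32.  
->
-> program: the Program containing the Module's computation graph.  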
-
-
-
-```python
-def get_embedding(
-    texts,
-    use_gpu=False,
-    batch_size=1
-)
-```
-
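-Returns the sentence-level and token-level features of the input texts.
-
-**Parameters**
-
-> texts: a list of input texts in the form \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b.
-> use_gpu: whether to use the GPU; defaults to False. GPU users are advised to enable it.
-> batch_size: the number of samples per batch; defaults to 1.
-
-**Returns**
-
-> results: a list in the form \[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\], where each element is the feature output of the corresponding sample; each sample has a sentence-level feature pooled\_feature and a token-level feature seq\_feature.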
->
-
-```python
-def get_params_layer()
-```
-
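-Returns parameter-layer information. Combined with ULMFiTStrategy, it lets you set per-layer (discriminative) learning rates and gradual unfreezing strictly by layer index.
-
-**Parameters**
-
-> None
-
-**Returns**
-
-> params_layer: a dict whose keys are parameter names and whose values are the index of the layer each parameter belongs to.
-
-**Code example**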
-
-```python
-import paddlehub as hub
-
-# Load the pretrained model (install it first with: hub install bert_uncased_L-24_H-1024_A-16)
-module = hub.Module(name="bert_uncased_L-24_H-1024_A-16")
-inputs, outputs, program = module.context(trainable=True, max_seq_len=128)
-
-# All of the module's input tensors must be fed
-input_ids = inputs["input_ids"]
-position_ids = inputs["position_ids"]
-segment_ids = inputs["segment_ids"]
-input_mask = inputs["input_mask"]
-
-# Use "pooled_output" for sentence-level output.
-pooled_output = outputs["pooled_output"]
-
-# Use "sequence_output" for token-level output.
-sequence_output = outputs["sequence_output"]
-
-# Use "get_embedding" to get embedding result.
-embedding_result = module.get_embedding(texts=[["Sample1_text_a"],["Sample2_text_a","Sample2_text_b"]], use_gpu=True)
-
-# Use "get_params_layer" to get the parameter layers for use with ULMFiTStrategy.
-params_layer = module.get_params_layer()
-strategy = hub.finetune.strategy.ULMFiTStrategy(frz_params_layer=params_layer, dis_params_layer=params_layer)
-```
-
-## View Code
-
-https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/pretrain_langauge_models/BERT
-
-
-
-## Dependencies
-
-paddlepaddle >= 1.6.2
-
-paddlehub >= 1.6.0
-
-## Release Notes
-
-* 1.0.0
-
-  Initial release
-
-* 1.1.0
-
-  Added support for get_embedding and get_params_layer
--- a/modules/text/language_model/bert_uncased_L_24_H_1024_A_16/__init__.py
+++ b/modules/text/language_model/bert_uncased_L_24_H_1024_A_16/__init__.py
--- a/modules/text/language_model/bert_uncased_L_24_H_1024_A_16/model/__init__.py
+++ b/modules/text/language_model/bert_uncased_L_24_H_1024_A_16/model/__init__.py
--- a/modules/text/language_model/bert_uncased_L_24_H_1024_A_16/model/bert.py
+++ b/modules/text/language_model/bert_uncased_L_24_H_1024_A_16/model/bert.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""BERT model."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import six
-import json
-
-import paddle.fluid as fluid
-
-from bert_uncased_L_24_H_1024_A_16.model.transformer_encoder import encoder, pre_process_layer
-
-
-class BertConfig(object):
-    def __init__(self, config_path):
-        self._config_dict = self._parse(config_path)
-
-    def _parse(self, config_path):
-        try:
-            with open(config_path) as json_file:
-                config_dict = json.load(json_file)
-        except Exception:
-            raise IOError("Error in parsing bert model config file '%s'" % config_path)
-        else:
-            return config_dict
-
-    def __getitem__(self, key):
-        return self._config_dict[key]
-
-    def print_config(self):
-        for arg, value in sorted(six.iteritems(self._config_dict)):
-            print('%s: %s' % (arg, value))
-        print('------------------------------------------------')
-
-
-class BertModel(object):
-    def __init__(self, src_ids, position_ids, sentence_ids, input_mask, config, weight_sharing=True, use_fp16=False):
-
-        self._emb_size = config['hidden_size']
-        self._n_layer = config['num_hidden_layers']
-        self._n_head = config['num_attention_heads']
-        self._voc_size = config['vocab_size']
-        self._max_position_seq_len = config['max_position_embeddings']
-        self._sent_types = config['type_vocab_size']
-        self._hidden_act = config['hidden_act']
-        self._prepostprocess_dropout = config['hidden_dropout_prob']
-        self._attention_dropout = config['attention_probs_dropout_prob']
-        self._weight_sharing = weight_sharing
-
-        self._word_emb_name = "word_embedding"
-        self._pos_emb_name = "pos_embedding"
-        self._sent_emb_name = "sent_embedding"
-        self._dtype = "float16" if use_fp16 else "float32"
-
-        # Initialize all weights with a truncated normal initializer; all biases
-        # are initialized to constant zero by default.
-        self._param_initializer = fluid.initializer.TruncatedNormal(scale=config['initializer_range'])
-
-        self._build_model(src_ids, position_ids, sentence_ids, input_mask)
-
-    def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
-        # padding id in vocabulary must be set to 0
-        emb_out = fluid.layers.embedding(input=src_ids,
-                                         size=[self._voc_size, self._emb_size],
-                                         dtype=self._dtype,
-                                         param_attr=fluid.ParamAttr(name=self._word_emb_name,
-                                                                    initializer=self._param_initializer),
-                                         is_sparse=False)
-        position_emb_out = fluid.layers.embedding(input=position_ids,
-                                                  size=[self._max_position_seq_len, self._emb_size],
-                                                  dtype=self._dtype,
-                                                  param_attr=fluid.ParamAttr(name=self._pos_emb_name,
-                                                                             initializer=self._param_initializer))
-
-        sent_emb_out = fluid.layers.embedding(sentence_ids,
-                                              size=[self._sent_types, self._emb_size],
-                                              dtype=self._dtype,
-                                              param_attr=fluid.ParamAttr(name=self._sent_emb_name,
-                                                                         initializer=self._param_initializer))
-
-        emb_out = emb_out + position_emb_out
-        emb_out = emb_out + sent_emb_out
-
-        emb_out = pre_process_layer(emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
-
-        if self._dtype == "float16":
-            input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
-
-        self_attn_mask = fluid.layers.matmul(x=input_mask, y=input_mask, transpose_y=True)
-        self_attn_mask = fluid.layers.scale(x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
-        n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
-        n_head_self_attn_mask.stop_gradient = True
-
-        self._enc_out = encoder(enc_input=emb_out,
-                                attn_bias=n_head_self_attn_mask,
-                                n_layer=self._n_layer,
-                                n_head=self._n_head,
-                                d_key=self._emb_size // self._n_head,
-                                d_value=self._emb_size // self._n_head,
-                                d_model=self._emb_size,
-                                d_inner_hid=self._emb_size * 4,
-                                prepostprocess_dropout=self._prepostprocess_dropout,
-                                attention_dropout=self._attention_dropout,
-                                relu_dropout=0,
-                                hidden_act=self._hidden_act,
-                                preprocess_cmd="",
-                                postprocess_cmd="dan",
-                                param_initializer=self._param_initializer,
-                                name='encoder')
-
-    def get_sequence_output(self):
-        return self._enc_out
-
-    def get_pooled_output(self):
-        """Get the first feature of each sequence for classification"""
-
-        next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
-        next_sent_feat = fluid.layers.fc(input=next_sent_feat,
-                                         size=self._emb_size,
-                                         act="tanh",
-                                         param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
-                                                                    initializer=self._param_initializer),
-                                         bias_attr="pooled_fc.b_0")
-        return next_sent_feat
-
-    def get_pretraining_output(self, mask_label, mask_pos, labels):
-        """Get the loss & accuracy for pretraining"""
-
-        mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32')
-
-        # extract the first token feature in each sentence
-        next_sent_feat = self.get_pooled_output()
-        reshaped_emb_out = fluid.layers.reshape(x=self._enc_out, shape=[-1, self._emb_size])
-        # extract masked tokens' feature
-        mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
-
-        # transform: fc
-        mask_trans_feat = fluid.layers.fc(input=mask_feat,
-                                          size=self._emb_size,
-                                          act=self._hidden_act,
-                                          param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
-                                                                     initializer=self._param_initializer),
-                                          bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
-        # transform: layer norm
-        mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
-
-        mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
-                                                initializer=fluid.initializer.Constant(value=0.0))
-        if self._weight_sharing:
-            fc_out = fluid.layers.matmul(x=mask_trans_feat,
-                                         y=fluid.default_main_program().global_block().var(self._word_emb_name),
-                                         transpose_y=True)
-            fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
-                                                    dtype=self._dtype,
-                                                    attr=mask_lm_out_bias_attr,
-                                                    is_bias=True)
-
-        else:
-            fc_out = fluid.layers.fc(input=mask_trans_feat,
-                                     size=self._voc_size,
-                                     param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
-                                                                initializer=self._param_initializer),
-                                     bias_attr=mask_lm_out_bias_attr)
-
-        mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
-        mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
-
-        next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
-                                           size=2,
-                                           param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
-                                                                      initializer=self._param_initializer),
-                                           bias_attr="next_sent_fc.b_0")
-
-        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
-                                                                                    label=labels,
-                                                                                    return_softmax=True)
-
-        next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
-
-        mean_next_sent_loss = fluid.layers.mean(next_sent_loss)
-
-        loss = mean_next_sent_loss + mean_mask_lm_loss
-        return next_sent_acc, mean_mask_lm_loss, loss
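
The removed `_build_model` turns the padding mask into an additive attention bias: it takes the outer product of `input_mask` with itself and then calls `scale` with `bias_after_scale=False`, which computes `scale * (x + bias)`. The pure-Python sketch below reproduces that arithmetic for a single sequence. The helper name `attention_bias` is hypothetical and is not part of the module; it assumes the mask holds 1.0 for real tokens and 0.0 for padding.

```python
def attention_bias(input_mask, scale=10000.0, bias=-1.0):
    """Hypothetical helper: additive attention bias for one sequence.

    input_mask is assumed to hold 1.0 for real tokens and 0.0 for padding.
    """
    n = len(input_mask)
    # Outer product: 1.0 only where both the query and the key are real tokens.
    pair_mask = [[input_mask[i] * input_mask[j] for j in range(n)] for i in range(n)]
    # bias_after_scale=False means out = scale * (x + bias).
    return [[scale * (m + bias) for m in row] for row in pair_mask]


# Two real tokens followed by one padding position.
bias_matrix = attention_bias([1.0, 1.0, 0.0])
for row in bias_matrix:
    print(row)
```

Pairs of real tokens get a bias of 0, while any pair involving padding gets -10000, which drives the corresponding softmax weight towards zero once the bias is added to the attention logits.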
--- a/modules/text/language_model/bert_uncased_L_24_H_1024_A_16/model/transformer_encoder.py
+++ b/modules/text/language_model/bert_uncased_L_24_H_1024_A_16/model/transformer_encoder.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Transformer encoder."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from functools import partial
-
-import paddle.fluid as fluid
-import paddle.fluid.layers as layers
-
-
-def multi_head_attention(queries,
-                         keys,
-                         values,
-                         attn_bias,
-                         d_key,
-                         d_value,
-                         d_model,
-                         n_head=1,
-                         dropout_rate=0.,
-                         cache=None,
-                         param_initializer=None,
-                         name='multi_head_att'):
-    """
-    Multi-Head Attention. Note that attn_bias is added to the logits before
-    computing the softmax activation to mask selected positions so that
-    they are not considered in the attention weights.
-    """
-    keys = queries if keys is None else keys
-    values = keys if values is None else values
-
-    if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
-        raise ValueError("Inputs: queries, keys and values should all be 3-D tensors.")
-
-    def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
-        """
-        Add linear projection to queries, keys, and values.
-        """
-        q = layers.fc(input=queries,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_query_fc.b_0')
-        k = layers.fc(input=keys,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_key_fc.b_0')
-        v = layers.fc(input=values,
-                      size=d_value * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_value_fc.b_0')
-        return q, k, v
-
-    def __split_heads(x, n_head):
-        """
-        Reshape the last dimension of input tensor x so that it becomes two
-        dimensions and then transpose. Specifically, given a tensor with shape
-        [bs, max_sequence_length, n_head * hidden_dim], output a tensor
-        with shape [bs, n_head, max_sequence_length, hidden_dim].
-        """
-        hidden_size = x.shape[-1]
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
-
-        # permute the dimensions into:
-        # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
-        return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
-
-    def __combine_heads(x):
-        """
-        Transpose and then reshape the last two dimensions of input tensor x
-        so that they become one dimension, which is the reverse of __split_heads.
-        """
-        if len(x.shape) == 3: return x
-        if len(x.shape) != 4:
-            raise ValueError("Input(x) should be a 4-D Tensor.")
-
-        trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
-
-    def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
-        """
-        Scaled Dot-Product Attention
-        """
-        scaled_q = layers.scale(x=q, scale=d_key**-0.5)
-        product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
-        if attn_bias is not None:
-            product += attn_bias
-        weights = layers.softmax(product)
-        if dropout_rate:
-            weights = layers.dropout(weights,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-        out = layers.matmul(weights, v)
-        return out
-
-    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
-
-    if cache is not None:  # use cache and concat time steps
-        # Since the inplace reshape in __split_heads changes the shape of k and
-        # v, which is the cache input for next time step, reshape the cache
-        # input from the previous time step first.
-        k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
-        v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
-
-    q = __split_heads(q, n_head)
-    k = __split_heads(k, n_head)
-    v = __split_heads(v, n_head)
-
-    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
-
-    out = __combine_heads(ctx_multiheads)
-
-    # Project back to the model size.
-    proj_out = layers.fc(input=out,
-                         size=d_model,
-                         num_flatten_dims=2,
-                         param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
-                         bias_attr=name + '_output_fc.b_0')
-    return proj_out
-
-
-def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
-    """
-    Position-wise Feed-Forward Networks.
-    This module consists of two linear transformations with an activation
-    (hidden_act) in between, applied to each position separately and identically.
-    """
-    hidden = layers.fc(input=x,
-                       size=d_inner_hid,
-                       num_flatten_dims=2,
-                       act=hidden_act,
-                       param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
-                       bias_attr=name + '_fc_0.b_0')
-    if dropout_rate:
-        hidden = layers.dropout(hidden,
-                                dropout_prob=dropout_rate,
-                                dropout_implementation="upscale_in_train",
-                                is_test=False)
-    out = layers.fc(input=hidden,
-                    size=d_hid,
-                    num_flatten_dims=2,
-                    param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
-                    bias_attr=name + '_fc_1.b_0')
-    return out
-
-
-def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
-    """
-    Optionally add a residual connection, layer normalization and dropout to
-    the out tensor, according to the value of process_cmd.
-    This is used before or after multi-head attention and position-wise
-    feed-forward networks.
-    """
-    for cmd in process_cmd:
-        if cmd == "a":  # add residual connection
-            out = out + prev_out if prev_out is not None else out
-        elif cmd == "n":  # add layer normalization
-            out_dtype = out.dtype
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float32")
-            out = layers.layer_norm(out,
-                                    begin_norm_axis=len(out.shape) - 1,
-                                    param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
-                                                               initializer=fluid.initializer.Constant(1.)),
-                                    bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
-                                                              initializer=fluid.initializer.Constant(0.)))
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float16")
-        elif cmd == "d":  # add dropout
-            if dropout_rate:
-                out = layers.dropout(out,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-    return out
-
-
-pre_process_layer = partial(pre_post_process_layer, None)
-post_process_layer = pre_post_process_layer
-
-
-def encoder_layer(enc_input,
-                  attn_bias,
-                  n_head,
-                  d_key,
-                  d_value,
-                  d_model,
-                  d_inner_hid,
-                  prepostprocess_dropout,
-                  attention_dropout,
-                  relu_dropout,
-                  hidden_act,
-                  preprocess_cmd="n",
-                  postprocess_cmd="da",
-                  param_initializer=None,
-                  name=''):
-    """The encoder layers that can be stacked to form a deep encoder.
-    This module consists of multi-head (self-)attention followed by a
-    position-wise feed-forward network; each of the two components is
-    accompanied by post_process_layer, which adds the residual connection,
-    layer normalization and dropout.
-    """
-    attn_output = multi_head_attention(pre_process_layer(enc_input,
-                                                         preprocess_cmd,
-                                                         prepostprocess_dropout,
-                                                         name=name + '_pre_att'),
-                                       None,
-                                       None,
-                                       attn_bias,
-                                       d_key,
-                                       d_value,
-                                       d_model,
-                                       n_head,
-                                       attention_dropout,
-                                       param_initializer=param_initializer,
-                                       name=name + '_multi_head_att')
-    attn_output = post_process_layer(enc_input,
-                                     attn_output,
-                                     postprocess_cmd,
-                                     prepostprocess_dropout,
-                                     name=name + '_post_att')
-    ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
-                                                             preprocess_cmd,
-                                                             prepostprocess_dropout,
-                                                             name=name + '_pre_ffn'),
-                                           d_inner_hid,
-                                           d_model,
-                                           relu_dropout,
-                                           hidden_act,
-                                           param_initializer=param_initializer,
-                                           name=name + '_ffn')
-    return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
-
-
-def encoder(enc_input,
-            attn_bias,
-            n_layer,
-            n_head,
-            d_key,
-            d_value,
-            d_model,
-            d_inner_hid,
-            prepostprocess_dropout,
-            attention_dropout,
-            relu_dropout,
-            hidden_act,
-            preprocess_cmd="n",
-            postprocess_cmd="da",
-            param_initializer=None,
-            name=''):
-    """
-    The encoder is composed of a stack of identical layers returned by calling
-    encoder_layer.
-    """
-    for i in range(n_layer):
-        enc_output = encoder_layer(enc_input,
-                                   attn_bias,
-                                   n_head,
-                                   d_key,
-                                   d_value,
-                                   d_model,
-                                   d_inner_hid,
-                                   prepostprocess_dropout,
-                                   attention_dropout,
-                                   relu_dropout,
-                                   hidden_act,
-                                   preprocess_cmd,
-                                   postprocess_cmd,
-                                   param_initializer=param_initializer,
-                                   name=name + '_layer_' + str(i))
-        enc_input = enc_output
-    enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
-
-    return enc_output
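
The removed `scaled_dot_product_attention` helper scales the queries by `d_key ** -0.5`, adds `attn_bias` to the logits, applies softmax, and multiplies the weights by the values. As a rough illustration only, here is a single-head, pure-Python sketch of the same computation on nested lists. It leaves out dropout and the multi-head reshaping, and the standalone `softmax` helper is an assumption of this sketch rather than code from the module.

```python
import math


def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v, attn_bias=None):
    """Single-head attention on nested lists; dropout is omitted."""
    d_key = len(q[0])
    scale = d_key ** -0.5
    out = []
    for i, q_row in enumerate(q):
        # Scaled dot product of this query with every key.
        logits = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        if attn_bias is not None:
            # Large negative entries suppress masked key positions.
            logits = [logit + b for logit, b in zip(logits, attn_bias[i])]
        weights = softmax(logits)
        # Weighted sum of the value rows.
        out.append([sum(w * v_row[c] for w, v_row in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0]]
# Mask the second key for both queries, as a padding position would be.
masked = scaled_dot_product_attention(q, k, v, attn_bias=[[0.0, -10000.0]] * 2)
print(masked)
```

With the second key masked, both queries attend only to the first value row, which is the effect the -10000 bias is meant to have on padding positions.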
--- a/modules/text/language_model/bert_uncased_L_24_H_1024_A_16/module.py
+++ b/modules/text/language_model/bert_uncased_L_24_H_1024_A_16/module.py
-# coding:utf-8
-# Copyright (c) 2019  PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import os
-
-from paddlehub import TransformerModule
-from paddlehub.module.module import moduleinfo
-
-from bert_uncased_L_24_H_1024_A_16.model.bert import BertConfig, BertModel
-
-
-@moduleinfo(
-    name="bert_uncased_L-24_H-1024_A-16",
-    version="1.1.0",
-    summary="bert_uncased_L-24_H-1024_A-16, 24-layer, 1024-hidden, 16-heads, 340M parameters ",
-    author="paddlepaddle",
-    author_email="paddle-dev@baidu.com",
-    type="nlp/semantic_model",
-)
-class Bert(TransformerModule):
-    def _initialize(self):
-        self.MAX_SEQ_LEN = 512
-        self.params_path = os.path.join(self.directory, "assets", "params")
-        self.vocab_path = os.path.join(self.directory, "assets", "vocab.txt")
-
-        bert_config_path = os.path.join(self.directory, "assets", "bert_config.json")
-        self.bert_config = BertConfig(bert_config_path)
-
-    def net(self, input_ids, position_ids, segment_ids, input_mask):
-        """
-        create neural network.
-
-        Args:
-            input_ids (tensor): the word ids.
-            position_ids (tensor): the position ids.
-            segment_ids (tensor): the segment ids.
-            input_mask (tensor): the padding mask.
-
-        Returns:
-            pooled_output (tensor):  sentence-level output for classification task.
-            sequence_output (tensor): token-level output for sequence task.
-        """
-        bert = BertModel(src_ids=input_ids,
-                         position_ids=position_ids,
-                         sentence_ids=segment_ids,
-                         input_mask=input_mask,
-                         config=self.bert_config,
-                         use_fp16=False)
-        pooled_output = bert.get_pooled_output()
-        sequence_output = bert.get_sequence_output()
-        return pooled_output, sequence_output
-
-
-if __name__ == '__main__':
-    test_module = Bert()
--- a/modules/text/language_model/chinese_roberta_wwm_ext/README.md
+++ b/modules/text/language_model/chinese_roberta_wwm_ext/README.md
-```shell
-$ hub install chinese-roberta-wwm-ext==1.0.0
-```
-<p align="center">
-<img src="https://bj.bcebos.com/paddlehub/paddlehub-img/bert_network.png"  hspace='10'/> <br />
-</p>
-
-For more details, see the [RoBERTa paper](https://arxiv.org/abs/1907.11692) and the [Chinese-BERT-wwm technical report](https://arxiv.org/abs/1906.08101).
-
-## API
-```python
-def context(
-    trainable=True,
-    max_seq_len=128
-)
-```
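-Returns the Module's context: its inputs, its outputs, and a copy of the pretrained Paddle Program.  
-
-**Parameters**  
-
-> trainable: when True, the Module's parameters are also trained during fine-tuning; otherwise they stay fixed.  
-> max_seq_len: the maximum sequence length of the BERT model. Shorter sequences are padded to **max_seq_len** and longer ones are truncated to **max_seq_len**; valid values range from 0 to 512.  
-
-**Returns**  
-> inputs: a dict with the following fields:  
-> >**input_ids**: the word ids in the BERT vocabulary of each token of the tokenized input text, shape \[batch_size, max_seq_len\], dtype int64;  
-> >**position_ids**: the position of each token within its text after tokenization, shape \[batch_size, max_seq_len\], dtype int64;  
-> >**segment_ids**: the segment identifier of each token (whether the token belongs to text 1 or text 2), shape \[batch_size, max_seq_len\], dtype int64;  
-> >**input_mask**: flags marking whether each token is padding, shape \[batch_size, max_seq_len\], dtype int64;  
->
-> outputs: a dict holding the Module's output features, with the following fields:  
-> >**pooled_output**: sentence-level features, usable for tasks such as text classification, shape \[batch_size, 768\], dtype float32;  
-> >**sequence_output**: token-level features, usable for tasks such as sequence labeling, shape \[batch_size, seq_len, 768\], dtype float32;  
->
-> program: the Program containing the Module's computation graph.  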
-
-
-
-```python
-def get_embedding(
-    texts,
-    use_gpu=False,
-    batch_size=1
-)
-```
-
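-Returns the sentence-level and token-level features of the input texts.
-
-**Parameters**
-
-> texts: list of input texts in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b.
-> use_gpu: whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.
-
-**Returns**
-
-> results: a list in the format \[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\], where each element is the feature output of the corresponding sample; every sample has a sentence-level feature pooled\_feature and a token-level feature seq\_feature.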
->
-
-```python
-def get_params_layer()
-```
-
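-Returns the parameter layer information. Used together with ULMFiTStrategy, it allows per-layer learning rates and layer-by-layer unfreezing to be set strictly by layer depth.
-
-**Parameters**
-
-> None
-
-**Returns**
-
-> params_layer: a dict whose keys are parameter names and whose values are the layer each parameter belongs to
-
-**Code example**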
-
-```python
-import paddlehub as hub
-
-# Load the pretrained model (install it first with: $ hub install chinese-roberta-wwm-ext)
-module = hub.Module(name="chinese-roberta-wwm-ext")
-inputs, outputs, program = module.context(trainable=True, max_seq_len=128)
-
-# All input tensors required by the chinese-roberta-wwm-ext module must be fed
-input_ids = inputs["input_ids"]
-position_ids = inputs["position_ids"]
-segment_ids = inputs["segment_ids"]
-input_mask = inputs["input_mask"]
-
-# Use "pooled_output" for sentence-level output.
-pooled_output = outputs["pooled_output"]
-
-# Use "sequence_output" for token-level output.
-sequence_output = outputs["sequence_output"]
-
-# Use "get_embedding" to get embedding result.
-embedding_result = module.get_embedding(texts=[["Sample1_text_a"],["Sample2_text_a","Sample2_text_b"]], use_gpu=True)
-
-# Use "get_params_layer" to get the parameter-to-layer mapping for use with ULMFiTStrategy.
-params_layer = module.get_params_layer()
-strategy = hub.finetune.strategy.ULMFiTStrategy(frz_params_layer=params_layer, dis_params_layer=params_layer)
-```
-
-## View Code
-https://github.com/ymcui/Chinese-BERT-wwm
-
-
-## Contributors
-
-[ymcui](https://github.com/ymcui)
-
-## Dependencies
-
-paddlepaddle >= 1.6.2
-
-paddlehub >= 1.6.0
-
-## Release Notes
-
-* 1.0.0
-
-  Initial release
--- a/modules/text/language_model/chinese_roberta_wwm_ext/__init__.py
+++ b/modules/text/language_model/chinese_roberta_wwm_ext/__init__.py
--- a/modules/text/language_model/chinese_roberta_wwm_ext/model/__init__.py
+++ b/modules/text/language_model/chinese_roberta_wwm_ext/model/__init__.py
--- a/modules/text/language_model/chinese_roberta_wwm_ext/model/bert.py
+++ b/modules/text/language_model/chinese_roberta_wwm_ext/model/bert.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""BERT model."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import six
-import json
-
-import paddle.fluid as fluid
-
-from chinese_roberta_wwm_ext.model.transformer_encoder import encoder, pre_process_layer
-
-
-class BertConfig(object):
-    def __init__(self, config_path):
-        self._config_dict = self._parse(config_path)
-
-    def _parse(self, config_path):
-        try:
-            with open(config_path) as json_file:
-                config_dict = json.load(json_file)
-        except Exception:
-            raise IOError("Error in parsing bert model config file '%s'" % config_path)
-        else:
-            return config_dict
-
-    def __getitem__(self, key):
-        return self._config_dict[key]
-
-    def print_config(self):
-        for arg, value in sorted(six.iteritems(self._config_dict)):
-            print('%s: %s' % (arg, value))
-        print('------------------------------------------------')
-
-
-class BertModel(object):
-    def __init__(self, src_ids, position_ids, sentence_ids, input_mask, config, weight_sharing=True, use_fp16=False):
-
-        self._emb_size = config['hidden_size']
-        self._n_layer = config['num_hidden_layers']
-        self._n_head = config['num_attention_heads']
-        self._voc_size = config['vocab_size']
-        self._max_position_seq_len = config['max_position_embeddings']
-        self._sent_types = config['type_vocab_size']
-        self._hidden_act = config['hidden_act']
-        self._prepostprocess_dropout = config['hidden_dropout_prob']
-        self._attention_dropout = config['attention_probs_dropout_prob']
-        self._weight_sharing = weight_sharing
-
-        self._word_emb_name = "word_embedding"
-        self._pos_emb_name = "pos_embedding"
-        self._sent_emb_name = "sent_embedding"
-        self._dtype = "float16" if use_fp16 else "float32"
-
-        # Initialize all weights with a truncated normal initializer; all biases
-        # are initialized to constant zero by default.
-        self._param_initializer = fluid.initializer.TruncatedNormal(scale=config['initializer_range'])
-
-        self._build_model(src_ids, position_ids, sentence_ids, input_mask)
-
-    def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
-        # padding id in vocabulary must be set to 0
-        emb_out = fluid.layers.embedding(input=src_ids,
-                                         size=[self._voc_size, self._emb_size],
-                                         dtype=self._dtype,
-                                         param_attr=fluid.ParamAttr(name=self._word_emb_name,
-                                                                    initializer=self._param_initializer),
-                                         is_sparse=False)
-        position_emb_out = fluid.layers.embedding(input=position_ids,
-                                                  size=[self._max_position_seq_len, self._emb_size],
-                                                  dtype=self._dtype,
-                                                  param_attr=fluid.ParamAttr(name=self._pos_emb_name,
-                                                                             initializer=self._param_initializer))
-
-        sent_emb_out = fluid.layers.embedding(sentence_ids,
-                                              size=[self._sent_types, self._emb_size],
-                                              dtype=self._dtype,
-                                              param_attr=fluid.ParamAttr(name=self._sent_emb_name,
-                                                                         initializer=self._param_initializer))
-
-        emb_out = emb_out + position_emb_out
-        emb_out = emb_out + sent_emb_out
-
-        emb_out = pre_process_layer(emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
-
-        if self._dtype == "float16":
-            input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
-
-        self_attn_mask = fluid.layers.matmul(x=input_mask, y=input_mask, transpose_y=True)
-        self_attn_mask = fluid.layers.scale(x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
-        n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
-        n_head_self_attn_mask.stop_gradient = True
-
-        self._enc_out = encoder(enc_input=emb_out,
-                                attn_bias=n_head_self_attn_mask,
-                                n_layer=self._n_layer,
-                                n_head=self._n_head,
-                                d_key=self._emb_size // self._n_head,
-                                d_value=self._emb_size // self._n_head,
-                                d_model=self._emb_size,
-                                d_inner_hid=self._emb_size * 4,
-                                prepostprocess_dropout=self._prepostprocess_dropout,
-                                attention_dropout=self._attention_dropout,
-                                relu_dropout=0,
-                                hidden_act=self._hidden_act,
-                                preprocess_cmd="",
-                                postprocess_cmd="dan",
-                                param_initializer=self._param_initializer,
-                                name='encoder')
-
-    def get_sequence_output(self):
-        return self._enc_out
-
-    def get_pooled_output(self):
-        """Get the first feature of each sequence for classification"""
-
-        next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
-        next_sent_feat = fluid.layers.fc(input=next_sent_feat,
-                                         size=self._emb_size,
-                                         act="tanh",
-                                         param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
-                                                                    initializer=self._param_initializer),
-                                         bias_attr="pooled_fc.b_0")
-        return next_sent_feat
-
-    def get_pretraining_output(self, mask_label, mask_pos, labels):
-        """Get the loss & accuracy for pretraining"""
-
-        mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32')
-
-        # extract the first token feature in each sentence
-        next_sent_feat = self.get_pooled_output()
-        reshaped_emb_out = fluid.layers.reshape(x=self._enc_out, shape=[-1, self._emb_size])
-        # extract masked tokens' feature
-        mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
-
-        # transform: fc
-        mask_trans_feat = fluid.layers.fc(input=mask_feat,
-                                          size=self._emb_size,
-                                          act=self._hidden_act,
-                                          param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
-                                                                     initializer=self._param_initializer),
-                                          bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
-        # transform: layer norm
-        mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
-
-        mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
-                                                initializer=fluid.initializer.Constant(value=0.0))
-        if self._weight_sharing:
-            fc_out = fluid.layers.matmul(x=mask_trans_feat,
-                                         y=fluid.default_main_program().global_block().var(self._word_emb_name),
-                                         transpose_y=True)
-            fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
-                                                    dtype=self._dtype,
-                                                    attr=mask_lm_out_bias_attr,
-                                                    is_bias=True)
-
-        else:
-            fc_out = fluid.layers.fc(input=mask_trans_feat,
-                                     size=self._voc_size,
-                                     param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
-                                                                initializer=self._param_initializer),
-                                     bias_attr=mask_lm_out_bias_attr)
-
-        mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
-        mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
-
-        next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
-                                           size=2,
-                                           param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
-                                                                      initializer=self._param_initializer),
-                                           bias_attr="next_sent_fc.b_0")
-
-        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
-                                                                                    label=labels,
-                                                                                    return_softmax=True)
-
-        next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
-
-        mean_next_sent_loss = fluid.layers.mean(next_sent_loss)
-
-        loss = mean_next_sent_loss + mean_mask_lm_loss
-        return next_sent_acc, mean_mask_lm_loss, loss
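The padding-mask handling in `_build_model` above (the `matmul`, then `scale` with `bias_after_scale=False`, then `stack`) can be sketched in plain Python. This is only an illustration for a single sequence, not the module's code; the helper name `attention_bias_from_mask` and the sample mask are assumptions made for this example.

```python
def attention_bias_from_mask(input_mask, n_head, scale=10000.0):
    """Build the additive attention bias for one sequence.

    input_mask: list of floats, 1.0 for a real token and 0.0 for padding.
    Returns a nested list of shape [n_head][seq_len][seq_len].
    """
    seq_len = len(input_mask)
    # Outer product of the mask with itself: 1.0 only where both positions are real tokens.
    pair_mask = [[input_mask[i] * input_mask[j] for j in range(seq_len)] for i in range(seq_len)]
    # Mirrors scale(x, scale=10000.0, bias=-1.0, bias_after_scale=False), i.e. (x + bias) * scale.
    bias = [[(value - 1.0) * scale for value in row] for row in pair_mask]
    # Every attention head receives the same bias.
    return [bias for _ in range(n_head)]


mask = [1.0, 1.0, 0.0]  # hypothetical input: two real tokens followed by one padding slot
bias = attention_bias_from_mask(mask, n_head=2)
print(bias[0])
```

Pairs of real tokens get a bias of 0, while any pair involving padding gets -10000, which drives the corresponding softmax weight towards zero once the bias is added to the attention logits.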
--- a/modules/text/language_model/chinese_roberta_wwm_ext/model/transformer_encoder.py
+++ b/modules/text/language_model/chinese_roberta_wwm_ext/model/transformer_encoder.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Transformer encoder."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from functools import partial
-
-import paddle.fluid as fluid
-import paddle.fluid.layers as layers
-
-
-def multi_head_attention(queries,
-                         keys,
-                         values,
-                         attn_bias,
-                         d_key,
-                         d_value,
-                         d_model,
-                         n_head=1,
-                         dropout_rate=0.,
-                         cache=None,
-                         param_initializer=None,
-                         name='multi_head_att'):
-    """
-    Multi-Head Attention. Note that attn_bias is added to the logits before
-    computing the softmax activation to mask selected positions so that
-    they are not considered in the attention weights.
-    """
-    keys = queries if keys is None else keys
-    values = keys if values is None else values
-
-    if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
-        raise ValueError("Inputs: queries, keys and values should all be 3-D tensors.")
-
-    def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
-        """
-        Add linear projection to queries, keys, and values.
-        """
-        q = layers.fc(input=queries,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_query_fc.b_0')
-        k = layers.fc(input=keys,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_key_fc.b_0')
-        v = layers.fc(input=values,
-                      size=d_value * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_value_fc.b_0')
-        return q, k, v
-
-    def __split_heads(x, n_head):
-        """
-        Reshape the last dimension of input tensor x so that it becomes two
-        dimensions and then transpose. Specifically, input a tensor with shape
-        [bs, max_sequence_length, n_head * hidden_dim] then output a tensor
-        with shape [bs, n_head, max_sequence_length, hidden_dim].
-        """
-        hidden_size = x.shape[-1]
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
-
-        # permute the dimensions into:
-        # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
-        return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
-
-    def __combine_heads(x):
-        """
-        Transpose and then reshape the last two dimensions of input tensor x
-        so that it becomes one dimension, which is the reverse of __split_heads.
-        """
-        if len(x.shape) == 3: return x
-        if len(x.shape) != 4:
-            raise ValueError("Input(x) should be a 4-D Tensor.")
-
-        trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
-
-    def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
-        """
-        Scaled Dot-Product Attention
-        """
-        scaled_q = layers.scale(x=q, scale=d_key**-0.5)
-        product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
-        if attn_bias:
-            product += attn_bias
-        weights = layers.softmax(product)
-        if dropout_rate:
-            weights = layers.dropout(weights,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-        out = layers.matmul(weights, v)
-        return out
-
-    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
-
-    if cache is not None:  # use cache and concat time steps
-        # Since the inplace reshape in __split_heads changes the shape of k and
-        # v, which is the cache input for next time step, reshape the cache
-        # input from the previous time step first.
-        k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
-        v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
-
-    q = __split_heads(q, n_head)
-    k = __split_heads(k, n_head)
-    v = __split_heads(v, n_head)
-
-    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
-
-    out = __combine_heads(ctx_multiheads)
-
-    # Project back to the model size.
-    proj_out = layers.fc(input=out,
-                         size=d_model,
-                         num_flatten_dims=2,
-                         param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
-                         bias_attr=name + '_output_fc.b_0')
-    return proj_out
-
-
-def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
-    """
-    Position-wise Feed-Forward Networks.
-    This module consists of two linear transformations with an activation
-    (hidden_act) in between, applied to each position separately and identically.
-    """
-    hidden = layers.fc(input=x,
-                       size=d_inner_hid,
-                       num_flatten_dims=2,
-                       act=hidden_act,
-                       param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
-                       bias_attr=name + '_fc_0.b_0')
-    if dropout_rate:
-        hidden = layers.dropout(hidden,
-                                dropout_prob=dropout_rate,
-                                dropout_implementation="upscale_in_train",
-                                is_test=False)
-    out = layers.fc(input=hidden,
-                    size=d_hid,
-                    num_flatten_dims=2,
-                    param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
-                    bias_attr=name + '_fc_1.b_0')
-    return out
-
-
-def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
-    """
-    Add residual connection, layer normalization and dropout to the out tensor
-    optionally according to the value of process_cmd.
-    This will be used before or after multi-head attention and position-wise
-    feed-forward networks.
-    """
-    for cmd in process_cmd:
-        if cmd == "a":  # add residual connection
-            out = out + prev_out if prev_out else out
-        elif cmd == "n":  # add layer normalization
-            out_dtype = out.dtype
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float32")
-            out = layers.layer_norm(out,
-                                    begin_norm_axis=len(out.shape) - 1,
-                                    param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
-                                                               initializer=fluid.initializer.Constant(1.)),
-                                    bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
-                                                              initializer=fluid.initializer.Constant(0.)))
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float16")
-        elif cmd == "d":  # add dropout
-            if dropout_rate:
-                out = layers.dropout(out,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-    return out
-
-
-pre_process_layer = partial(pre_post_process_layer, None)
-post_process_layer = pre_post_process_layer
-
-
-def encoder_layer(enc_input,
-                  attn_bias,
-                  n_head,
-                  d_key,
-                  d_value,
-                  d_model,
-                  d_inner_hid,
-                  prepostprocess_dropout,
-                  attention_dropout,
-                  relu_dropout,
-                  hidden_act,
-                  preprocess_cmd="n",
-                  postprocess_cmd="da",
-                  param_initializer=None,
-                  name=''):
-    """The encoder layers that can be stacked to form a deep encoder.
-    This module consists of a multi-head (self) attention followed by
-    position-wise feed-forward networks, with both components accompanied
-    by the post_process_layer to add residual connection, layer normalization
-    and dropout.
-    """
-    attn_output = multi_head_attention(pre_process_layer(enc_input,
-                                                         preprocess_cmd,
-                                                         prepostprocess_dropout,
-                                                         name=name + '_pre_att'),
-                                       None,
-                                       None,
-                                       attn_bias,
-                                       d_key,
-                                       d_value,
-                                       d_model,
-                                       n_head,
-                                       attention_dropout,
-                                       param_initializer=param_initializer,
-                                       name=name + '_multi_head_att')
-    attn_output = post_process_layer(enc_input,
-                                     attn_output,
-                                     postprocess_cmd,
-                                     prepostprocess_dropout,
-                                     name=name + '_post_att')
-    ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
-                                                             preprocess_cmd,
-                                                             prepostprocess_dropout,
-                                                             name=name + '_pre_ffn'),
-                                           d_inner_hid,
-                                           d_model,
-                                           relu_dropout,
-                                           hidden_act,
-                                           param_initializer=param_initializer,
-                                           name=name + '_ffn')
-    return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
-
-
-def encoder(enc_input,
-            attn_bias,
-            n_layer,
-            n_head,
-            d_key,
-            d_value,
-            d_model,
-            d_inner_hid,
-            prepostprocess_dropout,
-            attention_dropout,
-            relu_dropout,
-            hidden_act,
-            preprocess_cmd="n",
-            postprocess_cmd="da",
-            param_initializer=None,
-            name=''):
-    """
-    The encoder is composed of a stack of identical layers returned by calling
-    encoder_layer.
-    """
-    for i in range(n_layer):
-        enc_output = encoder_layer(enc_input,
-                                   attn_bias,
-                                   n_head,
-                                   d_key,
-                                   d_value,
-                                   d_model,
-                                   d_inner_hid,
-                                   prepostprocess_dropout,
-                                   attention_dropout,
-                                   relu_dropout,
-                                   hidden_act,
-                                   preprocess_cmd,
-                                   postprocess_cmd,
-                                   param_initializer=param_initializer,
-                                   name=name + '_layer_' + str(i))
-        enc_input = enc_output
-    enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
-
-    return enc_output
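The `process_cmd` strings used in `transformer_encoder.py` above (for example `"dan"`, which `BertModel` passes as `postprocess_cmd`) are interpreted one character at a time. The following is a minimal plain-Python sketch of that dispatch on a single feature vector. It makes two simplifying assumptions that differ from the removed code: layer normalization has no learned scale or bias, and dropout is treated as the identity. The names `layer_norm` and `apply_process_cmd` are hypothetical.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def apply_process_cmd(prev_out, out, process_cmd):
    """Apply each command in order: 'a' = add residual, 'n' = layer norm, 'd' = dropout."""
    for cmd in process_cmd:
        if cmd == "a" and prev_out is not None:
            out = [o + p for o, p in zip(out, prev_out)]
        elif cmd == "n":
            out = layer_norm(out)
        elif cmd == "d":
            # Assumption: dropout is the identity here, to keep the sketch deterministic.
            pass
    return out


residual = [1.0, 2.0, 3.0]      # hypothetical sub-layer input
sublayer_out = [0.5, 0.5, 0.5]  # hypothetical sub-layer output
post_ln = apply_process_cmd(residual, sublayer_out, "dan")
print(post_ln)
```

With `"dan"` the residual is added before normalization (the post-LN arrangement), whereas an empty command string, as used for `preprocess_cmd`, leaves the tensor untouched.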
--- a/modules/text/language_model/chinese_roberta_wwm_ext/module.py
+++ b/modules/text/language_model/chinese_roberta_wwm_ext/module.py
-# coding:utf-8
-# Copyright (c) 2019  PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import os
-
-from paddlehub import TransformerModule
-from paddlehub.module.module import moduleinfo
-
-from chinese_roberta_wwm_ext.model.bert import BertConfig, BertModel
-
-
-@moduleinfo(
-    name="chinese-roberta-wwm-ext",
-    version="1.0.0",
-    summary="chinese-roberta-wwm-ext, 12-layer, 768-hidden, 12-heads, 110M parameters ",
-    author="ymcui",
-    author_email="ymcui@ir.hit.edu.cn",
-    type="nlp/semantic_model",
-)
-class BertWwm(TransformerModule):
-    def _initialize(self):
-        self.MAX_SEQ_LEN = 512
-        self.params_path = os.path.join(self.directory, "assets", "params")
-        self.vocab_path = os.path.join(self.directory, "assets", "vocab.txt")
-
-        bert_config_path = os.path.join(self.directory, "assets", "bert_config.json")
-        self.bert_config = BertConfig(bert_config_path)
-
-    def net(self, input_ids, position_ids, segment_ids, input_mask):
-        """
-        create neural network.
-
-        Args:
-            input_ids (tensor): the word ids.
-            position_ids (tensor): the position ids.
-            segment_ids (tensor): the segment ids.
-            input_mask (tensor): the padding mask.
-
-        Returns:
-            pooled_output (tensor):  sentence-level output for classification task.
-            sequence_output (tensor): token-level output for sequence task.
-        """
-        bert = BertModel(src_ids=input_ids,
-                         position_ids=position_ids,
-                         sentence_ids=segment_ids,
-                         input_mask=input_mask,
-                         config=self.bert_config,
-                         use_fp16=False)
-        pooled_output = bert.get_pooled_output()
-        sequence_output = bert.get_sequence_output()
-        return pooled_output, sequence_output
-
-
-if __name__ == '__main__':
-    test_module = BertWwm()
--- a/modules/text/language_model/chinese_roberta_wwm_ext_large/README.md
+++ b/modules/text/language_model/chinese_roberta_wwm_ext_large/README.md
-```shell
-$ hub install chinese-roberta-wwm-ext-large==1.0.0
-```
-<p align="center">
-<img src="https://bj.bcebos.com/paddlehub/paddlehub-img/bert_network.png"  hspace='10'/> <br />
-</p>
-
-For more details, see the [RoBERTa paper](https://arxiv.org/abs/1907.11692) and the [Chinese-BERT-wwm technical report](https://arxiv.org/abs/1906.08101).
-
-## API
-```python
-def context(
-    trainable=True,
-    max_seq_len=128
-)
-```
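-Returns the Module's context: its inputs, its outputs, and a copy of the pretrained Paddle Program.  
-
-**Parameters**  
-
-> trainable: when set to True, the Module's parameters are also updated during fine-tuning; otherwise they stay fixed.  
-> max_seq_len: the maximum sequence length of the BERT model. Shorter sequences are padded to **max_seq_len**, and longer sequences are truncated to **max_seq_len**. Valid values range from 0 to 512.  
-
-**Returns**  
-> inputs: a dict with the following fields:  
-> >**input_ids** holds the word id in the BERT vocabulary of each token of the tokenized input text; shape \[batch_size, max_seq_len\], dtype int64.  
-> >**position_ids** holds the position of each token within its tokenized input text; shape \[batch_size, max_seq_len\], dtype int64.  
-> >**segment_ids** holds the identifier of the text each token belongs to (text 1 or text 2); shape \[batch_size, max_seq_len\], dtype int64.  
-> >**input_mask** marks whether each token is padding; shape \[batch_size, max_seq_len\], dtype int64.  
->
-> outputs: a dict of the Module's output features with the following fields:  
-> >**pooled_output** holds sentence-level features, suitable for tasks such as text classification; shape \[batch_size, 1024\], dtype float32.  
-> >**sequence_output** holds token-level features, suitable for tasks such as sequence labeling; shape \[batch_size, seq_len, 1024\], dtype float32.  
->
-> program: the Program containing this Module's computation graph.  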
-
-
-
-```python
-def get_embedding(
-    texts,
-    use_gpu=False,
-    batch_size=1
-)
-```
-
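-Returns the sentence-level and token-level features of the input texts.
-
-**Parameters**
-
-> texts: a list of input texts in the form \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b.
-> use_gpu: whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.
-> batch_size: the number of samples per batch; defaults to 1.
-
-**Returns**
-
-> results: a list in the form \[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\], where each element is the feature output of the corresponding sample, consisting of the sentence-level feature pooled\_feature and the token-level feature seq\_feature.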
->
-
-```python
-def get_params_layer()
-```
-
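-Returns the parameter-to-layer mapping. Combined with ULMFiTStrategy, it allows layer-wise learning rates and gradual unfreezing to be set strictly by layer index.
-
-**Parameters**
-
-> None
-
-**Returns**
-
-> params_layer: a dict whose keys are parameter names and whose values are the index of the layer each parameter belongs to.
-
-**Code example**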
-
-```python
-import paddlehub as hub
-
-# Load the chinese-roberta-wwm-ext-large pretrained model (install it with: hub install chinese-roberta-wwm-ext-large)
-module = hub.Module(name="chinese-roberta-wwm-ext-large")
-inputs, outputs, program = module.context(trainable=True, max_seq_len=128)
-
-# All input tensors required by the chinese-roberta-wwm-ext-large module must be fed.
-input_ids = inputs["input_ids"]
-position_ids = inputs["position_ids"]
-segment_ids = inputs["segment_ids"]
-input_mask = inputs["input_mask"]
-
-# Use "pooled_output" for sentence-level output.
-pooled_output = outputs["pooled_output"]
-
-# Use "sequence_output" for token-level output.
-sequence_output = outputs["sequence_output"]
-
-# Use "get_embedding" to get embedding result.
-embedding_result = module.get_embedding(texts=[["Sample1_text_a"],["Sample2_text_a","Sample2_text_b"]], use_gpu=True)
-
-# Use "get_params_layer" to get the parameter-to-layer mapping, which is then passed to ULMFiTStrategy.
-params_layer = module.get_params_layer()
-strategy = hub.finetune.strategy.ULMFiTStrategy(frz_params_layer=params_layer, dis_params_layer=params_layer)
-```
-
-## Source code
-https://github.com/ymcui/Chinese-BERT-wwm
-
-
-## Contributors
-
-[ymcui](https://github.com/ymcui)
-
-## Dependencies
-
-paddlepaddle >= 1.6.2
-
-paddlehub >= 1.6.0
-
-
-## Release history
-
-* 1.0.0
-
-  Initial release
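The removed README above describes how `max_seq_len` pads or truncates each sequence and lists the four input fields. A rough plain-Python sketch of that bookkeeping for a single text follows. It is not the module's reader: the function name `build_inputs` and the token ids are made up for illustration, special tokens such as `[CLS]`/`[SEP]` are assumed to be present in the ids already, and filling padded positions with 0 in `position_ids` is an assumption.

```python
def build_inputs(token_ids, max_seq_len, pad_id=0):
    """Pad or truncate one tokenized text to max_seq_len and derive the four input fields."""
    ids = token_ids[:max_seq_len]  # truncate sequences longer than max_seq_len
    n_real = len(ids)
    n_pad = max_seq_len - n_real
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "position_ids": list(range(n_real)) + [0] * n_pad,
        "segment_ids": [0] * max_seq_len,          # single text: every token belongs to text 1
        "input_mask": [1] * n_real + [0] * n_pad,  # 1 = real token, 0 = padding
    }


example = build_inputs([101, 2769, 4263, 102], max_seq_len=6)  # hypothetical token ids
for field, values in example.items():
    print(field, values)
```

Every field ends up with exactly `max_seq_len` entries, which is what lets a batch be stacked into the `[batch_size, max_seq_len]` shapes listed in the README.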
--- a/modules/text/language_model/chinese_roberta_wwm_ext_large/__init__.py
+++ b/modules/text/language_model/chinese_roberta_wwm_ext_large/__init__.py
--- a/modules/text/language_model/chinese_roberta_wwm_ext_large/model/__init__.py
+++ b/modules/text/language_model/chinese_roberta_wwm_ext_large/model/__init__.py
--- a/modules/text/language_model/chinese_roberta_wwm_ext_large/model/bert.py
+++ b/modules/text/language_model/chinese_roberta_wwm_ext_large/model/bert.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""BERT model."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import six
-import json
-
-import paddle.fluid as fluid
-
-from chinese_roberta_wwm_ext_large.model.transformer_encoder import encoder, pre_process_layer
-
-
-class BertConfig(object):
-    def __init__(self, config_path):
-        self._config_dict = self._parse(config_path)
-
-    def _parse(self, config_path):
-        try:
-            with open(config_path) as json_file:
-                config_dict = json.load(json_file)
-        except Exception:
-            raise IOError("Error in parsing bert model config file '%s'" % config_path)
-        else:
-            return config_dict
-
-    def __getitem__(self, key):
-        return self._config_dict[key]
-
-    def print_config(self):
-        for arg, value in sorted(six.iteritems(self._config_dict)):
-            print('%s: %s' % (arg, value))
-        print('------------------------------------------------')
-
-
-class BertModel(object):
-    def __init__(self, src_ids, position_ids, sentence_ids, input_mask, config, weight_sharing=True, use_fp16=False):
-
-        self._emb_size = config['hidden_size']
-        self._n_layer = config['num_hidden_layers']
-        self._n_head = config['num_attention_heads']
-        self._voc_size = config['vocab_size']
-        self._max_position_seq_len = config['max_position_embeddings']
-        self._sent_types = config['type_vocab_size']
-        self._hidden_act = config['hidden_act']
-        self._prepostprocess_dropout = config['hidden_dropout_prob']
-        self._attention_dropout = config['attention_probs_dropout_prob']
-        self._weight_sharing = weight_sharing
-
-        self._word_emb_name = "word_embedding"
-        self._pos_emb_name = "pos_embedding"
-        self._sent_emb_name = "sent_embedding"
-        self._dtype = "float16" if use_fp16 else "float32"
-
-        # Initialize all weights by truncated normal initializer, and all biases
-        # will be initialized by constant zero by default.
-        self._param_initializer = fluid.initializer.TruncatedNormal(scale=config['initializer_range'])
-
-        self._build_model(src_ids, position_ids, sentence_ids, input_mask)
-
-    def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
-        # padding id in vocabulary must be set to 0
-        emb_out = fluid.layers.embedding(input=src_ids,
-                                         size=[self._voc_size, self._emb_size],
-                                         dtype=self._dtype,
-                                         param_attr=fluid.ParamAttr(name=self._word_emb_name,
-                                                                    initializer=self._param_initializer),
-                                         is_sparse=False)
-        position_emb_out = fluid.layers.embedding(input=position_ids,
-                                                  size=[self._max_position_seq_len, self._emb_size],
-                                                  dtype=self._dtype,
-                                                  param_attr=fluid.ParamAttr(name=self._pos_emb_name,
-                                                                             initializer=self._param_initializer))
-
-        sent_emb_out = fluid.layers.embedding(sentence_ids,
-                                              size=[self._sent_types, self._emb_size],
-                                              dtype=self._dtype,
-                                              param_attr=fluid.ParamAttr(name=self._sent_emb_name,
-                                                                         initializer=self._param_initializer))
-
-        emb_out = emb_out + position_emb_out
-        emb_out = emb_out + sent_emb_out
-
-        emb_out = pre_process_layer(emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
-
-        if self._dtype == "float16":
-            input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
-
-        self_attn_mask = fluid.layers.matmul(x=input_mask, y=input_mask, transpose_y=True)
-        self_attn_mask = fluid.layers.scale(x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
-        n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
-        n_head_self_attn_mask.stop_gradient = True
-
-        self._enc_out = encoder(enc_input=emb_out,
-                                attn_bias=n_head_self_attn_mask,
-                                n_layer=self._n_layer,
-                                n_head=self._n_head,
-                                d_key=self._emb_size // self._n_head,
-                                d_value=self._emb_size // self._n_head,
-                                d_model=self._emb_size,
-                                d_inner_hid=self._emb_size * 4,
-                                prepostprocess_dropout=self._prepostprocess_dropout,
-                                attention_dropout=self._attention_dropout,
-                                relu_dropout=0,
-                                hidden_act=self._hidden_act,
-                                preprocess_cmd="",
-                                postprocess_cmd="dan",
-                                param_initializer=self._param_initializer,
-                                name='encoder')
-
-    def get_sequence_output(self):
-        return self._enc_out
-
-    def get_pooled_output(self):
-        """Get the first feature of each sequence for classification"""
-
-        next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
-        next_sent_feat = fluid.layers.fc(input=next_sent_feat,
-                                         size=self._emb_size,
-                                         act="tanh",
-                                         param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
-                                                                    initializer=self._param_initializer),
-                                         bias_attr="pooled_fc.b_0")
-        return next_sent_feat
-
-    def get_pretraining_output(self, mask_label, mask_pos, labels):
-        """Get the loss & accuracy for pretraining"""
-
-        mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32')
-
-        # extract the first token feature in each sentence
-        next_sent_feat = self.get_pooled_output()
-        reshaped_emb_out = fluid.layers.reshape(x=self._enc_out, shape=[-1, self._emb_size])
-        # extract masked tokens' feature
-        mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
-
-        # transform: fc
-        mask_trans_feat = fluid.layers.fc(input=mask_feat,
-                                          size=self._emb_size,
-                                          act=self._hidden_act,
-                                          param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
-                                                                     initializer=self._param_initializer),
-                                          bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
-        # transform: layer norm
-        mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
-
-        mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
-                                                initializer=fluid.initializer.Constant(value=0.0))
-        if self._weight_sharing:
-            fc_out = fluid.layers.matmul(x=mask_trans_feat,
-                                         y=fluid.default_main_program().global_block().var(self._word_emb_name),
-                                         transpose_y=True)
-            fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
-                                                    dtype=self._dtype,
-                                                    attr=mask_lm_out_bias_attr,
-                                                    is_bias=True)
-
-        else:
-            fc_out = fluid.layers.fc(input=mask_trans_feat,
-                                     size=self._voc_size,
-                                     param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
-                                                                initializer=self._param_initializer),
-                                     bias_attr=mask_lm_out_bias_attr)
-
-        mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
-        mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
-
-        next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
-                                           size=2,
-                                           param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
-                                                                      initializer=self._param_initializer),
-                                           bias_attr="next_sent_fc.b_0")
-
-        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
-                                                                                    label=labels,
-                                                                                    return_softmax=True)
-
-        next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
-
-        mean_next_sent_loss = fluid.layers.mean(next_sent_loss)
-
-        loss = mean_next_sent_loss + mean_mask_lm_loss
-        return next_sent_acc, mean_mask_lm_loss, loss
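
Review note: the attention-bias construction in the removed `_build_model` (matmul, then scale, then stack) can be illustrated with a small pure-Python sketch. It assumes the padding mask holds 1.0 for real tokens and 0.0 for padding, and that `fluid.layers.scale` with `bias_after_scale=False` computes `scale * (x + bias)`. The helper name `build_attn_bias` is hypothetical and not part of the module; it handles one sequence rather than a batch.

```python
def build_attn_bias(input_mask, n_head, scale=10000.0, bias=-1.0):
    """Pure-Python stand-in for the matmul -> scale -> stack steps (hypothetical helper).

    input_mask: padding mask of one sequence, a list of 1.0 (token) / 0.0 (padding).
    Returns a nested list shaped [n_head][seq_len][seq_len].
    """
    # Outer product of the mask with itself: 1.0 only where both positions are real tokens.
    pair_mask = [[qi * kj for kj in input_mask] for qi in input_mask]
    # scale * (x + bias): 1.0 becomes 0.0 (attend), 0.0 becomes -10000.0 (suppressed by softmax).
    attn_bias = [[scale * (x + bias) for x in row] for row in pair_mask]
    # Every attention head shares the same bias matrix.
    return [attn_bias for _ in range(n_head)]


bias = build_attn_bias([1.0, 1.0, 0.0], n_head=2)
print(bias[0])
```

Adding a large negative number to the logits of padded positions drives their softmax weight to roughly zero, which is why the mask is expressed as an additive bias rather than a multiplicative one.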
--- a/modules/text/language_model/chinese_roberta_wwm_ext_large/model/transformer_encoder.py
+++ b/modules/text/language_model/chinese_roberta_wwm_ext_large/model/transformer_encoder.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Transformer encoder."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from functools import partial
-
-import paddle.fluid as fluid
-import paddle.fluid.layers as layers
-
-
-def multi_head_attention(queries,
-                         keys,
-                         values,
-                         attn_bias,
-                         d_key,
-                         d_value,
-                         d_model,
-                         n_head=1,
-                         dropout_rate=0.,
-                         cache=None,
-                         param_initializer=None,
-                         name='multi_head_att'):
-    """
-    Multi-Head Attention. Note that attn_bias is added to the logits before
-    computing the softmax activation to mask selected positions so that
-    they are not considered in the attention weights.
-    """
-    keys = queries if keys is None else keys
-    values = keys if values is None else values
-
-    if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
-        raise ValueError("Inputs: queries, keys and values should all be 3-D tensors.")
-
-    def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
-        """
-        Add linear projection to queries, keys, and values.
-        """
-        q = layers.fc(input=queries,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_query_fc.b_0')
-        k = layers.fc(input=keys,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_key_fc.b_0')
-        v = layers.fc(input=values,
-                      size=d_value * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_value_fc.b_0')
-        return q, k, v
-
-    def __split_heads(x, n_head):
-        """
-        Reshape the last dimension of input tensor x so that it becomes two
-        dimensions and then transpose. Specifically, input a tensor with shape
-        [bs, max_sequence_length, n_head * hidden_dim] then output a tensor
-        with shape [bs, n_head, max_sequence_length, hidden_dim].
-        """
-        hidden_size = x.shape[-1]
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
-
-        # permute the dimensions into:
-        # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
-        return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
-
-    def __combine_heads(x):
-        """
-        Transpose and then reshape the last two dimensions of input tensor x
-        so that it becomes one dimension, which is reverse to __split_heads.
-        """
-        if len(x.shape) == 3: return x
-        if len(x.shape) != 4:
-            raise ValueError("Input(x) should be a 4-D Tensor.")
-
-        trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
-
-    def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
-        """
-        Scaled Dot-Product Attention
-        """
-        scaled_q = layers.scale(x=q, scale=d_key**-0.5)
-        product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
-        if attn_bias:
-            product += attn_bias
-        weights = layers.softmax(product)
-        if dropout_rate:
-            weights = layers.dropout(weights,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-        out = layers.matmul(weights, v)
-        return out
-
-    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
-
-    if cache is not None:  # use cache and concat time steps
-        # Since the inplace reshape in __split_heads changes the shape of k and
-        # v, which is the cache input for next time step, reshape the cache
-        # input from the previous time step first.
-        k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
-        v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
-
-    q = __split_heads(q, n_head)
-    k = __split_heads(k, n_head)
-    v = __split_heads(v, n_head)
-
-    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
-
-    out = __combine_heads(ctx_multiheads)
-
-    # Project back to the model size.
-    proj_out = layers.fc(input=out,
-                         size=d_model,
-                         num_flatten_dims=2,
-                         param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
-                         bias_attr=name + '_output_fc.b_0')
-    return proj_out
-
-
-def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
-    """
-    Position-wise Feed-Forward Networks.
-    This module consists of two linear transformations with an activation
-    (hidden_act) in between, which is applied to each position separately and identically.
-    """
-    hidden = layers.fc(input=x,
-                       size=d_inner_hid,
-                       num_flatten_dims=2,
-                       act=hidden_act,
-                       param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
-                       bias_attr=name + '_fc_0.b_0')
-    if dropout_rate:
-        hidden = layers.dropout(hidden,
-                                dropout_prob=dropout_rate,
-                                dropout_implementation="upscale_in_train",
-                                is_test=False)
-    out = layers.fc(input=hidden,
-                    size=d_hid,
-                    num_flatten_dims=2,
-                    param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
-                    bias_attr=name + '_fc_1.b_0')
-    return out
-
-
-def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
-    """
-    Add residual connection, layer normalization and dropout to the out tensor
-    optionally according to the value of process_cmd.
-    This will be used before or after multi-head attention and position-wise
-    feed-forward networks.
-    """
-    for cmd in process_cmd:
-        if cmd == "a":  # add residual connection
-            out = out + prev_out if prev_out else out
-        elif cmd == "n":  # add layer normalization
-            out_dtype = out.dtype
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float32")
-            out = layers.layer_norm(out,
-                                    begin_norm_axis=len(out.shape) - 1,
-                                    param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
-                                                               initializer=fluid.initializer.Constant(1.)),
-                                    bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
-                                                              initializer=fluid.initializer.Constant(0.)))
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float16")
-        elif cmd == "d":  # add dropout
-            if dropout_rate:
-                out = layers.dropout(out,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-    return out
-
-
-pre_process_layer = partial(pre_post_process_layer, None)
-post_process_layer = pre_post_process_layer
-
-
-def encoder_layer(enc_input,
-                  attn_bias,
-                  n_head,
-                  d_key,
-                  d_value,
-                  d_model,
-                  d_inner_hid,
-                  prepostprocess_dropout,
-                  attention_dropout,
-                  relu_dropout,
-                  hidden_act,
-                  preprocess_cmd="n",
-                  postprocess_cmd="da",
-                  param_initializer=None,
-                  name=''):
-    """The encoder layers that can be stacked to form a deep encoder.
-    This module consists of a multi-head (self) attention followed by
-    position-wise feed-forward networks, and both components are accompanied
-    by the post_process_layer to add residual connection, layer normalization
-    and dropout.
-    """
-    attn_output = multi_head_attention(pre_process_layer(enc_input,
-                                                         preprocess_cmd,
-                                                         prepostprocess_dropout,
-                                                         name=name + '_pre_att'),
-                                       None,
-                                       None,
-                                       attn_bias,
-                                       d_key,
-                                       d_value,
-                                       d_model,
-                                       n_head,
-                                       attention_dropout,
-                                       param_initializer=param_initializer,
-                                       name=name + '_multi_head_att')
-    attn_output = post_process_layer(enc_input,
-                                     attn_output,
-                                     postprocess_cmd,
-                                     prepostprocess_dropout,
-                                     name=name + '_post_att')
-    ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
-                                                             preprocess_cmd,
-                                                             prepostprocess_dropout,
-                                                             name=name + '_pre_ffn'),
-                                           d_inner_hid,
-                                           d_model,
-                                           relu_dropout,
-                                           hidden_act,
-                                           param_initializer=param_initializer,
-                                           name=name + '_ffn')
-    return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
-
-
-def encoder(enc_input,
-            attn_bias,
-            n_layer,
-            n_head,
-            d_key,
-            d_value,
-            d_model,
-            d_inner_hid,
-            prepostprocess_dropout,
-            attention_dropout,
-            relu_dropout,
-            hidden_act,
-            preprocess_cmd="n",
-            postprocess_cmd="da",
-            param_initializer=None,
-            name=''):
-    """
-    The encoder is composed of a stack of identical layers returned by calling
-    encoder_layer.
-    """
-    for i in range(n_layer):
-        enc_output = encoder_layer(enc_input,
-                                   attn_bias,
-                                   n_head,
-                                   d_key,
-                                   d_value,
-                                   d_model,
-                                   d_inner_hid,
-                                   prepostprocess_dropout,
-                                   attention_dropout,
-                                   relu_dropout,
-                                   hidden_act,
-                                   preprocess_cmd,
-                                   postprocess_cmd,
-                                   param_initializer=param_initializer,
-                                   name=name + '_layer_' + str(i))
-        enc_input = enc_output
-    enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
-
-    return enc_output
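
Review note: the core of the removed `scaled_dot_product_attention` helper can be sketched without Paddle. The version below is a minimal single-head illustration over nested lists, not the module's implementation. It omits dropout and the head split/merge, and it uses an explicit `is not None` check where the removed code relies on the truthiness of `attn_bias`.

```python
import math


def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v, attn_bias=None):
    """Single-head attention; q, k: [seq_len][d_key], v: [seq_len][d_value]."""
    d_key = len(q[0])
    out = []
    for i, q_row in enumerate(q):
        # Scale by d_key ** -0.5, then take the dot product with every key row.
        logits = [sum(qx * kx for qx, kx in zip(q_row, k_row)) * d_key ** -0.5 for k_row in k]
        if attn_bias is not None:
            # Additive bias: large negative values mask out positions.
            logits = [l + b for l, b in zip(logits, attn_bias[i])]
        weights = softmax(logits)
        # Weighted sum of the value rows.
        out.append([sum(w * v_row[c] for w, v_row in zip(weights, v)) for c in range(len(v[0]))])
    return out


q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
# Mask the second key position for both queries.
masked = scaled_dot_product_attention(q, k, v, attn_bias=[[0.0, -10000.0], [0.0, -10000.0]])
print(masked)
```

With the second key masked, each output row collapses onto the first value row, which is the behaviour the `-10000.0` bias is meant to produce for padded positions.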
--- a/modules/text/language_model/chinese_roberta_wwm_ext_large/module.py
+++ b/modules/text/language_model/chinese_roberta_wwm_ext_large/module.py
-# coding:utf-8
-# Copyright (c) 2019  PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import os
-
-from paddlehub import TransformerModule
-from paddlehub.module.module import moduleinfo
-
-from chinese_roberta_wwm_ext_large.model.bert import BertConfig, BertModel
-
-
-@moduleinfo(
-    name="chinese-roberta-wwm-ext-large",
-    version="1.0.0",
-    summary="chinese-roberta-wwm-ext-large, 24-layer, 1024-hidden, 16-heads, 340M parameters ",
-    author="ymcui",
-    author_email="ymcui@ir.hit.edu.cn",
-    type="nlp/semantic_model",
-)
-class BertWwm(TransformerModule):
-    def _initialize(self):
-        self.MAX_SEQ_LEN = 512
-        self.params_path = os.path.join(self.directory, "assets", "params")
-        self.vocab_path = os.path.join(self.directory, "assets", "vocab.txt")
-
-        bert_config_path = os.path.join(self.directory, "assets", "bert_config.json")
-        self.bert_config = BertConfig(bert_config_path)
-
-    def net(self, input_ids, position_ids, segment_ids, input_mask):
-        """
-        create neural network.
-
-        Args:
-            input_ids (tensor): the word ids.
-            position_ids (tensor): the position ids.
-            segment_ids (tensor): the segment ids.
-            input_mask (tensor): the padding mask.
-
-        Returns:
-            pooled_output (tensor):  sentence-level output for classification task.
-            sequence_output (tensor): token-level output for sequence task.
-        """
-        bert = BertModel(src_ids=input_ids,
-                         position_ids=position_ids,
-                         sentence_ids=segment_ids,
-                         input_mask=input_mask,
-                         config=self.bert_config,
-                         use_fp16=False)
-        pooled_output = bert.get_pooled_output()
-        sequence_output = bert.get_sequence_output()
-        return pooled_output, sequence_output
-
-
-if __name__ == '__main__':
-    test_module = BertWwm()
--- a/modules/text/language_model/ernie/README.md
+++ b/modules/text/language_model/ernie/README.md
 ```shell
-$ hub install ernie==1.2.0
+$ hub install ernie==2.0.0
 ```
 ## Online Demo
 <a class="ant-btn large" href="https://aistudio.baidu.com/aistudio/projectDetail/79380" target="_blank">AI Studio Quick Start</a>
@@ -19,111 +19,131 @@ $ hub install ernie==1.2.0
 For more details, see the [ERNIE paper](https://arxiv.org/abs/1904.09223).

 ## API
+
 ```python
-def context(
-    trainable=True,
-    max_seq_len=128
-)
+def __init__(
+    task=None,
+    load_checkpoint=None,
+    label_map=None)
 ```
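-Retrieves the Module's context: its inputs, outputs, and a copy of the pretrained Paddle Program.  
-
-**Parameters**  

-> trainable: when set to True, the Module's parameters are also trained during fine-tuning; otherwise they stay fixed.  
-> max_seq_len: the maximum sequence length of the BERT model. Shorter sequences are padded to **max_seq_len** and longer sequences are truncated to **max_seq_len**; valid values range from 0 to 512.  
-
-**Returns**  
-> inputs: a dict with the following fields:  
-> >**input_ids** holds the word ids, in the BERT vocabulary, of the tokens produced by tokenizing the input text; shape \[batch_size, max_seq_len\], dtype int64;  
-> >**position_ids** holds the position of each token within its text; shape \[batch_size, max_seq_len\], dtype int64;  
-> >**segment_ids** identifies which text each token belongs to (text 1 or text 2); shape \[batch_size, max_seq_len\], dtype int64;  
-> >**input_mask** marks whether each token is padding; shape \[batch_size, max_seq_len\], dtype int64;  
->
-> outputs: a dict holding the Module's output features, with the following fields:  
-> >**pooled_output** holds sentence-level features, usable for tasks such as text classification; shape \[batch_size, 768\], dtype int64;  
-> >**sequence_output** holds token-level features, usable for tasks such as sequence labeling; shape \[batch_size, seq_len, 768\], dtype int64;  
->
-> program: the Program containing the Module's computation graph.  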
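+Creates a Module object (dynamic-graph version).

+**Parameters**

+* `task`: the task name; may be `sequence_classification`.
+* `load_checkpoint`: path to the model parameter file saved by training with the PaddleHub Fine-tune API.
+* `label_map`: the class-label mapping used at prediction time.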

 ```python
-def get_embedding(
-    texts,
-    use_gpu=False,
-    batch_size=1
-)
+def predict(
+    data,
+    max_seq_len=128,
+    batch_size=1,
+    use_gpu=False)
 ```

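-Returns sentence-level and token-level features for the input texts.
-
 **Parameters**

-> texts: list of input texts, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b.
-> use_gpu: whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.
+* `data`: the data to predict, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample
+    and each sample may contain text\_a and text\_b. The number of texts per sample (1 or 2) must match the number used in training.
+* `max_seq_len`: the maximum text length the model processes
+* `batch_size`: the batch size used by the model
+* `use_gpu`: whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.

 **Returns**

-> results: a list in the format \[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\], where each element is the feature output of the corresponding sample; each sample has a sentence-level feature pooled\_feature and a token-level feature seq\_feature.
->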
-
 ```python
-def get_params_layer()
+def get_embedding(
+    texts,
+    use_gpu=False
+)
 ```

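-Returns parameter-layer information. Used together with ULMFiTStrategy, it allows per-layer learning rates and layer-by-layer unfreezing to be set strictly by layer depth.
+Returns sentence-level and token-level features for the input texts.

 **Parameters**

-> None
+* `texts`: list of input texts, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b.
+* `use_gpu`: whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.

 **Returns**

-> params_layer: a dict whose keys are parameter names and whose values are the layer each parameter belongs to
+* `results`: a list in the format \[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\], where each element is the feature output of the corresponding sample; each sample has a sentence-level feature pooled\_feature and a token-level feature seq\_feature.
+

 **Code example**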

 ```python
 import paddlehub as hub

-# Load $ hub install ernie pretrained model
-module = hub.Module(name="ernie")
-inputs, outputs, program = module.context(trainable=True, max_seq_len=128)
+data = [
+    '这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般',
+    '怀着十分激动的心情放映，可是看着看着发现，在放映完毕后，出现一集米老鼠的动画片',
+    '作为老的四星酒店，房间依然很整洁，相当不错。机场接机服务很好，可以在车上办理入住手续，节省时间。',
+]
+label_map = {0: 'negative', 1: 'positive'}
+
+model = hub.Module(
+    name='ernie',
+    version='2.0.0',
+    task='sequence_classification',
+    load_checkpoint='/path/to/parameters',
+    label_map=label_map)
+results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False)
+for idx, text in enumerate(data):
+    print('Data: {} \t Label: {}'.format(text, results[idx]))
+```

-# Must feed all the tensor of ernie's module need
-input_ids = inputs["input_ids"]
-position_ids = inputs["position_ids"]
-segment_ids = inputs["segment_ids"]
-input_mask = inputs["input_mask"]
+See the PaddleHub text classification example: https://github.com/PaddlePaddle/PaddleHub/tree/release/v2.0.0-beta/demo/text_classification

-# Use "pooled_output" for sentence-level output.
-pooled_output = outputs["pooled_output"]
+## Serving Deployment

-# Use "sequence_output" for token-level output.
-sequence_output = outputs["sequence_output"]
+PaddleHub Serving can deploy an online service for obtaining pretrained word embeddings.

-# Use "get_embedding" to get embedding result.
-embedding_result = module.get_embedding(texts=[["Sample1_text_a"],["Sample2_text_a","Sample2_text_b"]], use_gpu=True)
+### Step1: Start PaddleHub Serving

-# Use "get_params_layer" to get params layer and used to ULMFiTStrategy.
-params_layer = module.get_params_layer()
-strategy = hub.finetune.strategy.ULMFiTStrategy(frz_params_layer=params_layer, dis_params_layer=params_layer)
+Run the start command:
+
+```shell
+$ hub serving start -m ernie
 ```
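-For fine-tuning examples that use this PaddleHub Module, see [text classification](https://github.com/PaddlePaddle/PaddleHub/tree/release/v1.2/demo/text-classification) and [sequence labeling](https://github.com/PaddlePaddle/PaddleHub/tree/release/v1.2/demo/sequence-labeling).

-`Note`: running this PaddleHub Module on a **GPU** is recommended. If GPU memory runs out, reduce **batch_size** or **max_seq_len**.
+This deploys a serving API for obtaining pretrained word embeddings; the default port is 8866.
+
+**NOTE:** To predict on a GPU, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise no setting is needed.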

+### Step2: Send a prediction request
+
+With the server configured, the few lines of code below send a prediction request and retrieve the result:
+
+```python
+import requests
+import json
+
+# Texts to predict; they are wrapped below in the dict {"texts": [text_1, text_2, ... ]}
+text = [["今天是个好日子", "天气预报说今天要下雨"], ["这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般"]]
+# The key names the argument of the prediction method that receives the text, here "texts"
+# For a local call this corresponds to module.get_embedding(texts=text)
+data = {"texts": text}
+# Send a POST request; the content type must be JSON
+url = "http://10.12.121.132:8866/predict/ernie"
+# Set the POST request headers to application/json
+headers = {"Content-Type": "application/json"}
+
+r = requests.post(url=url, headers=headers, data=json.dumps(data))
+print(r.json())
+```

 ## Source Code

 https://github.com/PaddlePaddle/ERNIE

-
 ## Dependencies

-paddlepaddle >= 1.6.2
+paddlepaddle >= 2.0.0

-paddlehub >= 1.6.0
+paddlehub >= 2.0.0

 ## Release Notes

@@ -146,3 +166,7 @@ paddlehub >= 1.6.0
 * 1.2.0

  Added support for get_embedding and get_params_layer
+
+* 2.0.0
+
+  Fully upgraded to the dynamic-graph version; the API has changed
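
Review note: the serving request described in Step2 of the ernie README can be factored into a small helper that only assembles the request, which makes the URL pattern and the `texts` key easy to check without a running server. This is a sketch under assumptions: `build_serving_request` is a hypothetical name, and the default host `127.0.0.1` is a placeholder for wherever `hub serving start -m ernie` is running.

```python
import json


def build_serving_request(texts, host="127.0.0.1", port=8866, module="ernie"):
    """Return (url, headers, body) for a PaddleHub Serving prediction call.

    Hypothetical helper for illustration; it does not send anything.
    """
    # URL pattern taken from the README example: /predict/<module name>.
    url = "http://{}:{}/predict/{}".format(host, port, module)
    headers = {"Content-Type": "application/json"}
    # The key "texts" matches the keyword argument of module.get_embedding.
    body = json.dumps({"texts": texts})
    return url, headers, body


url, headers, body = build_serving_request([["今天是个好日子", "天气预报说今天要下雨"]])
print(url)
```

The returned values can be passed straight to `requests.post(url=url, headers=headers, data=body)`, mirroring the call in the README.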
--- a/modules/text/language_model/ernie/__init__.py
+++ b/modules/text/language_model/ernie/__init__.py
--- a/modules/text/language_model/ernie/model/ernie.py
+++ b/modules/text/language_model/ernie/model/ernie.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Ernie model."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-from __future__ import unicode_literals
-from __future__ import absolute_import
-
-import json
-
-import six
-import paddle.fluid as fluid
-from io import open
-from paddlehub.common.logger import logger
-
-from ernie.model.transformer_encoder import encoder, pre_process_layer
-
-
-class ErnieConfig(object):
-    def __init__(self, config_path):
-        self._config_dict = self._parse(config_path)
-
-    def _parse(self, config_path):
-        try:
-            with open(config_path, 'r', encoding='utf8') as json_file:
-                config_dict = json.load(json_file)
-        except Exception:
-            raise IOError("Error in parsing Ernie model config file '%s'" % config_path)
-        else:
-            return config_dict
-
-    def __getitem__(self, key):
-        return self._config_dict.get(key, None)
-
-    def print_config(self):
-        for arg, value in sorted(six.iteritems(self._config_dict)):
-            logger.info('%s: %s' % (arg, value))
-        logger.info('------------------------------------------------')
-
-
-class ErnieModel(object):
-    def __init__(self, src_ids, position_ids, sentence_ids, input_mask, config, weight_sharing=True, use_fp16=False):
-
-        self._emb_size = config['hidden_size']
-        self._n_layer = config['num_hidden_layers']
-        self._n_head = config['num_attention_heads']
-        self._voc_size = config['vocab_size']
-        self._max_position_seq_len = config['max_position_embeddings']
-        self._sent_types = config['type_vocab_size']
-        self._hidden_act = config['hidden_act']
-        self._prepostprocess_dropout = config['hidden_dropout_prob']
-        self._attention_dropout = config['attention_probs_dropout_prob']
-        self._weight_sharing = weight_sharing
-
-        self._word_emb_name = "word_embedding"
-        self._pos_emb_name = "pos_embedding"
-        self._sent_emb_name = "sent_embedding"
-        self._dtype = "float16" if use_fp16 else "float32"
-
-        # Initialize all weights by truncated normal initializer, and all biases
-        # will be initialized by constant zero by default.
-        self._param_initializer = fluid.initializer.TruncatedNormal(scale=config['initializer_range'])
-
-        self._build_model(src_ids, position_ids, sentence_ids, input_mask)
-
-    def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
-        # padding id in vocabulary must be set to 0
-        emb_out = fluid.layers.embedding(input=src_ids,
-                                         size=[self._voc_size, self._emb_size],
-                                         dtype=self._dtype,
-                                         param_attr=fluid.ParamAttr(name=self._word_emb_name,
-                                                                    initializer=self._param_initializer),
-                                         is_sparse=False)
-        position_emb_out = fluid.layers.embedding(input=position_ids,
-                                                  size=[self._max_position_seq_len, self._emb_size],
-                                                  dtype=self._dtype,
-                                                  param_attr=fluid.ParamAttr(name=self._pos_emb_name,
-                                                                             initializer=self._param_initializer))
-
-        sent_emb_out = fluid.layers.embedding(sentence_ids,
-                                              size=[self._sent_types, self._emb_size],
-                                              dtype=self._dtype,
-                                              param_attr=fluid.ParamAttr(name=self._sent_emb_name,
-                                                                         initializer=self._param_initializer))
-
-        emb_out = emb_out + position_emb_out
-        emb_out = emb_out + sent_emb_out
-
-        emb_out = pre_process_layer(emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
-
-        if self._dtype == "float16":
-            input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
-        self_attn_mask = fluid.layers.matmul(x=input_mask, y=input_mask, transpose_y=True)
-
-        self_attn_mask = fluid.layers.scale(x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
-        n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
-        n_head_self_attn_mask.stop_gradient = True
-
-        self._enc_out = encoder(enc_input=emb_out,
-                                attn_bias=n_head_self_attn_mask,
-                                n_layer=self._n_layer,
-                                n_head=self._n_head,
-                                d_key=self._emb_size // self._n_head,
-                                d_value=self._emb_size // self._n_head,
-                                d_model=self._emb_size,
-                                d_inner_hid=self._emb_size * 4,
-                                prepostprocess_dropout=self._prepostprocess_dropout,
-                                attention_dropout=self._attention_dropout,
-                                relu_dropout=0,
-                                hidden_act=self._hidden_act,
-                                preprocess_cmd="",
-                                postprocess_cmd="dan",
-                                param_initializer=self._param_initializer,
-                                name='encoder')
-
-    def get_sequence_output(self):
-        return self._enc_out
-
-    def get_pooled_output(self):
-        """Get the first feature of each sequence for classification"""
-        next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
-        next_sent_feat = fluid.layers.fc(input=next_sent_feat,
-                                         size=self._emb_size,
-                                         act="tanh",
-                                         param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
-                                                                    initializer=self._param_initializer),
-                                         bias_attr="pooled_fc.b_0")
-        return next_sent_feat
-
-    def get_pretraining_output(self, mask_label, mask_pos, labels):
-        """Get the loss & accuracy for pretraining"""
-
-        mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32')
-
-        # extract the first token feature in each sentence
-        next_sent_feat = self.get_pooled_output()
-        reshaped_emb_out = fluid.layers.reshape(x=self._enc_out, shape=[-1, self._emb_size])
-        # extract masked tokens' feature
-        mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
-
-        # transform: fc
-        mask_trans_feat = fluid.layers.fc(input=mask_feat,
-                                          size=self._emb_size,
-                                          act=self._hidden_act,
-                                          param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
-                                                                     initializer=self._param_initializer),
-                                          bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
-        # transform: layer norm
-        mask_trans_feat = pre_process_layer(mask_trans_feat, 'n', name='mask_lm_trans')
-
-        mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
-                                                initializer=fluid.initializer.Constant(value=0.0))
-        if self._weight_sharing:
-            fc_out = fluid.layers.matmul(x=mask_trans_feat,
-                                         y=fluid.default_main_program().global_block().var(self._word_emb_name),
-                                         transpose_y=True)
-            fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
-                                                    dtype=self._dtype,
-                                                    attr=mask_lm_out_bias_attr,
-                                                    is_bias=True)
-
-        else:
-            fc_out = fluid.layers.fc(input=mask_trans_feat,
-                                     size=self._voc_size,
-                                     param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
-                                                                initializer=self._param_initializer),
-                                     bias_attr=mask_lm_out_bias_attr)
-
-        mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
-        mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
-
-        next_sent_fc_out = fluid.layers.fc(input=next_sent_feat,
-                                           size=2,
-                                           param_attr=fluid.ParamAttr(name="next_sent_fc.w_0",
-                                                                      initializer=self._param_initializer),
-                                           bias_attr="next_sent_fc.b_0")
-
-        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(logits=next_sent_fc_out,
-                                                                                    label=labels,
-                                                                                    return_softmax=True)
-
-        next_sent_acc = fluid.layers.accuracy(input=next_sent_softmax, label=labels)
-
-        mean_next_sent_loss = fluid.layers.mean(next_sent_loss)
-
-        loss = mean_next_sent_loss + mean_mask_lm_loss
-        return next_sent_acc, mean_mask_lm_loss, loss
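
The removed `_build_model` turns the padding mask into an additive attention bias. The mask is multiplied by its own transpose, then `scale(..., scale=10000.0, bias=-1.0, bias_after_scale=False)` computes `(x - 1) * 10000`, so pairs of real tokens get 0 and any pair involving padding gets -10000. A minimal pure-Python sketch of that arithmetic, assuming a mask of 1.0 for real tokens and 0.0 for padding. `attention_bias` is a hypothetical name, and nested lists stand in for tensors:

```python
def attention_bias(input_mask, n_head):
    """Build an additive attention bias from a padding mask.

    input_mask: list of sequences, each a list of 1.0 (token) / 0.0 (padding).
    Returns a nested list shaped [batch, n_head, seq_len, seq_len].
    """
    batch_bias = []
    for seq_mask in input_mask:
        # Outer product of the mask with itself: 1.0 only where both the
        # query position and the key position are real tokens.
        pair = [[q * k for k in seq_mask] for q in seq_mask]
        # Bias is applied before scaling: (x + (-1.0)) * 10000.0.
        bias = [[(x - 1.0) * 10000.0 for x in row] for row in pair]
        # Every head receives its own copy of the same bias.
        batch_bias.append([[row[:] for row in bias] for _ in range(n_head)])
    return batch_bias


bias = attention_bias([[1.0, 1.0, 0.0]], n_head=2)
```

Adding this bias to the attention logits before the softmax drives the weight of padded positions to nearly zero without changing the logits of real tokens.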
--- a/modules/text/language_model/ernie/model/transformer_encoder.py
+++ b/modules/text/language_model/ernie/model/transformer_encoder.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Transformer encoder."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from functools import partial
-
-import paddle.fluid as fluid
-import paddle.fluid.layers as layers
-
-
-def multi_head_attention(queries,
-                         keys,
-                         values,
-                         attn_bias,
-                         d_key,
-                         d_value,
-                         d_model,
-                         n_head=1,
-                         dropout_rate=0.,
-                         cache=None,
-                         param_initializer=None,
-                         name='multi_head_att'):
-    """
-    Multi-Head Attention. Note that attn_bias is added to the logit before
-    computing softmax activation to mask certain selected positions so that
-    they will not be considered in attention weights.
-    """
-    keys = queries if keys is None else keys
-    values = keys if values is None else values
-
-    if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
-        raise ValueError("Inputs: queries, keys and values should all be 3-D tensors.")
-
-    def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
-        """
-        Add linear projection to queries, keys, and values.
-        """
-        q = layers.fc(input=queries,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_query_fc.b_0')
-        k = layers.fc(input=keys,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_key_fc.b_0')
-        v = layers.fc(input=values,
-                      size=d_value * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_value_fc.b_0')
-        return q, k, v
-
-    def __split_heads(x, n_head):
-        """
-        Reshape the last dimension of input tensor x so that it becomes two
-        dimensions and then transpose. Specifically, input a tensor with shape
-        [bs, max_sequence_length, n_head * hidden_dim] then output a tensor
-        with shape [bs, n_head, max_sequence_length, hidden_dim].
-        """
-        hidden_size = x.shape[-1]
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
-
-        # permute the dimensions into:
-        # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
-        return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
-
-    def __combine_heads(x):
-        """
-        Transpose and then reshape the last two dimensions of input tensor x
-        so that it becomes one dimension, which is the reverse of __split_heads.
-        """
-        if len(x.shape) == 3: return x
-        if len(x.shape) != 4:
-            raise ValueError("Input(x) should be a 4-D Tensor.")
-
-        trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
-
-    def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
-        """
-        Scaled Dot-Product Attention
-        """
-        scaled_q = layers.scale(x=q, scale=d_key**-0.5)
-        product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
-        if attn_bias:
-            product += attn_bias
-        weights = layers.softmax(product)
-        if dropout_rate:
-            weights = layers.dropout(weights,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-        out = layers.matmul(weights, v)
-        return out
-
-    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
-
-    if cache is not None:  # use cache and concat time steps
-        # Since the inplace reshape in __split_heads changes the shape of k and
-        # v, which is the cache input for next time step, reshape the cache
-        # input from the previous time step first.
-        k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
-        v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
-
-    q = __split_heads(q, n_head)
-    k = __split_heads(k, n_head)
-    v = __split_heads(v, n_head)
-
-    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
-
-    out = __combine_heads(ctx_multiheads)
-
-    # Project back to the model size.
-    proj_out = layers.fc(input=out,
-                         size=d_model,
-                         num_flatten_dims=2,
-                         param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
-                         bias_attr=name + '_output_fc.b_0')
-    return proj_out
-
-
-def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
-    """
-    Position-wise Feed-Forward Networks.
-    This module consists of two linear transformations with a ReLU activation
-    in between, which is applied to each position separately and identically.
-    """
-    hidden = layers.fc(input=x,
-                       size=d_inner_hid,
-                       num_flatten_dims=2,
-                       act=hidden_act,
-                       param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
-                       bias_attr=name + '_fc_0.b_0')
-    if dropout_rate:
-        hidden = layers.dropout(hidden,
-                                dropout_prob=dropout_rate,
-                                dropout_implementation="upscale_in_train",
-                                is_test=False)
-    out = layers.fc(input=hidden,
-                    size=d_hid,
-                    num_flatten_dims=2,
-                    param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
-                    bias_attr=name + '_fc_1.b_0')
-    return out
-
-
-def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
-    """
-    Add residual connection, layer normalization and dropout to the out tensor
-    optionally according to the value of process_cmd.
-    This will be used before or after multi-head attention and position-wise
-    feed-forward networks.
-    """
-    for cmd in process_cmd:
-        if cmd == "a":  # add residual connection
-            out = out + prev_out if prev_out else out
-        elif cmd == "n":  # add layer normalization
-            out_dtype = out.dtype
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float32")
-            out = layers.layer_norm(out,
-                                    begin_norm_axis=len(out.shape) - 1,
-                                    param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
-                                                               initializer=fluid.initializer.Constant(1.)),
-                                    bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
-                                                              initializer=fluid.initializer.Constant(0.)))
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float16")
-        elif cmd == "d":  # add dropout
-            if dropout_rate:
-                out = layers.dropout(out,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-    return out
-
-
-pre_process_layer = partial(pre_post_process_layer, None)
-post_process_layer = pre_post_process_layer
-
-
-def encoder_layer(enc_input,
-                  attn_bias,
-                  n_head,
-                  d_key,
-                  d_value,
-                  d_model,
-                  d_inner_hid,
-                  prepostprocess_dropout,
-                  attention_dropout,
-                  relu_dropout,
-                  hidden_act,
-                  preprocess_cmd="n",
-                  postprocess_cmd="da",
-                  param_initializer=None,
-                  name=''):
-    """The encoder layers that can be stacked to form a deep encoder.
-    This module consists of a multi-head (self) attention followed by
-    position-wise feed-forward networks, with both components accompanied
-    by the post_process_layer to add residual connection, layer normalization
-    and dropout.
-    """
-    attn_output = multi_head_attention(pre_process_layer(enc_input,
-                                                         preprocess_cmd,
-                                                         prepostprocess_dropout,
-                                                         name=name + '_pre_att'),
-                                       None,
-                                       None,
-                                       attn_bias,
-                                       d_key,
-                                       d_value,
-                                       d_model,
-                                       n_head,
-                                       attention_dropout,
-                                       param_initializer=param_initializer,
-                                       name=name + '_multi_head_att')
-    attn_output = post_process_layer(enc_input,
-                                     attn_output,
-                                     postprocess_cmd,
-                                     prepostprocess_dropout,
-                                     name=name + '_post_att')
-    ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
-                                                             preprocess_cmd,
-                                                             prepostprocess_dropout,
-                                                             name=name + '_pre_ffn'),
-                                           d_inner_hid,
-                                           d_model,
-                                           relu_dropout,
-                                           hidden_act,
-                                           param_initializer=param_initializer,
-                                           name=name + '_ffn')
-    return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
-
-
-def encoder(enc_input,
-            attn_bias,
-            n_layer,
-            n_head,
-            d_key,
-            d_value,
-            d_model,
-            d_inner_hid,
-            prepostprocess_dropout,
-            attention_dropout,
-            relu_dropout,
-            hidden_act,
-            preprocess_cmd="n",
-            postprocess_cmd="da",
-            param_initializer=None,
-            name=''):
-    """
-    The encoder is composed of a stack of identical layers returned by calling
-    encoder_layer.
-    """
-    for i in range(n_layer):
-        enc_output = encoder_layer(enc_input,
-                                   attn_bias,
-                                   n_head,
-                                   d_key,
-                                   d_value,
-                                   d_model,
-                                   d_inner_hid,
-                                   prepostprocess_dropout,
-                                   attention_dropout,
-                                   relu_dropout,
-                                   hidden_act,
-                                   preprocess_cmd,
-                                   postprocess_cmd,
-                                   param_initializer=param_initializer,
-                                   name=name + '_layer_' + str(i))
-        enc_input = enc_output
-    enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
-
-    return enc_output
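
The removed `scaled_dot_product_attention` scales the queries by `d_key**-0.5`, multiplies them with the transposed keys, adds `attn_bias`, applies a softmax, and uses the resulting weights to mix the values. A minimal single-head sketch of those steps in pure Python, with nested lists standing in for tensors. Dropout and the per-head linear projections are left out, and this `softmax` is a local helper written for the example:

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v, attn_bias=None):
    """q, k: [seq_len][d_key]; v: [seq_len][d_value];
    attn_bias: [seq_len][seq_len] additive bias, or None."""
    scale = len(q[0]) ** -0.5
    out = []
    for i, q_row in enumerate(q):
        # Scaled dot product of this query with every key.
        logits = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        if attn_bias is not None:
            logits = [logit + b for logit, b in zip(logits, attn_bias[i])]
        weights = softmax(logits)
        # Weighted sum of the value rows, one output feature at a time.
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
# A bias of -10000 on the second key masks it out for every query.
masked = scaled_dot_product_attention(keys, keys, values, [[0.0, -10000.0]] * 2)
```

With the second key masked, each output row reduces to the first value row, which is the effect the padding bias is meant to have.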
--- a/modules/text/language_model/ernie/module.py
+++ b/modules/text/language_model/ernie/module.py
-# coding:utf-8
-# Copyright (c) 2019  PaddlePaddle Authors. All Rights Reserved.
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
 #
-# Licensed under the Apache License, Version 2.0 (the "License"
+# Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
@@ -12,65 +11,210 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
+from typing import Dict, List, Optional, Union, Tuple
 import os

-from paddlehub import TransformerModule
-from paddlehub.module.module import moduleinfo
+from paddle.dataset.common import DATA_HOME
+import paddle
+import paddle.nn as nn
+import paddle.nn.functional as F

-from ernie.model.ernie import ErnieModel, ErnieConfig
+from paddlehub import BertTokenizer
+from paddlehub.module.modeling_ernie import ErnieModel, ErnieForSequenceClassification
+from paddlehub.module.module import moduleinfo, serving
+from paddlehub.utils.log import logger
+from paddlehub.utils.utils import download


 @moduleinfo(
    name="ernie",
-    version="1.2.0",
-    summary="Baidu's ERNIE, Enhanced Representation through kNowledge IntEgration, max_seq_len=512 when pretrained",
-    author="baidu-nlp",
+    version="2.0.0",
+    summary=
+    "Baidu's ERNIE, Enhanced Representation through kNowledge IntEgration, max_seq_len=512 when pretrained. The module is executed as paddle.dygraph.",
+    author="paddlepaddle",
    author_email="",
-    type="nlp/semantic_model",
-)
-class Ernie(TransformerModule):
-    def _initialize(self):
-        ernie_config_path = os.path.join(self.directory, "assets", "ernie_config.json")
-        self.ernie_config = ErnieConfig(ernie_config_path)
-        self.MAX_SEQ_LEN = 512
-        self.params_path = os.path.join(self.directory, "assets", "params")
-        self.vocab_path = os.path.join(self.directory, "assets", "vocab.txt")\
-
-    def net(self, input_ids, position_ids, segment_ids, input_mask):
+    type="nlp/semantic_model")
+class Ernie(nn.Layer):
+    """
+    Ernie model
+    """
+
+    def __init__(
+            self,
+            task=None,
+            load_checkpoint=None,
+            label_map=None,
+    ):
+        super(Ernie, self).__init__()
+        # TODO(zhangxuefei): add token_classification task
+        if task == 'sequence_classification':
+            self.model = ErnieForSequenceClassification.from_pretrained(pretrained_model_name_or_path='ernie')
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = paddle.metric.Accuracy(name='acc_accumulation')
+        elif task is None:
+            self.model = ErnieModel.from_pretrained(pretrained_model_name_or_path='ernie')
+        else:
+            raise RuntimeError("Unknown task %s, task should be sequence_classification" % task)
+
+        self.task = task
+        self.label_map = label_map
+
+        if load_checkpoint is not None and os.path.isfile(load_checkpoint):
+            state_dict = paddle.load(load_checkpoint)
+            self.set_state_dict(state_dict)
+            logger.info('Loaded parameters from %s' % os.path.abspath(load_checkpoint))
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, labels=None):
+        result = self.model(input_ids, token_type_ids, position_ids, attention_mask)
+        if self.task is not None:
+            logits = result
+            probs = F.softmax(logits, axis=1)
+            if labels is not None:
+                loss = self.criterion(logits, labels)
+                correct = self.metric.compute(probs, labels)
+                acc = self.metric.update(correct)
+                return probs, loss, acc
+            return probs
+        else:
+            sequence_output, pooled_output = result
+            return sequence_output, pooled_output
+
+    def get_vocab_path(self):
+        """
+        Gets the path of the module's vocabulary file, downloading it if it is missing.
+        """
+        save_path = os.path.join(DATA_HOME, 'ernie', 'vocab.txt')
+        if not os.path.exists(save_path) or not os.path.isfile(save_path):
+            url = "https://paddlenlp.bj.bcebos.com/models/transformers/ernie/vocab.txt"
+            download(url, os.path.join(DATA_HOME, 'ernie'))
+        return save_path
+
+    def get_tokenizer(self, tokenize_chinese_chars=True):
+        """
+        Gets the tokenizer that is customized for this module.
+        Args:
+            tokenize_chinese_chars (:obj: bool , defaults to :obj: True):
+                Whether to tokenize chinese characters or not.
+        Returns:
+            tokenizer (:obj:BertTokenizer) : The tokenizer which was customized for this module.
+        """
+        return BertTokenizer(tokenize_chinese_chars=tokenize_chinese_chars, vocab_file=self.get_vocab_path())
+
+    def training_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        One step for training, which should be called as forward computation.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data containing the inputs the model needs,
+                such as input_ids, sent_ids, pos_ids, input_mask and labels.
+            batch_idx(int): The index of batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as loss and metrics.
        """
-        create neural network.
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'loss': avg_loss, 'metrics': {'acc': acc}}
+
+    def validation_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        One step for validation, which should be called as forward computation.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data containing the inputs the model needs,
+                such as input_ids, sent_ids, pos_ids, input_mask and labels.
+            batch_idx(int): The index of batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as metrics.
+        """
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'metrics': {'acc': acc}}
+
+    def predict(self, data, max_seq_len=128, batch_size=1, use_gpu=False):
+        """
+        Predicts the data labels.

        Args:
-            input_ids (tensor): the word ids.
-            position_ids (tensor): the position ids.
-            segment_ids (tensor): the segment ids.
-            input_mask (tensor): the padding mask.
+            data (obj:`List(List(str))`): The data to predict. Each element is a list holding one raw text or a pair of raw texts.
+            max_seq_len (:obj:`int`, `optional`, defaults to 128):
+                If set to a number, will limit the total sequence returned so that it has a maximum length.
+            batch_size(obj:`int`, defaults to 1): The batch size.
+            use_gpu(obj:`bool`, defaults to `False`): Whether to use gpu to run or not.

        Returns:
-            pooled_output (tensor):  sentence-level output for classification task.
-            sequence_output (tensor): token-level output for sequence task.
+            results(obj:`list`): All the predicted labels.
        """
-        self.ernie_config._config_dict['use_task_id'] = False
-        ernie = ErnieModel(src_ids=input_ids,
-                           position_ids=position_ids,
-                           sentence_ids=segment_ids,
-                           input_mask=input_mask,
-                           config=self.ernie_config,
-                           use_fp16=False)
-        pooled_output = ernie.get_pooled_output()
-        sequence_output = ernie.get_sequence_output()
-        return pooled_output, sequence_output
-
-    def param_prefix(self):
-        return "@HUB_ernie-stable@"
-
-
-if __name__ == '__main__':
-    test_module = Ernie()
+        # TODO(zhangxuefei): add task token_classification task predict.
+        if self.task not in ['sequence_classification']:
+            raise RuntimeError("The predict method is for sequence_classification task, but got task %s." % self.task)
+
+        paddle.set_device('gpu' if use_gpu else 'cpu')
+        tokenizer = self.get_tokenizer()
+
+        examples = []
+        for text in data:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, max_seq_len=max_seq_len)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], max_seq_len=max_seq_len)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequences, but got %d. Please check your inputs.' % len(text))
+            examples.append((encoded_inputs['input_ids'], encoded_inputs['segment_ids']))
+
+        def _batchify_fn(batch):
+            input_ids = [entry[0] for entry in batch]
+            segment_ids = [entry[1] for entry in batch]
+            return input_ids, segment_ids
+
+        # Separates the data into batches.
+        batches = []
+        one_batch = []
+        for example in examples:
+            one_batch.append(example)
+            if len(one_batch) == batch_size:
+                batches.append(one_batch)
+                one_batch = []
+        if one_batch:
+            # The last batch may be smaller than the configured batch_size.
+            batches.append(one_batch)
+
+        results = []
+        self.eval()
+        for batch in batches:
+            input_ids, segment_ids = _batchify_fn(batch)
+            input_ids = paddle.to_tensor(input_ids)
+            segment_ids = paddle.to_tensor(segment_ids)
+
+            # TODO(zhangxuefei): add task token_classification postprocess after prediction.
+            if self.task == 'sequence_classification':
+                probs = self(input_ids, segment_ids)
+                idx = paddle.argmax(probs, axis=1).numpy()
+                idx = idx.tolist()
+                labels = [self.label_map[i] for i in idx]
+                results.extend(labels)
+
+        return results
+
+    @serving
+    def get_embedding(self, texts, use_gpu=False):
+        if self.task is not None:
+            raise RuntimeError("The get_embedding method is only valid when task is None, but got task %s" % self.task)
+
+        paddle.set_device('gpu' if use_gpu else 'cpu')
+
+        tokenizer = self.get_tokenizer()
+        results = []
+        for text in texts:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, pad_to_max_seq_len=False)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], pad_to_max_seq_len=False)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequences, but got %d. Please check your inputs.' % len(text))
+
+            input_ids = paddle.to_tensor(encoded_inputs['input_ids']).unsqueeze(0)
+            segment_ids = paddle.to_tensor(encoded_inputs['segment_ids']).unsqueeze(0)
+            sequence_output, pooled_output = self(input_ids, segment_ids)
+
+            sequence_output = sequence_output.squeeze(0)
+            pooled_output = pooled_output.squeeze(0)
+            results.append((sequence_output.numpy().tolist(), pooled_output.numpy().tolist()))
+        return results
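
The batching and label lookup that `predict` performs above can be sketched with plain Python lists in place of Paddle tensors. This is only an illustration: `chunk` and `map_labels` are hypothetical helper names that do not appear in the PR, and the per-row maximum stands in for `paddle.argmax(probs, axis=1)`.

```python
def chunk(examples, batch_size):
    """Split examples into consecutive batches; the last one may be smaller."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]


def map_labels(probs, label_map):
    """Pick the highest-scoring class in each row and look it up in label_map."""
    labels = []
    for row in probs:
        # Index of the largest probability in this row (a plain-Python argmax).
        idx = max(range(len(row)), key=row.__getitem__)
        labels.append(label_map[idx])
    return labels


batches = chunk(list(range(5)), batch_size=2)
print(batches)  # [[0, 1], [2, 3], [4]]
print(map_labels([[0.9, 0.1], [0.2, 0.8]], {0: 'negative', 1: 'positive'}))  # ['negative', 'positive']
```

Slicing in fixed strides yields the same batches as the append-and-flush loop in the PR, including the short final batch.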
--- a/modules/text/language_model/ernie_tiny/README.md
+++ b/modules/text/language_model/ernie_tiny/README.md
 ```shell
-$ hub install ernie_tiny==1.1.0
+$ hub install ernie_tiny==2.0.0
 ```
+## Online Demo
+<a class="ant-btn large" href="https://aistudio.baidu.com/aistudio/projectDetail/79380" target="_blank">Try it on AI Studio</a>
+
+
 <p align="center">
-<img src="https://paddlehub.bj.bcebos.com/paddlehub-img%2Fernie_tiny_framework.PNG" hspace='10'/> <br />
+<img src="https://bj.bcebos.com/paddlehub/paddlehub-img/ernie_network_1.png" hspace='10'/> <br />
 </p>

+
+<p align="center">
+<img src="https://bj.bcebos.com/paddlehub/paddlehub-img/ernie_network_2.png" hspace='10'/> <br />
+</p>
+
+
+For more details, see the [ERNIE paper](https://arxiv.org/abs/1904.09223).
+
 ## API
+
 ```python
-def context(
-    trainable=True,
-    max_seq_len=128
-)
+def __init__(
+    task=None,
+    load_checkpoint=None,
+    label_map=None)
 ```
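-Gets the Module's context information, returning the inputs, outputs, and a copy of the pretrained Paddle Program.
+
+Creates the Module object (dynamic graph version).

 **Parameters**
-> trainable: When set to True, the Module's parameters are also trained during fine-tuning; otherwise they stay fixed.
-> max_seq_len: The maximum sequence length of the ERNIE model. Shorter sequences are padded to **max_seq_len**, and longer sequences are truncated to **max_seq_len**. Valid values range from 0 to 512.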

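-**Returns**
-> inputs: a dict with the following fields:
-> >The **input_ids** field holds the Token Embedding, with shape \[batch_size, max_seq_len\] and type int64;
-> >The **position_ids** field holds the Position Embedding, with shape \[batch_size, max_seq_len\] and type int64;
-> >The **segment_ids** field holds the Sentence Embedding, with shape \[batch_size, max_seq_len\] and type int64;
-> >The **input_mask** field marks whether each token is padding, with shape \[batch_size, max_seq_len\] and type int64;
->
-> outputs: a dict of the Module's output features, with the following fields:
-> >The **pooled_output** field holds sentence-level features, usable for tasks such as text classification, with shape \[batch_size, 768\] and type int64;
-> >The **sequence_output** field holds token-level features, usable for tasks such as sequence labeling, with shape \[batch_size, seq_len, 768\] and type int64;
->
->  program: the Program containing this Module's computation graph.
+* `task`: The task name; can be `sequence_classification`.
+* `load_checkpoint`: Path to the model parameter file saved by training with the PaddleHub Fine-tune API.
+* `label_map`: The class mapping table used at prediction time.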

+```python
+def predict(
+    data,
+    max_seq_len=128,
+    batch_size=1,
+    use_gpu=False)
+```

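+**Parameters**
+
+* `data`: The data to predict, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample
+    and each sample may contain text\_a and text\_b. The number of texts per sample (1 or 2) must match what was used in training.
+* `max_seq_len`: The maximum text length the model processes.
+* `batch_size`: The batch size used by the model.
+* `use_gpu`: Whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.

+**Returns**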

 ```python
 def get_embedding(
    texts,
-    use_gpu=False,
-    batch_size=1
+    use_gpu=False
 )
 ```

@@ -46,70 +63,86 @@ def get_embedding(

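 **Parameters**

-> texts: The list of input texts, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b.
-> use_gpu: Whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.
+* `texts`: The list of input texts, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b.
+* `use_gpu`: Whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.

 **Returns**

-> results: A list in the format \[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\], where each element is the feature output for the corresponding sample; each sample has the sentence-level feature pooled\_feature and the token-level feature seq\_feature.
->
+* `results`: A list in the format \[\[sample\_a\_seq\_feature, sample\_a\_pooled\_feature\], \[sample\_b\_seq\_feature, sample\_b\_pooled\_feature\],…,\], where each element is the feature output for the corresponding sample; each sample has the token-level feature seq\_feature and the sentence-level feature pooled\_feature.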

-```python
-def get_params_layer()
-```

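-Gets parameter layer information. Used together with ULMFiTStrategy, it allows layer-wise learning rates and gradual unfreezing to be set strictly by layer depth.
+**Code Example**

-**Parameters**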
+```python
+import paddlehub as hub

-> None
+# Each sample is a list holding one text (or two texts for sentence pairs).
+data = [
+    ['这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般'],
+    ['怀着十分激动的心情放映，可是看着看着发现，在放映完毕后，出现一集米老鼠的动画片'],
+    ['作为老的四星酒店，房间依然很整洁，相当不错。机场接机服务很好，可以在车上办理入住手续，节省时间。'],
+]
+label_map = {0: 'negative', 1: 'positive'}
+
+model = hub.Module(
+    name='ernie_tiny',
+    version='2.0.0',
+    task='sequence_classification',
+    load_checkpoint='/path/to/parameters',
+    label_map=label_map)
+results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False)
+for idx, text in enumerate(data):
+    print('Data: {} \t Label: {}'.format(text[0], results[idx]))
+```

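-**Returns**
+See the PaddleHub text classification example: https://github.com/PaddlePaddle/PaddleHub/tree/release/v2.0.0-beta/demo/text_classification

-> params_layer: a dict whose keys are parameter names and whose values are the layer each parameter belongs to
+## Serving Deployment

-**Code Example**
+PaddleHub Serving can deploy an online service for obtaining pretrained word embeddings.

-```python
-import paddlehub as hub
+### Step1: Start PaddleHub Serving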

-# Load ernie pretrained model
-module = hub.Module(name="ernie_tiny")
-inputs, outputs, program = module.context(trainable=True, max_seq_len=128)
+Run the start command:

-# Must feed all the tensor of ernie's module need
-input_ids = inputs["input_ids"]
-position_ids = inputs["position_ids"]
-segment_ids = inputs["segment_ids"]
-input_mask = inputs["input_mask"]
+```shell
+$ hub serving start -m ernie_tiny
+```

-# Use "pooled_output" for sentence-level output.
-pooled_output = outputs["pooled_output"]
+This deploys a service API for obtaining pretrained word embeddings; the default port is 8866.

-# Use "sequence_output" for token-level output.
-sequence_output = outputs["sequence_output"]
+**NOTE:** To predict on a GPU, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise no setting is needed.

-# Use "get_embedding" to get embedding result.
-embedding_result = module.get_embedding(texts=[["Sample1_text_a"],["Sample2_text_a","Sample2_text_b"]], use_gpu=True)
+### Step2: Send a Prediction Request

-# Use "get_params_layer" to get params layer and used to ULMFiTStrategy.
-params_layer = module.get_params_layer()
-strategy = hub.finetune.strategy.ULMFiTStrategy(frz_params_layer=params_layer, dis_params_layer=params_layer)
-```
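-For a fine-tuning example using this PaddleHub Module, see [Text Classification](https://github.com/PaddlePaddle/PaddleHub/tree/release/v1.8/demo/text_classification).
+With the server configured, the few lines of code below send a prediction request and retrieve the result.

-**Note**: Running this PaddleHub Module in a **GPU** environment is recommended. If GPU memory runs out, reduce **batch_size** or **max_seq_len**.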
+```python
+import requests
+import json
+
+# Specify the texts to predict and build the dictionary {"texts": [text_1, text_2, ... ]}
+text = [["今天是个好日子", "天气预报说今天要下雨"], ["这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般"]]
+# The key names the prediction method parameter that receives text; here it is "texts"
+# The equivalent local call is module.get_embedding(texts=text)
+data = {"texts": text}
+# Send a POST request; the Content-Type should be JSON
+url = "http://10.12.121.132:8866/predict/ernie_tiny"
+# Set the POST request headers to application/json
+headers = {"Content-Type": "application/json"}
+
+r = requests.post(url=url, headers=headers, data=json.dumps(data))
+print(r.json())
+```

 ## Source Code

 https://github.com/PaddlePaddle/ERNIE

-
 ## Dependencies

-paddlepaddle >= 1.6.2
+paddlepaddle >= 2.0.0

-paddlehub >= 1.6.0
+paddlehub >= 2.0.0

 ## Release History

@@ -124,3 +157,7 @@ paddlehub >= 1.6.0
 * 1.1.0

   Supports get_embedding and get_params_layer
+
+* 2.0.0
+
+  Fully upgraded to the dynamic graph version; the interfaces have changed
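
The README above documents that `predict` and `get_embedding` take a nested list in which every sample holds one or two texts. A caller-side check along those lines might look like the sketch below. `normalize_samples` is a hypothetical helper that is not part of PaddleHub or this PR, and wrapping bare strings is an added convenience rather than documented module behavior.

```python
def normalize_samples(data):
    """Return every sample as a list of one or two texts, wrapping bare strings."""
    samples = []
    for item in data:
        # A bare string is treated as a single-text sample.
        sample = [item] if isinstance(item, str) else list(item)
        if len(sample) not in (1, 2):
            raise ValueError('Each sample must hold one or two texts, but got %d.' % len(sample))
        samples.append(sample)
    return samples


print(normalize_samples(['text_a', ['text_a', 'text_b']]))
# [['text_a'], ['text_a', 'text_b']]
```

Normalizing before the call keeps a plain string from being measured by its character count when the module checks `len(text)`.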
--- a/modules/text/language_model/ernie_tiny/model/ernie.py
+++ b/modules/text/language_model/ernie_tiny/model/ernie.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Ernie model."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-from __future__ import unicode_literals
-from __future__ import absolute_import
-
-import json
-import six
-
-import paddle.fluid as fluid
-from io import open
-from paddlehub.common.logger import logger
-
-from ernie_tiny.model.transformer_encoder import encoder, pre_process_layer
-
-
-class ErnieConfig(object):
-    def __init__(self, config_path):
-        self._config_dict = self._parse(config_path)
-
-    def _parse(self, config_path):
-        try:
-            with open(config_path, 'r', encoding='utf8') as json_file:
-                config_dict = json.load(json_file)
-        except Exception:
-            raise IOError("Error in parsing Ernie model config file '%s'" % config_path)
-        else:
-            return config_dict
-
-    def __getitem__(self, key):
-        return self._config_dict.get(key, None)
-
-    def print_config(self):
-        for arg, value in sorted(six.iteritems(self._config_dict)):
-            logger.info('%s: %s' % (arg, value))
-        logger.info('------------------------------------------------')
-
-
-class ErnieModel(object):
-    def __init__(self,
-                 src_ids,
-                 position_ids,
-                 sentence_ids,
-                 task_ids,
-                 input_mask,
-                 config,
-                 weight_sharing=True,
-                 use_fp16=False):
-
-        self._emb_size = config['hidden_size']
-        self._n_layer = config['num_hidden_layers']
-        self._n_head = config['num_attention_heads']
-        self._voc_size = config['vocab_size']
-        self._max_position_seq_len = config['max_position_embeddings']
-        if config['sent_type_vocab_size']:  # line 47: return self._config_dict.get(key, None)
-            self._sent_types = config['sent_type_vocab_size']
-        else:
-            self._sent_types = config['type_vocab_size']
-
-        self._use_task_id = config['use_task_id']
-        if self._use_task_id:
-            self._task_types = config['task_type_vocab_size']
-        self._hidden_act = config['hidden_act']
-        self._prepostprocess_dropout = config['hidden_dropout_prob']
-        self._attention_dropout = config['attention_probs_dropout_prob']
-        self._weight_sharing = weight_sharing
-
-        self._word_emb_name = "word_embedding"
-        self._pos_emb_name = "pos_embedding"
-        self._sent_emb_name = "sent_embedding"
-        self._task_emb_name = "task_embedding"
-        self._dtype = "float16" if use_fp16 else "float32"
-        self._emb_dtype = "float32"
-
-        # Initialize all weights by truncated normal initializer, and all biases
-        # will be initialized by constant zero by default.
-        self._param_initializer = fluid.initializer.TruncatedNormal(scale=config['initializer_range'])
-
-        self._build_model(src_ids, position_ids, sentence_ids, task_ids, input_mask)
-
-    def _build_model(self, src_ids, position_ids, sentence_ids, task_ids, input_mask):
-        # padding id in vocabulary must be set to 0
-        emb_out = fluid.layers.embedding(input=src_ids,
-                                         size=[self._voc_size, self._emb_size],
-                                         dtype=self._emb_dtype,
-                                         param_attr=fluid.ParamAttr(name=self._word_emb_name,
-                                                                    initializer=self._param_initializer),
-                                         is_sparse=False)
-
-        position_emb_out = fluid.layers.embedding(input=position_ids,
-                                                  size=[self._max_position_seq_len, self._emb_size],
-                                                  dtype=self._emb_dtype,
-                                                  param_attr=fluid.ParamAttr(name=self._pos_emb_name,
-                                                                             initializer=self._param_initializer))
-
-        sent_emb_out = fluid.layers.embedding(sentence_ids,
-                                              size=[self._sent_types, self._emb_size],
-                                              dtype=self._emb_dtype,
-                                              param_attr=fluid.ParamAttr(name=self._sent_emb_name,
-                                                                         initializer=self._param_initializer))
-
-        emb_out = emb_out + position_emb_out
-        emb_out = emb_out + sent_emb_out
-
-        if self._use_task_id:
-            task_emb_out = fluid.layers.embedding(task_ids,
-                                                  size=[self._task_types, self._emb_size],
-                                                  dtype=self._emb_dtype,
-                                                  param_attr=fluid.ParamAttr(name=self._task_emb_name,
-                                                                             initializer=self._param_initializer))
-
-            emb_out = emb_out + task_emb_out
-
-        emb_out = pre_process_layer(emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
-
-        if self._dtype == "float16":
-            emb_out = fluid.layers.cast(x=emb_out, dtype=self._dtype)
-            input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
-        self_attn_mask = fluid.layers.matmul(x=input_mask, y=input_mask, transpose_y=True)
-
-        self_attn_mask = fluid.layers.scale(x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
-        n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
-        n_head_self_attn_mask.stop_gradient = True
-
-        self._enc_out = encoder(enc_input=emb_out,
-                                attn_bias=n_head_self_attn_mask,
-                                n_layer=self._n_layer,
-                                n_head=self._n_head,
-                                d_key=self._emb_size // self._n_head,
-                                d_value=self._emb_size // self._n_head,
-                                d_model=self._emb_size,
-                                d_inner_hid=self._emb_size * 4,
-                                prepostprocess_dropout=self._prepostprocess_dropout,
-                                attention_dropout=self._attention_dropout,
-                                relu_dropout=0,
-                                hidden_act=self._hidden_act,
-                                preprocess_cmd="",
-                                postprocess_cmd="dan",
-                                param_initializer=self._param_initializer,
-                                name='encoder')
-        if self._dtype == "float16":
-            self._enc_out = fluid.layers.cast(x=self._enc_out, dtype=self._emb_dtype)
-
-    def get_sequence_output(self):
-        return self._enc_out
-
-    def get_pooled_output(self):
-        """Get the first feature of each sequence for classification"""
-        next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
-        next_sent_feat = fluid.layers.fc(input=next_sent_feat,
-                                         size=self._emb_size,
-                                         act="tanh",
-                                         param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
-                                                                    initializer=self._param_initializer),
-                                         bias_attr="pooled_fc.b_0")
-        return next_sent_feat
-
-    def get_lm_output(self, mask_label, mask_pos):
-        """Get the loss & accuracy for pretraining"""
-
-        mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32')
-
-        # extract the first token feature in each sentence
-        self.next_sent_feat = self.get_pooled_output()
-        reshaped_emb_out = fluid.layers.reshape(x=self._enc_out, shape=[-1, self._emb_size])
-        # extract masked tokens' feature
-        mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
-
-        # transform: fc
-        mask_trans_feat = fluid.layers.fc(input=mask_feat,
-                                          size=self._emb_size,
-                                          act=self._hidden_act,
-                                          param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
-                                                                     initializer=self._param_initializer),
-                                          bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
-
-        # transform: layer norm
-        mask_trans_feat = fluid.layers.layer_norm(mask_trans_feat,
-                                                  begin_norm_axis=len(mask_trans_feat.shape) - 1,
-                                                  param_attr=fluid.ParamAttr(
-                                                      name='mask_lm_trans_layer_norm_scale',
-                                                      initializer=fluid.initializer.Constant(1.)),
-                                                  bias_attr=fluid.ParamAttr(name='mask_lm_trans_layer_norm_bias',
-                                                                            initializer=fluid.initializer.Constant(1.)))
-        # transform: layer norm
-        #mask_trans_feat = pre_process_layer(
-        #    mask_trans_feat, 'n', name='mask_lm_trans')
-
-        mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
-                                                initializer=fluid.initializer.Constant(value=0.0))
-        if self._weight_sharing:
-            fc_out = fluid.layers.matmul(x=mask_trans_feat,
-                                         y=fluid.default_main_program().global_block().var(self._word_emb_name),
-                                         transpose_y=True)
-            fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
-                                                    dtype=self._emb_dtype,
-                                                    attr=mask_lm_out_bias_attr,
-                                                    is_bias=True)
-
-        else:
-            fc_out = fluid.layers.fc(input=mask_trans_feat,
-                                     size=self._voc_size,
-                                     param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
-                                                                initializer=self._param_initializer),
-                                     bias_attr=mask_lm_out_bias_attr)
-
-        mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
-        mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
-
-        return mean_mask_lm_loss
-
-    def get_task_output(self, task, task_labels):
-        task_fc_out = fluid.layers.fc(input=self.next_sent_feat,
-                                      size=task["num_labels"],
-                                      param_attr=fluid.ParamAttr(name=task["task_name"] + "_fc.w_0",
-                                                                 initializer=self._param_initializer),
-                                      bias_attr=task["task_name"] + "_fc.b_0")
-        task_loss, task_softmax = fluid.layers.softmax_with_cross_entropy(logits=task_fc_out,
-                                                                          label=task_labels,
-                                                                          return_softmax=True)
-        task_acc = fluid.layers.accuracy(input=task_softmax, label=task_labels)
-        mean_task_loss = fluid.layers.mean(task_loss)
-        return mean_task_loss, task_acc
--- a/modules/text/language_model/ernie_tiny/model/transformer_encoder.py
+++ b/modules/text/language_model/ernie_tiny/model/transformer_encoder.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Transformer encoder."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from functools import partial
-
-import paddle.fluid as fluid
-import paddle.fluid.layers as layers
-
-
-def multi_head_attention(queries,
-                         keys,
-                         values,
-                         attn_bias,
-                         d_key,
-                         d_value,
-                         d_model,
-                         n_head=1,
-                         dropout_rate=0.,
-                         cache=None,
-                         param_initializer=None,
-                         name='multi_head_att'):
-    """
-    Multi-Head Attention. Note that attn_bias is added to the logit before
-    computing softmax activation to mask certain selected positions so that
-    they will not be considered in attention weights.
-    """
-    keys = queries if keys is None else keys
-    values = keys if values is None else values
-
-    if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
-        raise ValueError("Inputs: queries, keys and values should all be 3-D tensors.")
-
-    def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
-        """
-        Add linear projection to queries, keys, and values.
-        """
-        q = layers.fc(input=queries,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_query_fc.b_0')
-        k = layers.fc(input=keys,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_key_fc.b_0')
-        v = layers.fc(input=values,
-                      size=d_value * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_value_fc.b_0')
-        return q, k, v
-
-    def __split_heads(x, n_head):
-        """
-        Reshape the last dimension of input tensor x so that it becomes two
-        dimensions and then transpose. Specifically, input a tensor with shape
-        [bs, max_sequence_length, n_head * hidden_dim] then output a tensor
-        with shape [bs, n_head, max_sequence_length, hidden_dim].
-        """
-        hidden_size = x.shape[-1]
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
-
-        # permute the dimensions into:
-        # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
-        return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
-
-    def __combine_heads(x):
-        """
-        Transpose and then reshape the last two dimensions of input tensor x
-        so that it becomes one dimension, which is reverse to __split_heads.
-        """
-        if len(x.shape) == 3: return x
-        if len(x.shape) != 4:
-            raise ValueError("Input(x) should be a 4-D Tensor.")
-
-        trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
-
-    def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
-        """
-        Scaled Dot-Product Attention
-        """
-        scaled_q = layers.scale(x=q, scale=d_key**-0.5)
-        product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
-        if attn_bias:
-            product += attn_bias
-        weights = layers.softmax(product)
-        if dropout_rate:
-            weights = layers.dropout(weights,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-        out = layers.matmul(weights, v)
-        return out
-
-    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
-
-    if cache is not None:  # use cache and concat time steps
-        # Since the inplace reshape in __split_heads changes the shape of k and
-        # v, which is the cache input for next time step, reshape the cache
-        # input from the previous time step first.
-        k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
-        v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
-
-    q = __split_heads(q, n_head)
-    k = __split_heads(k, n_head)
-    v = __split_heads(v, n_head)
-
-    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
-
-    out = __combine_heads(ctx_multiheads)
-
-    # Project back to the model size.
-    proj_out = layers.fc(input=out,
-                         size=d_model,
-                         num_flatten_dims=2,
-                         param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
-                         bias_attr=name + '_output_fc.b_0')
-    return proj_out
-
-
-def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
-    """
-    Position-wise Feed-Forward Networks.
-    This module consists of two linear transformations with a ReLU activation
-    in between, which is applied to each position separately and identically.
-    """
-    hidden = layers.fc(input=x,
-                       size=d_inner_hid,
-                       num_flatten_dims=2,
-                       act=hidden_act,
-                       param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
-                       bias_attr=name + '_fc_0.b_0')
-    if dropout_rate:
-        hidden = layers.dropout(hidden,
-                                dropout_prob=dropout_rate,
-                                dropout_implementation="upscale_in_train",
-                                is_test=False)
-    out = layers.fc(input=hidden,
-                    size=d_hid,
-                    num_flatten_dims=2,
-                    param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
-                    bias_attr=name + '_fc_1.b_0')
-    return out
-
-
-def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
-    """
-    Add residual connection, layer normalization and dropout to the out tensor
-    optionally according to the value of process_cmd.
-    This will be used before or after multi-head attention and position-wise
-    feed-forward networks.
-    """
-    for cmd in process_cmd:
-        if cmd == "a":  # add residual connection
-            out = out + prev_out if prev_out else out
-        elif cmd == "n":  # add layer normalization
-            out_dtype = out.dtype
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float32")
-            out = layers.layer_norm(out,
-                                    begin_norm_axis=len(out.shape) - 1,
-                                    param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
-                                                               initializer=fluid.initializer.Constant(1.)),
-                                    bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
-                                                              initializer=fluid.initializer.Constant(0.)))
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float16")
-        elif cmd == "d":  # add dropout
-            if dropout_rate:
-                out = layers.dropout(out,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-    return out
-
-
-pre_process_layer = partial(pre_post_process_layer, None)
-post_process_layer = pre_post_process_layer
-
-
-def encoder_layer(enc_input,
-                  attn_bias,
-                  n_head,
-                  d_key,
-                  d_value,
-                  d_model,
-                  d_inner_hid,
-                  prepostprocess_dropout,
-                  attention_dropout,
-                  relu_dropout,
-                  hidden_act,
-                  preprocess_cmd="n",
-                  postprocess_cmd="da",
-                  param_initializer=None,
-                  name=''):
-    """The encoder layers that can be stacked to form a deep encoder.
-    This module consists of a multi-head (self) attention followed by
-    position-wise feed-forward networks, with both components accompanied
-    by the post_process_layer to add residual connection, layer normalization
-    and dropout.
-    """
-    attn_output = multi_head_attention(pre_process_layer(enc_input,
-                                                         preprocess_cmd,
-                                                         prepostprocess_dropout,
-                                                         name=name + '_pre_att'),
-                                       None,
-                                       None,
-                                       attn_bias,
-                                       d_key,
-                                       d_value,
-                                       d_model,
-                                       n_head,
-                                       attention_dropout,
-                                       param_initializer=param_initializer,
-                                       name=name + '_multi_head_att')
-    attn_output = post_process_layer(enc_input,
-                                     attn_output,
-                                     postprocess_cmd,
-                                     prepostprocess_dropout,
-                                     name=name + '_post_att')
-    ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
-                                                             preprocess_cmd,
-                                                             prepostprocess_dropout,
-                                                             name=name + '_pre_ffn'),
-                                           d_inner_hid,
-                                           d_model,
-                                           relu_dropout,
-                                           hidden_act,
-                                           param_initializer=param_initializer,
-                                           name=name + '_ffn')
-    return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
-
-
-def encoder(enc_input,
-            attn_bias,
-            n_layer,
-            n_head,
-            d_key,
-            d_value,
-            d_model,
-            d_inner_hid,
-            prepostprocess_dropout,
-            attention_dropout,
-            relu_dropout,
-            hidden_act,
-            preprocess_cmd="n",
-            postprocess_cmd="da",
-            param_initializer=None,
-            name=''):
-    """
-    The encoder is composed of a stack of identical layers returned by calling
-    encoder_layer.
-    """
-    for i in range(n_layer):
-        enc_output = encoder_layer(enc_input,
-                                   attn_bias,
-                                   n_head,
-                                   d_key,
-                                   d_value,
-                                   d_model,
-                                   d_inner_hid,
-                                   prepostprocess_dropout,
-                                   attention_dropout,
-                                   relu_dropout,
-                                   hidden_act,
-                                   preprocess_cmd,
-                                   postprocess_cmd,
-                                   param_initializer=param_initializer,
-                                   name=name + '_layer_' + str(i))
-        enc_input = enc_output
-    enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
-
-    return enc_output
--- a/modules/text/language_model/ernie_tiny/module.py
+++ b/modules/text/language_model/ernie_tiny/module.py
-# coding:utf-8
-# Copyright (c) 2019  PaddlePaddle Authors. All Rights Reserved.
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
 #
-# Licensed under the Apache License, Version 2.0 (the "License"
+# Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
@@ -12,68 +11,219 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
+from typing import Dict, List, Optional, Union, Tuple
 import os

-from paddlehub import TransformerModule
-from paddlehub.module.module import moduleinfo
+from paddle.dataset.common import DATA_HOME
+import paddle
+import paddle.nn as nn
+import paddle.nn.functional as F

-from ernie_tiny.model.ernie import ErnieModel, ErnieConfig
+from paddlehub import ErnieTinyTokenizer
+from paddlehub.module.modeling_ernie import ErnieModel, ErnieForSequenceClassification
+from paddlehub.module.module import moduleinfo, serving
+from paddlehub.utils.log import logger
+from paddlehub.utils.utils import download


 @moduleinfo(
    name="ernie_tiny",
-    version="1.1.0",
+    version="2.0.0",
    summary="Baidu's ERNIE-tiny, Enhanced Representation through kNowledge IntEgration, tiny version, max_seq_len=512",
-    author="baidu-nlp",
+    author="paddlepaddle",
    author_email="",
-    type="nlp/semantic_model",
-)
-class ErnieTiny(TransformerModule):
-    def _initialize(self):
-        ernie_config_path = os.path.join(self.directory, "assets", "ernie_tiny_config.json")
-        self.ernie_config = ErnieConfig(ernie_config_path)
-        self.MAX_SEQ_LEN = 512
-        self.params_path = os.path.join(self.directory, "assets", "params")
-        self.vocab_path = os.path.join(self.directory, "assets", "vocab.txt")
-        self.spm_path = os.path.join(self.directory, "assets", "spm_cased_simp_sampled.model")
-        self.word_dict_path = os.path.join(self.directory, "assets", "dict.wordseg.pickle")
-
-    def net(self, input_ids, position_ids, segment_ids, input_mask):
+    type="nlp/semantic_model")
+class ErnieTiny(nn.Layer):
+    """
+    Ernie model
+    """
+
+    def __init__(
+            self,
+            task=None,
+            load_checkpoint=None,
+            label_map=None,
+    ):
+        super(ErnieTiny, self).__init__()
+        # TODO(zhangxuefei): add token_classification task
+        if task == 'sequence_classification':
+            self.model = ErnieForSequenceClassification.from_pretrained(pretrained_model_name_or_path='ernie_tiny')
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = paddle.metric.Accuracy(name='acc_accumulation')
+        elif task is None:
+            self.model = ErnieModel.from_pretrained(pretrained_model_name_or_path='ernie_tiny')
+        else:
+            raise RuntimeError("Unknown task %s, task should be sequence_classification" % task)
+
+        self.task = task
+        self.label_map = label_map
+
+        if load_checkpoint is not None and os.path.isfile(load_checkpoint):
+            state_dict = paddle.load(load_checkpoint)
+            self.set_state_dict(state_dict)
+            logger.info('Loaded parameters from %s' % os.path.abspath(load_checkpoint))
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, labels=None):
+        result = self.model(input_ids, token_type_ids, position_ids, attention_mask)
+        if self.task is not None:
+            logits = result
+            probs = F.softmax(logits, axis=1)
+            if labels is not None:
+                loss = self.criterion(logits, labels)
+                correct = self.metric.compute(probs, labels)
+                acc = self.metric.update(correct)
+                return probs, loss, acc
+            return probs
+        else:
+            sequence_output, pooled_output = result
+            return sequence_output, pooled_output
+
+    def get_vocab_path(self):
+        """
+        Gets the path of the module's vocabulary file, downloading it first if needed.
+        """
+        save_path = os.path.join(DATA_HOME, 'ernie_tiny', 'vocab.txt')
+        if not os.path.exists(save_path) or not os.path.isfile(save_path):
+            url = "https://paddlenlp.bj.bcebos.com/models/transformers/ernie_tiny/vocab.txt"
+            download(url, os.path.join(DATA_HOME, 'ernie_tiny'))
+        return save_path
+
+    def get_tokenizer(self, tokenize_chinese_chars=True):
+        """
+        Gets the tokenizer that is customized for this module.
+        Args:
+            tokenize_chinese_chars (:obj: bool , defaults to :obj: True):
+                Whether to tokenize chinese characters or not.
+        Returns:
+            tokenizer (:obj:ErnieTinyTokenizer) : The tokenizer customized for this module.
+        """
+        spm_path = os.path.join(DATA_HOME, 'ernie_tiny', 'spm_cased_simp_sampled.model')
+        if not os.path.exists(spm_path) or not os.path.isfile(spm_path):
+            url = "https://paddlenlp.bj.bcebos.com/models/transformers/ernie_tiny/spm_cased_simp_sampled.model"
+            download(url, os.path.join(DATA_HOME, 'ernie_tiny'))
+
+        word_dict_path = os.path.join(DATA_HOME, 'ernie_tiny', 'dict.wordseg.pickle')
+        if not os.path.exists(word_dict_path) or not os.path.isfile(word_dict_path):
+            url = "https://paddlenlp.bj.bcebos.com/models/transformers/ernie_tiny/dict.wordseg.pickle"
+            download(url, os.path.join(DATA_HOME, 'ernie_tiny'))
+
+        return ErnieTinyTokenizer(self.get_vocab_path(), spm_path, word_dict_path)
+
+    def training_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        One step for training, which should be called as forward computation.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data holding the model inputs,
+                here input_ids, token_type_ids (segment ids) and labels, in that order.
+            batch_idx(int): The index of batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as loss and metrics.
+        """
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'loss': avg_loss, 'metrics': {'acc': acc}}
+
+    def validation_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        One step for validation, which should be called as forward computation.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data holding the model inputs,
+                here input_ids, token_type_ids (segment ids) and labels, in that order.
+            batch_idx(int): The index of batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as metrics.
+        """
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'metrics': {'acc': acc}}
+
+    def predict(self, data, max_seq_len=128, batch_size=1, use_gpu=False):
        """
-        create neural network.
+        Predicts the data labels.

        Args:
-            input_ids (tensor): the word ids.
-            position_ids (tensor): the position ids.
-            segment_ids (tensor): the segment ids.
-            input_mask (tensor): the padding mask.
+            data (obj:`List(List(str))`): The data to predict on; each element is a sample holding one or two raw texts.
+            max_seq_len (:obj:`int`, `optional`, defaults to 128):
+                If set to a number, limits the total sequence length to this maximum.
+            batch_size(obj:`int`, defaults to 1): The number of samples per batch.
+            use_gpu(obj:`bool`, defaults to `False`): Whether to use gpu to run or not.

        Returns:
-            pooled_output (tensor):  sentence-level output for classification task.
-            sequence_output (tensor): token-level output for sequence task.
+            results(obj:`list`): The predicted labels, one per input sample.
        """
-        self.ernie_config._config_dict['use_task_id'] = False
-        ernie = ErnieModel(src_ids=input_ids,
-                           position_ids=position_ids,
-                           sentence_ids=segment_ids,
-                           task_ids=None,
-                           input_mask=input_mask,
-                           config=self.ernie_config,
-                           use_fp16=False)
-        pooled_output = ernie.get_pooled_output()
-        sequence_output = ernie.get_sequence_output()
-        return pooled_output, sequence_output
-
-    def param_prefix(self):
-        return "@HUB_ernie-tiny@"
-
-
-if __name__ == '__main__':
-    test_module = ErnieTiny()
+        # TODO(zhangxuefei): add task token_classification task predict.
+        if self.task not in ['sequence_classification']:
+            raise RuntimeError("The predict method is for sequence_classification task, but got task %s." % self.task)
+
+        paddle.set_device('gpu' if use_gpu else 'cpu')
+        tokenizer = self.get_tokenizer()
+
+        examples = []
+        for text in data:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, max_seq_len=max_seq_len)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], max_seq_len=max_seq_len)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequences, but got %d. Please check your inputs.' % len(text))
+            examples.append((encoded_inputs['input_ids'], encoded_inputs['segment_ids']))
+
+        def _batchify_fn(batch):
+            input_ids = [entry[0] for entry in batch]
+            segment_ids = [entry[1] for entry in batch]
+            return input_ids, segment_ids
+
+        # Separates the data into batches.
+        batches = []
+        one_batch = []
+        for example in examples:
+            one_batch.append(example)
+            if len(one_batch) == batch_size:
+                batches.append(one_batch)
+                one_batch = []
+        if one_batch:
+            # The last batch whose size is less than the config batch_size setting.
+            batches.append(one_batch)
+
+        results = []
+        self.eval()
+        for batch in batches:
+            input_ids, segment_ids = _batchify_fn(batch)
+            input_ids = paddle.to_tensor(input_ids)
+            segment_ids = paddle.to_tensor(segment_ids)
+
+            # TODO(zhangxuefei): add task token_classification postprocess after prediction.
+            if self.task == 'sequence_classification':
+                probs = self(input_ids, segment_ids)
+                idx = paddle.argmax(probs, axis=1).numpy()
+                idx = idx.tolist()
+                labels = [self.label_map[i] for i in idx]
+                results.extend(labels)
+
+        return results
+
+    @serving
+    def get_embedding(self, texts, use_gpu=False):
+        if self.task is not None:
+            raise RuntimeError("The get_embedding method is only valid when task is None, but got task %s" % self.task)
+
+        paddle.set_device('gpu' if use_gpu else 'cpu')
+
+        tokenizer = self.get_tokenizer()
+        results = []
+        for text in texts:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, pad_to_max_seq_len=False)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], pad_to_max_seq_len=False)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequences, but got %d. Please check your inputs.' % len(text))
+
+            input_ids = paddle.to_tensor(encoded_inputs['input_ids']).unsqueeze(0)
+            segment_ids = paddle.to_tensor(encoded_inputs['segment_ids']).unsqueeze(0)
+            sequence_output, pooled_output = self(input_ids, segment_ids)
+
+            sequence_output = sequence_output.squeeze(0)
+            pooled_output = pooled_output.squeeze(0)
+            results.append((sequence_output.numpy().tolist(), pooled_output.numpy().tolist()))
+        return results
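The batching loop in `predict` above can be sketched on its own, without Paddle. This is a minimal illustration, not the module's code: `split_into_batches` is a hypothetical helper name, and the `(input_ids, segment_ids)` pairs are placeholder values standing in for tokenizer output.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(examples: Sequence[T], batch_size: int) -> List[List[T]]:
    """Group examples into consecutive batches; the last batch may be smaller."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    # Step through the examples in strides of batch_size and slice out each batch.
    return [list(examples[i:i + batch_size]) for i in range(0, len(examples), batch_size)]


if __name__ == "__main__":
    # Placeholder (input_ids, segment_ids) pairs; real values come from the tokenizer.
    examples = [([1, 2], [0, 0]), ([3, 4], [0, 0]), ([5, 6], [0, 0])]
    for batch in split_into_batches(examples, batch_size=2):
        print(len(batch))
```

As in `predict`, a trailing batch smaller than `batch_size` is kept rather than dropped, so every input sample yields a prediction.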
--- a/modules/text/language_model/ernie_v2_eng_base/README.md
+++ b/modules/text/language_model/ernie_v2_eng_base/README.md
+
 ```shell
-$ hub install ernie_v2_eng_base==1.1.0
+$ hub install ernie_v2_eng_base==2.0.0
 ```
+
 <p align="center">
 <img src="https://bj.bcebos.com/paddlehub/paddlehub-img/ernie2.0_arch.png" hspace='10'/> <br />
 </p>
@@ -12,39 +14,44 @@ $ hub install ernie_v2_eng_base==1.1.0
 For more details, see the [ERNIE paper](https://arxiv.org/abs/1907.12412).

 ## API
+
 ```python
-def context(
-    trainable=True,
-    max_seq_len=128
-)
+def __init__(
+    task=None,
+    load_checkpoint=None,
+    label_map=None)
 ```
-Gets the Module's context information: the inputs, the outputs, and a copy of the pretrained Paddle Program.
+
+Creates the Module object (dynamic-graph version).

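 **Parameters**
-> trainable: when True, the Module's parameters are also trained during fine-tuning; otherwise they stay fixed.
-> max_seq_len: the maximum sequence length of the ERNIE model. Shorter sequences are padded to **max_seq_len** and longer ones are truncated to **max_seq_len**; valid values range from 0 to 512.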

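-**Returns**
-> inputs: a dict with the following fields:
-> >**input_ids** stores the Token Embedding, shape \[batch_size, max_seq_len\], dtype int64;
-> >**position_ids** stores the Position Embedding, shape \[batch_size, max_seq_len\], dtype int64;
-> >**segment_ids** stores the Sentence Embedding, shape \[batch_size, max_seq_len\], dtype int64;
-> >**input_mask** marks whether each token is padding, shape \[batch_size, max_seq_len\], dtype int64;
->
-> outputs: a dict of the Module's output features, with the following fields:
-> >**pooled_output** stores sentence-level features, usable for tasks such as text classification, shape \[batch_size, 768\], dtype int64;
-> >**sequence_output** stores token-level features, usable for tasks such as sequence labeling, shape \[batch_size, seq_len, 768\], dtype int64;
->
->  program: the Program containing this Module's computation graph.
+* `task`: the task name; may be `sequence_classification`.
+* `load_checkpoint`: path to a model parameter file saved by training with the PaddleHub Fine-tune API.
+* `label_map`: the mapping from class index to label, used at prediction time.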

+```python
+def predict(
+    data,
+    max_seq_len=128,
+    batch_size=1,
+    use_gpu=False)
+```

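+**Parameters**
+
+* `data`: the data to predict on, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample
+    and each sample may contain text\_a and text\_b. The number of texts per sample (1 or 2) must match what was used in training.
+* `max_seq_len`: the maximum text length the model processes.
+* `batch_size`: the batch size used by the model.
+* `use_gpu`: whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.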

+**Returns**

 ```python
 def get_embedding(
    texts,
-    use_gpu=False,
-    batch_size=1
+    use_gpu=False
 )
 ```

@@ -52,70 +59,86 @@ def get_embedding(

 **Parameters**

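-> texts: the list of input texts, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b.
-> use_gpu: whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.
+* `texts`: the list of input texts, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b.
+* `use_gpu`: whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.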

 **Returns**

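-> results: a list in the format \[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\], where each element is the feature output for the corresponding sample; each sample has a sentence-level feature pooled\_feature and a token-level feature seq\_feature.
->
+* `results`: a list in the format \[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\], where each element is the feature output for the corresponding sample; each sample has a sentence-level feature pooled\_feature and a token-level feature seq\_feature.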

-```python
-def get_params_layer()
-```

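-Gets the parameter-layer information. Used together with ULMFiTStrategy, it allows per-layer learning rates and layer-by-layer unfreezing to be set strictly by layer number.
+**Code example**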

-**Parameters**
+```python
+import paddlehub as hub

-> None
+data = [
+    ['这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般'],
+    ['怀着十分激动的心情放映，可是看着看着发现，在放映完毕后，出现一集米老鼠的动画片'],
+    ['作为老的四星酒店，房间依然很整洁，相当不错。机场接机服务很好，可以在车上办理入住手续，节省时间。'],
+]
+label_map = {0: 'negative', 1: 'positive'}
+
+model = hub.Module(
+    name='ernie_v2_eng_base',
+    version='2.0.0',
+    task='sequence_classification',
+    load_checkpoint='/path/to/parameters',
+    label_map=label_map)
+results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False)
+for idx, text in enumerate(data):
+    print('Data: {} \t Label: {}'.format(text, results[idx]))
+```
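How the scores for each sample become entries of `label_map` can be sketched in plain Python. This assumes `predict` here post-processes the same way as the `ernie_tiny` implementation in this change (softmax over the class axis, then argmax, then a `label_map` lookup); `softmax` and `logits_to_labels` are hypothetical helper names and the logits are made-up values.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores to probabilities, shifting by the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_labels(batch_logits: List[List[float]], label_map: Dict[int, str]) -> List[str]:
    """Pick the most probable class for each sample and map its index to a label."""
    labels = []
    for logits in batch_logits:
        probs = softmax(logits)
        idx = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
        labels.append(label_map[idx])
    return labels


if __name__ == "__main__":
    label_map = {0: 'negative', 1: 'positive'}
    # Made-up logits for two samples, two classes each.
    print(logits_to_labels([[2.0, -1.0], [0.1, 3.0]], label_map))
```

The keys of `label_map` therefore need to cover every class index the classifier can output; a missing key would surface as a `KeyError` at prediction time.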

-**Returns**
+See the PaddleHub text classification example: https://github.com/PaddlePaddle/PaddleHub/tree/release/v2.0.0-beta/demo/text_classification

-> params_layer: a dict whose keys are parameter names and whose values are the layer each parameter belongs to
+## Service deployment

-**Code example**
+PaddleHub Serving can deploy an online service for obtaining pretrained embeddings.

-```python
-import paddlehub as hub
+### Step 1: Start PaddleHub Serving

-# Load ernie pretrained model
-module = hub.Module(name="ernie_v2_eng_base")
-inputs, outputs, program = module.context(trainable=True, max_seq_len=128)
+Run the start command:

-# Must feed all the tensor of ernie's module need
-input_ids = inputs["input_ids"]
-position_ids = inputs["position_ids"]
-segment_ids = inputs["segment_ids"]
-input_mask = inputs["input_mask"]
+```shell
+$ hub serving start -m ernie_v2_eng_base
+```

-# Use "pooled_output" for sentence-level output.
-pooled_output = outputs["pooled_output"]
+This deploys a service API for obtaining pretrained embeddings; the default port is 8866.

-# Use "sequence_output" for token-level output.
-sequence_output = outputs["sequence_output"]
+**NOTE:** To predict on the GPU, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise it does not need to be set.

-# Use "get_embedding" to get embedding result.
-embedding_result = module.get_embedding(texts=[["Sample1_text_a"],["Sample2_text_a","Sample2_text_b"]], use_gpu=True)
+### Step 2: Send a prediction request

-# Use "get_params_layer" to get params layer and used to ULMFiTStrategy.
-params_layer = module.get_params_layer()
-strategy = hub.finetune.strategy.ULMFiTStrategy(frz_params_layer=params_layer, dis_params_layer=params_layer)
-```
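-For an example of fine-tuning with this PaddleHub Module, see [text classification](https://github.com/PaddlePaddle/PaddleHub/tree/release/v1.4.0/demo/text-classification).
+With the server configured, the few lines of code below send a prediction request and fetch the results: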

-**Note**: This PaddleHub Module is best run in a **GPU** environment. If GPU memory runs out, reduce **batch_size** or **max_seq_len**.
+```python
+import requests
+import json
+
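+# Texts to run prediction on; they are wrapped below in a dict {"texts": [text_1, text_2, ... ]}
+text = [["今天是个好日子", "天气预报说今天要下雨"], ["这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般"]]
+# The dict key names the parameter the text is passed to in the prediction method, here "texts"
+# For a local call, the equivalent is module.get_embedding(texts=text)
+data = {"texts": text}
+# Send a POST request; the content type must be JSON
+url = "http://10.12.121.132:8866/predict/ernie_v2_eng_base"
+# Set the POST request headers to application/json
+headers = {"Content-Type": "application/json"}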
+headers = {"Content-Type": "application/json"}
+
+r = requests.post(url=url, headers=headers, data=json.dumps(data))
+print(r.json())
+```
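The request above can be assembled and checked before anything is sent. The sketch below only builds the URL, headers and JSON body; `build_embedding_request` is a hypothetical helper, the `127.0.0.1` host is an assumed local deployment, and the one-or-two-texts check mirrors the validation in the `ernie_tiny` `get_embedding` code in this change rather than anything confirmed for the server side.

```python
import json
from typing import Dict, List, Tuple


def build_embedding_request(texts: List[List[str]],
                            host: str = "127.0.0.1",
                            port: int = 8866,
                            module: str = "ernie_v2_eng_base") -> Tuple[str, Dict[str, str], str]:
    """Return (url, headers, body) for a get_embedding request, without sending it."""
    for sample in texts:
        # Each sample is a list holding text_a, optionally followed by text_b.
        if not isinstance(sample, list) or len(sample) not in (1, 2):
            raise ValueError("each sample must be a list of one or two texts")
    url = "http://{}:{}/predict/{}".format(host, port, module)
    headers = {"Content-Type": "application/json"}
    # The key "texts" matches the parameter name of get_embedding.
    body = json.dumps({"texts": texts})
    return url, headers, body


if __name__ == "__main__":
    url, headers, body = build_embedding_request([["text_a", "text_b"], ["only_text_a"]])
    print(url)
    print(body)
```

The returned values can be passed straight to `requests.post(url=url, headers=headers, data=body)`, as in the example above.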

 ## View the code

 https://github.com/PaddlePaddle/ERNIE

-
 ## Dependencies

-paddlepaddle >= 1.6.2
+paddlepaddle >= 2.0.0

-paddlehub >= 1.6.0
+paddlehub >= 2.0.0

 ## Release history

@@ -130,3 +153,7 @@ paddlehub >= 1.6.0
 * 1.1.0

   Supports get_embedding and get_params_layer
+
+* 2.0.0
+
+  Fully upgraded to the dynamic-graph version; the interface has changed
--- a/modules/text/language_model/ernie_v2_eng_base/model/__init__.py
+++ b/modules/text/language_model/ernie_v2_eng_base/model/__init__.py
--- a/modules/text/language_model/ernie_v2_eng_base/model/ernie.py
+++ b/modules/text/language_model/ernie_v2_eng_base/model/ernie.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Ernie model."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-from __future__ import unicode_literals
-from __future__ import absolute_import
-
-import json
-
-import six
-import paddle.fluid as fluid
-from io import open
-from paddlehub.common.logger import logger
-
-from ernie_v2_eng_base.model.transformer_encoder import encoder, pre_process_layer
-
-
-class ErnieConfig(object):
-    def __init__(self, config_path):
-        self._config_dict = self._parse(config_path)
-
-    def _parse(self, config_path):
-        try:
-            with open(config_path, 'r', encoding='utf8') as json_file:
-                config_dict = json.load(json_file)
-        except Exception:
-            raise IOError("Error in parsing Ernie model config file '%s'" % config_path)
-        else:
-            return config_dict
-
-    def __getitem__(self, key):
-        return self._config_dict.get(key, None)
-
-    def print_config(self):
-        for arg, value in sorted(six.iteritems(self._config_dict)):
-            logger.info('%s: %s' % (arg, value))
-        logger.info('------------------------------------------------')
-
-
-class ErnieModel(object):
-    def __init__(self,
-                 src_ids,
-                 position_ids,
-                 sentence_ids,
-                 task_ids,
-                 input_mask,
-                 config,
-                 weight_sharing=True,
-                 use_fp16=False):
-
-        self._emb_size = config['hidden_size']
-        self._n_layer = config['num_hidden_layers']
-        self._n_head = config['num_attention_heads']
-        self._voc_size = config['vocab_size']
-        self._max_position_seq_len = config['max_position_embeddings']
-        if config['sent_type_vocab_size']:
-            self._sent_types = config['sent_type_vocab_size']
-        else:
-            self._sent_types = config['type_vocab_size']
-
-        self._use_task_id = config['use_task_id']
-        if self._use_task_id:
-            self._task_types = config['task_type_vocab_size']
-        self._hidden_act = config['hidden_act']
-        self._prepostprocess_dropout = config['hidden_dropout_prob']
-        self._attention_dropout = config['attention_probs_dropout_prob']
-        self._weight_sharing = weight_sharing
-
-        self._word_emb_name = "word_embedding"
-        self._pos_emb_name = "pos_embedding"
-        self._sent_emb_name = "sent_embedding"
-        self._task_emb_name = "task_embedding"
-        self._dtype = "float16" if use_fp16 else "float32"
-        self._emb_dtype = "float32"
-
-        # Initialize all weights by truncated normal initializer, and all biases
-        # will be initialized by constant zero by default.
-        self._param_initializer = fluid.initializer.TruncatedNormal(scale=config['initializer_range'])
-
-        self._build_model(src_ids, position_ids, sentence_ids, task_ids, input_mask)
-
-    def _build_model(self, src_ids, position_ids, sentence_ids, task_ids, input_mask):
-        # padding id in vocabulary must be set to 0
-        emb_out = fluid.layers.embedding(input=src_ids,
-                                         size=[self._voc_size, self._emb_size],
-                                         dtype=self._emb_dtype,
-                                         param_attr=fluid.ParamAttr(name=self._word_emb_name,
-                                                                    initializer=self._param_initializer),
-                                         is_sparse=False)
-
-        position_emb_out = fluid.layers.embedding(input=position_ids,
-                                                  size=[self._max_position_seq_len, self._emb_size],
-                                                  dtype=self._emb_dtype,
-                                                  param_attr=fluid.ParamAttr(name=self._pos_emb_name,
-                                                                             initializer=self._param_initializer))
-
-        sent_emb_out = fluid.layers.embedding(sentence_ids,
-                                              size=[self._sent_types, self._emb_size],
-                                              dtype=self._emb_dtype,
-                                              param_attr=fluid.ParamAttr(name=self._sent_emb_name,
-                                                                         initializer=self._param_initializer))
-
-        emb_out = emb_out + position_emb_out
-        emb_out = emb_out + sent_emb_out
-
-        if self._use_task_id:
-            task_emb_out = fluid.layers.embedding(task_ids,
-                                                  size=[self._task_types, self._emb_size],
-                                                  dtype=self._emb_dtype,
-                                                  param_attr=fluid.ParamAttr(name=self._task_emb_name,
-                                                                             initializer=self._param_initializer))
-
-            emb_out = emb_out + task_emb_out
-
-        emb_out = pre_process_layer(emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
-
-        if self._dtype == "float16":
-            emb_out = fluid.layers.cast(x=emb_out, dtype=self._dtype)
-            input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
-        self_attn_mask = fluid.layers.matmul(x=input_mask, y=input_mask, transpose_y=True)
-
-        self_attn_mask = fluid.layers.scale(x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
-        n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
-        n_head_self_attn_mask.stop_gradient = True
-
-        self._enc_out = encoder(enc_input=emb_out,
-                                attn_bias=n_head_self_attn_mask,
-                                n_layer=self._n_layer,
-                                n_head=self._n_head,
-                                d_key=self._emb_size // self._n_head,
-                                d_value=self._emb_size // self._n_head,
-                                d_model=self._emb_size,
-                                d_inner_hid=self._emb_size * 4,
-                                prepostprocess_dropout=self._prepostprocess_dropout,
-                                attention_dropout=self._attention_dropout,
-                                relu_dropout=0,
-                                hidden_act=self._hidden_act,
-                                preprocess_cmd="",
-                                postprocess_cmd="dan",
-                                param_initializer=self._param_initializer,
-                                name='encoder')
-
-    def get_sequence_output(self):
-        return self._enc_out
-
-    def get_pooled_output(self):
-        """Get the first feature of each sequence for classification"""
-        next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
-        if self._dtype == "float16":
-            next_sent_feat = fluid.layers.cast(x=next_sent_feat, dtype=self._emb_dtype)
-        next_sent_feat = fluid.layers.fc(input=next_sent_feat,
-                                         size=self._emb_size,
-                                         act="tanh",
-                                         param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
-                                                                    initializer=self._param_initializer),
-                                         bias_attr="pooled_fc.b_0")
-        return next_sent_feat
-
-    def get_lm_output(self, mask_label, mask_pos):
-        """Get the loss & accuracy for pretraining"""
-
-        mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32')
-
-        # extract the first token feature in each sentence
-        self.next_sent_feat = self.get_pooled_output()
-        reshaped_emb_out = fluid.layers.reshape(x=self._enc_out, shape=[-1, self._emb_size])
-        # extract masked tokens' feature
-        mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
-        if self._dtype == "float16":
-            mask_feat = fluid.layers.cast(x=mask_feat, dtype=self._emb_dtype)
-
-        # transform: fc
-        mask_trans_feat = fluid.layers.fc(input=mask_feat,
-                                          size=self._emb_size,
-                                          act=self._hidden_act,
-                                          param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
-                                                                     initializer=self._param_initializer),
-                                          bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
-
-        # transform: layer norm
-        mask_trans_feat = fluid.layers.layer_norm(mask_trans_feat,
-                                                  begin_norm_axis=len(mask_trans_feat.shape) - 1,
-                                                  param_attr=fluid.ParamAttr(
-                                                      name='mask_lm_trans_layer_norm_scale',
-                                                      initializer=fluid.initializer.Constant(1.)),
-                                                  bias_attr=fluid.ParamAttr(name='mask_lm_trans_layer_norm_bias',
-                                                                            initializer=fluid.initializer.Constant(1.)))
-        # transform: layer norm
-        #mask_trans_feat = pre_process_layer(
-        #    mask_trans_feat, 'n', name='mask_lm_trans')
-
-        mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
-                                                initializer=fluid.initializer.Constant(value=0.0))
-        if self._weight_sharing:
-            fc_out = fluid.layers.matmul(x=mask_trans_feat,
-                                         y=fluid.default_main_program().global_block().var(self._word_emb_name),
-                                         transpose_y=True)
-            fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
-                                                    dtype=self._emb_dtype,
-                                                    attr=mask_lm_out_bias_attr,
-                                                    is_bias=True)
-
-        else:
-            fc_out = fluid.layers.fc(input=mask_trans_feat,
-                                     size=self._voc_size,
-                                     param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
-                                                                initializer=self._param_initializer),
-                                     bias_attr=mask_lm_out_bias_attr)
-
-        mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
-        mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
-
-        return mean_mask_lm_loss
-
-    def get_task_output(self, task, task_labels):
-        task_fc_out = fluid.layers.fc(input=self.next_sent_feat,
-                                      size=task["num_labels"],
-                                      param_attr=fluid.ParamAttr(name=task["task_name"] + "_fc.w_0",
-                                                                 initializer=self._param_initializer),
-                                      bias_attr=task["task_name"] + "_fc.b_0")
-        task_loss, task_softmax = fluid.layers.softmax_with_cross_entropy(logits=task_fc_out,
-                                                                          label=task_labels,
-                                                                          return_softmax=True)
-        task_acc = fluid.layers.accuracy(input=task_softmax, label=task_labels)
-        mean_task_loss = fluid.layers.mean(task_loss)
-        return mean_task_loss, task_acc
--- a/modules/text/language_model/ernie_v2_eng_base/model/transformer_encoder.py
+++ b/modules/text/language_model/ernie_v2_eng_base/model/transformer_encoder.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Transformer encoder."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from functools import partial
-
-import paddle.fluid as fluid
-import paddle.fluid.layers as layers
-
-
-def multi_head_attention(queries,
-                         keys,
-                         values,
-                         attn_bias,
-                         d_key,
-                         d_value,
-                         d_model,
-                         n_head=1,
-                         dropout_rate=0.,
-                         cache=None,
-                         param_initializer=None,
-                         name='multi_head_att'):
-    """
-    Multi-Head Attention. Note that attn_bias is added to the logit before
-    computing softmax activation to mask certain selected positions so that
-    they will not be considered in attention weights.
-    """
-    keys = queries if keys is None else keys
-    values = keys if values is None else values
-
-    if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
-        raise ValueError("Inputs: queries, keys and values should all be 3-D tensors.")
-
-    def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
-        """
-        Add linear projection to queries, keys, and values.
-        """
-        q = layers.fc(input=queries,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_query_fc.b_0')
-        k = layers.fc(input=keys,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_key_fc.b_0')
-        v = layers.fc(input=values,
-                      size=d_value * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_value_fc.b_0')
-        return q, k, v
-
-    def __split_heads(x, n_head):
-        """
-        Reshape the last dimension of input tensor x so that it becomes two
-        dimensions and then transpose. Specifically, input a tensor with shape
-        [bs, max_sequence_length, n_head * hidden_dim] then output a tensor
-        with shape [bs, n_head, max_sequence_length, hidden_dim].
-        """
-        hidden_size = x.shape[-1]
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
-
-        # permute the dimensions into:
-        # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
-        return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
-
-    def __combine_heads(x):
-        """
-        Transpose and then reshape the last two dimensions of input tensor x
-        so that it becomes one dimension, which is reverse to __split_heads.
-        """
-        if len(x.shape) == 3: return x
-        if len(x.shape) != 4:
-            raise ValueError("Input(x) should be a 4-D Tensor.")
-
-        trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
-
-    def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
-        """
-        Scaled Dot-Product Attention
-        """
-        scaled_q = layers.scale(x=q, scale=d_key**-0.5)
-        product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
-        if attn_bias:
-            product += attn_bias
-        weights = layers.softmax(product)
-        if dropout_rate:
-            weights = layers.dropout(weights,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-        out = layers.matmul(weights, v)
-        return out
-
-    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
-
-    if cache is not None:  # use cache and concat time steps
-        # Since the inplace reshape in __split_heads changes the shape of k and
-        # v, which is the cache input for next time step, reshape the cache
-        # input from the previous time step first.
-        k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
-        v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
-
-    q = __split_heads(q, n_head)
-    k = __split_heads(k, n_head)
-    v = __split_heads(v, n_head)
-
-    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
-
-    out = __combine_heads(ctx_multiheads)
-
-    # Project back to the model size.
-    proj_out = layers.fc(input=out,
-                         size=d_model,
-                         num_flatten_dims=2,
-                         param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
-                         bias_attr=name + '_output_fc.b_0')
-    return proj_out
-
-
-def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
-    """
-    Position-wise Feed-Forward Networks.
-    This module consists of two linear transformations with a ReLU activation
-    in between, which is applied to each position separately and identically.
-    """
-    hidden = layers.fc(input=x,
-                       size=d_inner_hid,
-                       num_flatten_dims=2,
-                       act=hidden_act,
-                       param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
-                       bias_attr=name + '_fc_0.b_0')
-    if dropout_rate:
-        hidden = layers.dropout(hidden,
-                                dropout_prob=dropout_rate,
-                                dropout_implementation="upscale_in_train",
-                                is_test=False)
-    out = layers.fc(input=hidden,
-                    size=d_hid,
-                    num_flatten_dims=2,
-                    param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
-                    bias_attr=name + '_fc_1.b_0')
-    return out
-
-
-def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
-    """
-    Add residual connection, layer normalization and dropout to the out tensor
-    optionally according to the value of process_cmd.
-    This will be used before or after multi-head attention and position-wise
-    feed-forward networks.
-    """
-    for cmd in process_cmd:
-        if cmd == "a":  # add residual connection
-            out = out + prev_out if prev_out else out
-        elif cmd == "n":  # add layer normalization
-            out_dtype = out.dtype
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float32")
-            out = layers.layer_norm(out,
-                                    begin_norm_axis=len(out.shape) - 1,
-                                    param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
-                                                               initializer=fluid.initializer.Constant(1.)),
-                                    bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
-                                                              initializer=fluid.initializer.Constant(0.)))
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float16")
-        elif cmd == "d":  # add dropout
-            if dropout_rate:
-                out = layers.dropout(out,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-    return out
-
-
-pre_process_layer = partial(pre_post_process_layer, None)
-post_process_layer = pre_post_process_layer
-
-
-def encoder_layer(enc_input,
-                  attn_bias,
-                  n_head,
-                  d_key,
-                  d_value,
-                  d_model,
-                  d_inner_hid,
-                  prepostprocess_dropout,
-                  attention_dropout,
-                  relu_dropout,
-                  hidden_act,
-                  preprocess_cmd="n",
-                  postprocess_cmd="da",
-                  param_initializer=None,
-                  name=''):
-    """The encoder layers that can be stacked to form a deep encoder.
-    This module consists of a multi-head (self) attention followed by
-    position-wise feed-forward networks, both of which are accompanied
-    by the post_process_layer to add residual connection, layer normalization
-    and dropout.
-    """
-    attn_output = multi_head_attention(pre_process_layer(enc_input,
-                                                         preprocess_cmd,
-                                                         prepostprocess_dropout,
-                                                         name=name + '_pre_att'),
-                                       None,
-                                       None,
-                                       attn_bias,
-                                       d_key,
-                                       d_value,
-                                       d_model,
-                                       n_head,
-                                       attention_dropout,
-                                       param_initializer=param_initializer,
-                                       name=name + '_multi_head_att')
-    attn_output = post_process_layer(enc_input,
-                                     attn_output,
-                                     postprocess_cmd,
-                                     prepostprocess_dropout,
-                                     name=name + '_post_att')
-    ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
-                                                             preprocess_cmd,
-                                                             prepostprocess_dropout,
-                                                             name=name + '_pre_ffn'),
-                                           d_inner_hid,
-                                           d_model,
-                                           relu_dropout,
-                                           hidden_act,
-                                           param_initializer=param_initializer,
-                                           name=name + '_ffn')
-    return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
-
-
-def encoder(enc_input,
-            attn_bias,
-            n_layer,
-            n_head,
-            d_key,
-            d_value,
-            d_model,
-            d_inner_hid,
-            prepostprocess_dropout,
-            attention_dropout,
-            relu_dropout,
-            hidden_act,
-            preprocess_cmd="n",
-            postprocess_cmd="da",
-            param_initializer=None,
-            name=''):
-    """
-    The encoder is composed of a stack of identical layers returned by calling
-    encoder_layer.
-    """
-    for i in range(n_layer):
-        enc_output = encoder_layer(enc_input,
-                                   attn_bias,
-                                   n_head,
-                                   d_key,
-                                   d_value,
-                                   d_model,
-                                   d_inner_hid,
-                                   prepostprocess_dropout,
-                                   attention_dropout,
-                                   relu_dropout,
-                                   hidden_act,
-                                   preprocess_cmd,
-                                   postprocess_cmd,
-                                   param_initializer=param_initializer,
-                                   name=name + '_layer_' + str(i))
-        enc_input = enc_output
-    enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
-
-    return enc_output
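
Review note: the `preprocess_cmd` / `postprocess_cmd` strings in the removed encoder (for example `postprocess_cmd="dan"`) are read one character at a time, where `d` applies dropout, `a` adds the residual and `n` applies layer normalization. The block below is only a minimal pure-Python sketch of that dispatch, not the Paddle implementation. It assumes plain lists of floats instead of tensors, a layer norm without learnable scale or bias, an epsilon of 1e-5, and dropout treated as a no-op (rate 0). The names `layer_norm` and `apply_process_cmd` are hypothetical.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no scale/bias; assumption)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def apply_process_cmd(prev_out, out, process_cmd):
    """Apply each command character in order, mirroring pre_post_process_layer."""
    for cmd in process_cmd:
        if cmd == "a":  # add residual connection, skipped when there is no previous output
            if prev_out is not None:
                out = [o + p for o, p in zip(out, prev_out)]
        elif cmd == "n":  # layer normalization over the last (only) dimension
            out = layer_norm(out)
        elif cmd == "d":  # dropout, treated here as rate 0, so the values pass through
            pass
    return out


# "dan": dropout (no-op here), then residual add, then layer norm.
post = apply_process_cmd([1.0, 2.0, 3.0], [0.5, 0.5, 0.5], "dan")
print(post)
```

With `prev_out=None` and `process_cmd="n"` the same function plays the role of `pre_process_layer`, which is how the removed code builds it with `functools.partial`.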
--- a/modules/text/language_model/ernie_v2_eng_base/module.py
+++ b/modules/text/language_model/ernie_v2_eng_base/module.py
-# coding:utf-8
-# Copyright (c) 2019  PaddlePaddle Authors. All Rights Reserved.
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
 #
-# Licensed under the Apache License, Version 2.0 (the "License"
+# Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
@@ -12,64 +11,211 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
+from typing import Dict, List, Optional, Union, Tuple
 import os

-from paddlehub import TransformerModule
-from paddlehub.module.module import moduleinfo
+from paddle.dataset.common import DATA_HOME
+import paddle
+import paddle.nn as nn
+import paddle.nn.functional as F

-from ernie_v2_eng_base.model.ernie import ErnieModel, ErnieConfig
+from paddlehub import BertTokenizer
+from paddlehub.module.modeling_ernie import ErnieModel, ErnieForSequenceClassification
+from paddlehub.module.module import moduleinfo, serving
+from paddlehub.utils.log import logger
+from paddlehub.utils.utils import download


 @moduleinfo(
    name="ernie_v2_eng_base",
-    version="1.1.0",
+    version="2.0.0",
    summary=
-    "Baidu's ERNIE 2.0, Enhanced Representation through kNowledge IntEgration, A Continual Pre-training Framework for Language Understanding. 12-layer, 768-hidden, 12-heads, 110M parameters.",
-    author="baidu-nlp",
+    "Baidu's ERNIE 2.0, Enhanced Representation through kNowledge IntEgration, max_seq_len=512 when pretrained. The module is executed as paddle.dygraph.",
+    author="paddlepaddle",
    author_email="",
-    type="nlp/semantic_model",
-)
-class ErnieV2EngBase(TransformerModule):
-    def _initialize(self):
-        ernie_config_path = os.path.join(self.directory, "assets", "ernie_config.json")
-        self.ernie_config = ErnieConfig(ernie_config_path)
-        self.MAX_SEQ_LEN = 512
-        self.params_path = os.path.join(self.directory, "assets", "params")
-        self.vocab_path = os.path.join(self.directory, "assets", "vocab.txt")\
-
-    def net(self, input_ids, position_ids, segment_ids, input_mask):
+    type="nlp/semantic_model")
+class ErnieV2(nn.Layer):
+    """
+    Ernie model
+    """
+
+    def __init__(
+            self,
+            task=None,
+            load_checkpoint=None,
+            label_map=None,
+    ):
+        super(ErnieV2, self).__init__()
+        # TODO(zhangxuefei): add token_classification task
+        if task == 'sequence_classification':
+            self.model = ErnieForSequenceClassification.from_pretrained(
+                pretrained_model_name_or_path='ernie_v2_eng_base')
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = paddle.metric.Accuracy(name='acc_accumulation')
+        elif task is None:
+            self.model = ErnieModel.from_pretrained(pretrained_model_name_or_path='ernie_v2_eng_base')
+        else:
+            raise RuntimeError("Unknown task %s, task should be sequence_classification" % task)
+
+        self.task = task
+        self.label_map = label_map
+
+        if load_checkpoint is not None and os.path.isfile(load_checkpoint):
+            state_dict = paddle.load(load_checkpoint)
+            self.set_state_dict(state_dict)
+            logger.info('Loaded parameters from %s' % os.path.abspath(load_checkpoint))
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, labels=None):
+        result = self.model(input_ids, token_type_ids, position_ids, attention_mask)
+        if self.task is not None:
+            logits = result
+            probs = F.softmax(logits, axis=1)
+            if labels is not None:
+                loss = self.criterion(logits, labels)
+                correct = self.metric.compute(probs, labels)
+                acc = self.metric.update(correct)
+                return probs, loss, acc
+            return probs
+        else:
+            sequence_output, pooled_output = result
+            return sequence_output, pooled_output
+
+    def get_vocab_path(self):
+        """
+        Gets the path of the module vocabulary file, downloading it first if it is missing.
+        """
+        save_path = os.path.join(DATA_HOME, 'ernie', 'vocab.txt')
+        if not os.path.exists(save_path) or not os.path.isfile(save_path):
+            url = "https://paddlenlp.bj.bcebos.com/models/transformers/ernie_v2_eng_base/vocab.txt"
+            download(url, os.path.join(DATA_HOME, 'ernie'))
+        return save_path
+
+    def get_tokenizer(self, tokenize_chinese_chars=True):
+        """
+        Gets the tokenizer that is customized for this module.
+        Args:
+            tokenize_chinese_chars (:obj:`bool`, defaults to :obj:`True`):
+                Whether to tokenize Chinese characters or not.
+        Returns:
+            tokenizer (:obj:`BertTokenizer`): The tokenizer customized for this module.
+        """
+        return BertTokenizer(tokenize_chinese_chars=tokenize_chinese_chars, vocab_file=self.get_vocab_path())
+
+    def training_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        One training step, which runs the forward computation.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data, which contains the inputs the model needs,
+                here input_ids, token_type_ids (segment ids) and labels, in that order.
+            batch_idx(int): The index of the batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as loss and metrics.
        """
-        create neural network.
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'loss': avg_loss, 'metrics': {'acc': acc}}
+
+    def validation_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        One validation step, which runs the forward computation.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data, which contains the inputs the model needs,
+                here input_ids, token_type_ids (segment ids) and labels, in that order.
+            batch_idx(int): The index of the batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as metrics.
+        """
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'metrics': {'acc': acc}}
+
+    def predict(self, data, max_seq_len=128, batch_size=1, use_gpu=False):
+        """
+        Predicts the data labels.

        Args:
-            input_ids (tensor): the word ids.
-            position_ids (tensor): the position ids.
-            segment_ids (tensor): the segment ids.
-            input_mask (tensor): the padding mask.
+            data (obj:`List[List[str]]`): The data to predict; each element is a sample holding one or two raw texts.
+            max_seq_len (:obj:`int`, `optional`, defaults to 128):
+                If set to a number, will limit the total sequence returned so that it has a maximum length.
+            batch_size(obj:`int`, defaults to 1): The number of samples per batch.
+            use_gpu(obj:`bool`, defaults to `False`): Whether to use gpu to run or not.

        Returns:
-            pooled_output (tensor):  sentence-level output for classification task.
-            sequence_output (tensor): token-level output for sequence task.
+            results(obj:`list`): All the predicted labels.
        """
-        self.ernie_config._config_dict['use_task_id'] = False
-        ernie = ErnieModel(src_ids=input_ids,
-                           position_ids=position_ids,
-                           sentence_ids=segment_ids,
-                           task_ids=None,
-                           input_mask=input_mask,
-                           config=self.ernie_config,
-                           use_fp16=False)
-        pooled_output = ernie.get_pooled_output()
-        sequence_output = ernie.get_sequence_output()
-        return pooled_output, sequence_output
-
-
-if __name__ == '__main__':
-    test_module = ErnieV2EngBase()
+        # TODO(zhangxuefei): add task token_classification task predict.
+        if self.task not in ['sequence_classification']:
+            raise RuntimeError("The predict method is for sequence_classification task, but got task %s." % self.task)
+
+        paddle.set_device('gpu' if use_gpu else 'cpu')
+        tokenizer = self.get_tokenizer()
+
+        examples = []
+        for text in data:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, max_seq_len=max_seq_len)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], max_seq_len=max_seq_len)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequences, but got %d. Please check your inputs.' % len(text))
+            examples.append((encoded_inputs['input_ids'], encoded_inputs['segment_ids']))
+
+        def _batchify_fn(batch):
+            input_ids = [entry[0] for entry in batch]
+            segment_ids = [entry[1] for entry in batch]
+            return input_ids, segment_ids
+
+        # Separates the data into batches.
+        batches = []
+        one_batch = []
+        for example in examples:
+            one_batch.append(example)
+            if len(one_batch) == batch_size:
+                batches.append(one_batch)
+                one_batch = []
+        if one_batch:
+            # The last batch whose size is less than the config batch_size setting.
+            batches.append(one_batch)
+
+        results = []
+        self.eval()
+        for batch in batches:
+            input_ids, segment_ids = _batchify_fn(batch)
+            input_ids = paddle.to_tensor(input_ids)
+            segment_ids = paddle.to_tensor(segment_ids)
+
+            # TODO(zhangxuefei): add task token_classification postprocess after prediction.
+            if self.task == 'sequence_classification':
+                probs = self(input_ids, segment_ids)
+                idx = paddle.argmax(probs, axis=1).numpy()
+                idx = idx.tolist()
+                labels = [self.label_map[i] for i in idx]
+                results.extend(labels)
+
+        return results
+
+    @serving
+    def get_embedding(self, texts, use_gpu=False):
+        if self.task is not None:
+            raise RuntimeError("The get_embedding method is only valid when task is None, but got task %s" % self.task)
+
+        paddle.set_device('gpu' if use_gpu else 'cpu')
+
+        tokenizer = self.get_tokenizer()
+        results = []
+        for text in texts:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, pad_to_max_seq_len=False)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], pad_to_max_seq_len=False)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequences, but got %d. Please check your inputs.' % len(text))
+
+            input_ids = paddle.to_tensor(encoded_inputs['input_ids']).unsqueeze(0)
+            segment_ids = paddle.to_tensor(encoded_inputs['segment_ids']).unsqueeze(0)
+            sequence_output, pooled_output = self(input_ids, segment_ids)
+
+            sequence_output = sequence_output.squeeze(0)
+            pooled_output = pooled_output.squeeze(0)
+            results.append((sequence_output.numpy().tolist(), pooled_output.numpy().tolist()))
+        return results
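
Review note: the batching loop inside `predict` above can be read in isolation. The block below is a minimal standalone sketch of the same logic, assuming the examples are any Python objects; the helper name `make_batches` and the `batch_size` check are hypothetical additions, not part of the module.

```python
def make_batches(examples, batch_size):
    """Split examples into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        # Guard added for the sketch; the module code does not validate batch_size.
        raise ValueError("batch_size must be a positive integer")
    batches = []
    one_batch = []
    for example in examples:
        one_batch.append(example)
        if len(one_batch) == batch_size:
            batches.append(one_batch)
            one_batch = []
    if one_batch:
        # The last batch may hold fewer than batch_size items.
        batches.append(one_batch)
    return batches


batches = make_batches(["a", "b", "c", "d", "e"], batch_size=2)
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```

Because `predict` feeds each batch to `paddle.to_tensor`, all samples in a batch need the same length; the module relies on the tokenizer's handling of `max_seq_len` for that, which this sketch does not model.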
--- a/modules/text/language_model/ernie_v2_eng_large/README.md
+++ b/modules/text/language_model/ernie_v2_eng_large/README.md
+
 ```shell
-$ hub install ernie_v2_eng_large==1.1.0
+$ hub install ernie_v2_eng_large==2.0.0
 ```
+
 <p align="center">
 <img src="https://bj.bcebos.com/paddlehub/paddlehub-img/ernie2.0_arch.png" hspace='10'/> <br />
 </p>
@@ -12,40 +14,44 @@ $ hub install ernie_v2_eng_large==1.1.0
 For more details, see the [ERNIE paper](https://arxiv.org/abs/1907.12412).

 ## API
+
 ```python
-def context(
-    trainable=True,
-    max_seq_len=128
-)
+def __init__(
+    task=None,
+    load_checkpoint=None,
+    label_map=None)
 ```
-Gets the Module's context information: the inputs, the outputs, and a copy of the pretrained Paddle Program.
+
+Creates the Module object (dynamic graph version).

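 **Parameters**
-> trainable: When set to True, the Module's parameters are also trained during fine-tuning; otherwise they stay unchanged.
-> max_seq_len: The maximum sequence length of the ERNIE model. Shorter sequences are padded to **max_seq_len**, and longer sequences are truncated to **max_seq_len**. Valid values range from 0 to 512.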

-**Returns**
-> inputs: a dict with the following fields:
-> >**input_ids** corresponds to the Token Embedding in the figure above; shape \[batch_size, max_seq_len\], dtype int64;
-> >**position_ids** corresponds to the Position Embedding in the figure above; shape \[batch_size, max_seq_len\], dtype int64;
-> >**segment_ids** corresponds to the Sentence Embedding in the figure above; shape \[batch_size, max_seq_len\], dtype int64;
-> >**input_mask** marks whether each token is padding; shape \[batch_size, max_seq_len\], dtype int64;
-> >**task_id** identifies the Task Embedding in the figure above; shape \[batch_size, max_seq_len\], dtype int64;
->
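-> outputs: a dict holding the Module's output features, with the following fields:
-> >**pooled_output** holds sentence-level features, usable for tasks such as text classification; shape \[batch_size, 768\], dtype int64;
-> >**sequence_output** holds token-level features, usable for tasks such as sequence labeling; shape \[batch_size, seq_len, 768\], dtype int64;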
->
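->  program: the Program that contains this Module's computation graph.
+* `task`: The task name; can be `sequence_classification`.
+* `load_checkpoint`: Path to the model parameter file saved by training with the PaddleHub Fine-tune API.
+* `label_map`: The class mapping table used at prediction time.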
+
+```python
+def predict(
+    data,
+    max_seq_len=128,
+    batch_size=1,
+    use_gpu=False)
+```

+**Parameters**

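+* `data`: The data to predict, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and
+    each sample may contain text\_a and text\_b. The number of texts per sample (one or two) must match what was used in training.
+* `max_seq_len`: The maximum text length the model processes.
+* `batch_size`: The batch size used by the model.
+* `use_gpu`: Whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.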

+**Returns**
+
+* `results`: The list of predicted labels, one per sample.

 ```python
 def get_embedding(
    texts,
-    use_gpu=False,
-    batch_size=1
+    use_gpu=False
 )
 ```

@@ -53,73 +59,86 @@ def get_embedding(

 **Parameters**

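-> texts: The list of input texts, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b.
-> use_gpu: Whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.
+* `texts`: The list of input texts, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b.
+* `use_gpu`: Whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.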

 **返回**

-> results：list类型，格式为\[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\]，其中每个元素都是对应样例的特征输出，每个样例都有句子粒度特征pooled\_feature与字粒度特征seq\_feature。
->
+* `results`：list类型，格式为\[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\]，其中每个元素都是对应样例的特征输出，每个样例都有句子粒度特征pooled\_feature与字粒度特征seq\_feature。

-```python
-def get_params_layer()
-```

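-Gets parameter-layer information. Used together with ULMFiTStrategy, it allows layer-wise learning rates and gradual unfreezing to be set strictly by layer number.
+**Code example**
 
-**Parameters**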
+```python
+import paddlehub as hub

-> None
+data = [
+    '这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般',
+    '怀着十分激动的心情放映，可是看着看着发现，在放映完毕后，出现一集米老鼠的动画片',
+    '作为老的四星酒店，房间依然很整洁，相当不错。机场接机服务很好，可以在车上办理入住手续，节省时间。',
+]
+label_map = {0: 'negative', 1: 'positive'}
+
+model = hub.Module(
+    name='ernie_v2_eng_large',
+    version='2.0.0',
+    task='sequence_classification',
+    load_checkpoint='/path/to/parameters',
+    label_map=label_map)
+results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False)
+for idx, text in enumerate(data):
+    print('Data: {} \t Label: {}'.format(text, results[idx]))
+```

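-**Returns**
+See the PaddleHub text classification example: https://github.com/PaddlePaddle/PaddleHub/tree/release/v2.0.0-beta/demo/text_classification
 
-> params_layer: a dict whose keys are parameter names and whose values are the layer each parameter belongs to.
+## Serving deployment
 
-**Code example**
+PaddleHub Serving can deploy an online service for obtaining pretrained embeddings.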

-```python
-import paddlehub as hub
+### Step1: Start PaddleHub Serving

-# Load ernie pretrained model
-module = hub.Module(name="ernie_v2_eng_large")
-inputs, outputs, program = module.context(trainable=True, max_seq_len=128)
+Run the start command:

-# Must feed all the following tensor of ernie's module need
-input_ids = inputs["input_ids"]
-position_ids = inputs["position_ids"]
-segment_ids = inputs["segment_ids"]
-input_mask = inputs["input_mask"]
-# task_ids is not necessary during finetuning
-task_ids = inputs["task_ids"]
+```shell
+$ hub serving start -m ernie_v2_eng_large
+```

-# Use "pooled_output" for sentence-level output.
-pooled_output = outputs["pooled_output"]
+This deploys a serving API for obtaining pretrained embeddings; the default port is 8866.

-# Use "sequence_output" for token-level output.
-sequence_output = outputs["sequence_output"]
+**NOTE:** To predict on a GPU, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise it does not need to be set.

-# Use "get_embedding" to get embedding result.
-embedding_result = module.get_embedding(texts=[["Sample1_text_a"],["Sample2_text_a","Sample2_text_b"]], use_gpu=True)
+### Step2: Send a prediction request

-# Use "get_params_layer" to get params layer and used to ULMFiTStrategy.
-params_layer = module.get_params_layer()
-strategy = hub.finetune.strategy.ULMFiTStrategy(frz_params_layer=params_layer, dis_params_layer=params_layer)
-```
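-For an example of fine-tuning with this PaddleHub Module, see [text classification](https://github.com/PaddlePaddle/PaddleHub/tree/release/v1.2/demo/text-classification).
+Once the server is configured, the few lines of code below send a prediction request and fetch the result.
 
-`Note`: this PaddleHub Module is best run in a **GPU** environment. If GPU memory runs out, reduce **batch_size** or **max_seq_len**.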
+```python
+import requests
+import json
+
+# Specify the texts to predict and build the dict {"texts": [text_1, text_2, ... ]}
+text = [["今天是个好日子", "天气预报说今天要下雨"], ["这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般"]]
+# The key names the parameter that the texts are passed to in the prediction method; here it is "texts"
+# The equivalent local call is module.get_embedding(texts=text)
+data = {"texts": text}
+# Send a POST request; the content type must be JSON
+url = "http://10.12.121.132:8866/predict/ernie_v2_eng_large"
+# Set the POST request headers to application/json
+headers = {"Content-Type": "application/json"}
+
+r = requests.post(url=url, headers=headers, data=json.dumps(data))
+print(r.json())
+```
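
The request above can be assembled and inspected without a running server. The sketch below only builds the URL, headers, and JSON body; the `/predict/<module name>` path and port 8866 come from the example above, while `build_request` and the `127.0.0.1` host are illustrative assumptions.

```python
import json


def build_request(texts, module_name="ernie_v2_eng_large", host="127.0.0.1", port=8866):
    """Build (url, headers, body) for a PaddleHub Serving prediction request.

    Hypothetical helper: host defaults to the local machine, which is an
    assumption; replace it with the address of the serving host.
    """
    url = "http://{}:{}/predict/{}".format(host, port, module_name)
    headers = {"Content-Type": "application/json"}
    # The key "texts" matches the parameter name of get_embedding.
    body = json.dumps({"texts": texts})
    return url, headers, body


url, headers, body = build_request([["Sample1_text_a"], ["Sample2_text_a", "Sample2_text_b"]])
print(url)
print(body)
```

The returned values can be passed to `requests.post(url=url, headers=headers, data=body)` once the service is up.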

 ## View code

 https://github.com/PaddlePaddle/ERNIE

-
 ## Dependencies

-paddlepaddle >= 1.6.2
-
-paddlehub >= 1.6.0
+paddlepaddle >= 2.0.0

+paddlehub >= 2.0.0

 ## Release history

@@ -134,3 +153,7 @@ paddlehub >= 1.6.0
 * 1.1.0

   Supports get_embedding and get_params_layer
+
+* 2.0.0
+
+  Fully upgraded to the dynamic-graph version; the API has changed
--- a/modules/text/language_model/ernie_v2_eng_large/model/__init__.py
+++ b/modules/text/language_model/ernie_v2_eng_large/model/__init__.py
--- a/modules/text/language_model/ernie_v2_eng_large/model/ernie.py
+++ b/modules/text/language_model/ernie_v2_eng_large/model/ernie.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Ernie model."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-from __future__ import unicode_literals
-from __future__ import absolute_import
-
-import json
-
-import six
-import paddle.fluid as fluid
-from io import open
-from paddlehub.common.logger import logger
-
-from ernie_v2_eng_large.model.transformer_encoder import encoder, pre_process_layer
-
-
-class ErnieConfig(object):
-    def __init__(self, config_path):
-        self._config_dict = self._parse(config_path)
-
-    def _parse(self, config_path):
-        try:
-            with open(config_path, 'r', encoding='utf8') as json_file:
-                config_dict = json.load(json_file)
-        except Exception:
-            raise IOError("Error in parsing Ernie model config file '%s'" % config_path)
-        else:
-            return config_dict
-
-    def __getitem__(self, key):
-        return self._config_dict.get(key, None)
-
-    def print_config(self):
-        for arg, value in sorted(six.iteritems(self._config_dict)):
-            logger.info('%s: %s' % (arg, value))
-        logger.info('------------------------------------------------')
-
-
-class ErnieModel(object):
-    def __init__(self,
-                 src_ids,
-                 position_ids,
-                 sentence_ids,
-                 task_ids,
-                 input_mask,
-                 config,
-                 weight_sharing=True,
-                 use_fp16=False):
-
-        self._emb_size = config['hidden_size']
-        self._n_layer = config['num_hidden_layers']
-        self._n_head = config['num_attention_heads']
-        self._voc_size = config['vocab_size']
-        self._max_position_seq_len = config['max_position_embeddings']
-        if config['sent_type_vocab_size']:
-            self._sent_types = config['sent_type_vocab_size']
-        else:
-            self._sent_types = config['type_vocab_size']
-
-        self._use_task_id = config['use_task_id']
-        if self._use_task_id:
-            self._task_types = config['task_type_vocab_size']
-        self._hidden_act = config['hidden_act']
-        self._prepostprocess_dropout = config['hidden_dropout_prob']
-        self._attention_dropout = config['attention_probs_dropout_prob']
-        self._weight_sharing = weight_sharing
-
-        self._word_emb_name = "word_embedding"
-        self._pos_emb_name = "pos_embedding"
-        self._sent_emb_name = "sent_embedding"
-        self._task_emb_name = "task_embedding"
-        self._dtype = "float16" if use_fp16 else "float32"
-        self._emb_dtype = "float32"
-
-        # Initialize all weights by truncated normal initializer, and all biases
-        # will be initialized by constant zero by default.
-        self._param_initializer = fluid.initializer.TruncatedNormal(scale=config['initializer_range'])
-
-        self._build_model(src_ids, position_ids, sentence_ids, task_ids, input_mask)
-
-    def _build_model(self, src_ids, position_ids, sentence_ids, task_ids, input_mask):
-        # padding id in vocabulary must be set to 0
-        emb_out = fluid.layers.embedding(input=src_ids,
-                                         size=[self._voc_size, self._emb_size],
-                                         dtype=self._emb_dtype,
-                                         param_attr=fluid.ParamAttr(name=self._word_emb_name,
-                                                                    initializer=self._param_initializer),
-                                         is_sparse=False)
-
-        position_emb_out = fluid.layers.embedding(input=position_ids,
-                                                  size=[self._max_position_seq_len, self._emb_size],
-                                                  dtype=self._emb_dtype,
-                                                  param_attr=fluid.ParamAttr(name=self._pos_emb_name,
-                                                                             initializer=self._param_initializer))
-
-        sent_emb_out = fluid.layers.embedding(sentence_ids,
-                                              size=[self._sent_types, self._emb_size],
-                                              dtype=self._emb_dtype,
-                                              param_attr=fluid.ParamAttr(name=self._sent_emb_name,
-                                                                         initializer=self._param_initializer))
-
-        emb_out = emb_out + position_emb_out
-        emb_out = emb_out + sent_emb_out
-
-        if self._use_task_id:
-            task_emb_out = fluid.layers.embedding(task_ids,
-                                                  size=[self._task_types, self._emb_size],
-                                                  dtype=self._emb_dtype,
-                                                  param_attr=fluid.ParamAttr(name=self._task_emb_name,
-                                                                             initializer=self._param_initializer))
-
-            emb_out = emb_out + task_emb_out
-
-        emb_out = pre_process_layer(emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
-
-        if self._dtype == "float16":
-            emb_out = fluid.layers.cast(x=emb_out, dtype=self._dtype)
-            input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
-        self_attn_mask = fluid.layers.matmul(x=input_mask, y=input_mask, transpose_y=True)
-
-        self_attn_mask = fluid.layers.scale(x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
-        n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
-        n_head_self_attn_mask.stop_gradient = True
-
-        self._enc_out = encoder(enc_input=emb_out,
-                                attn_bias=n_head_self_attn_mask,
-                                n_layer=self._n_layer,
-                                n_head=self._n_head,
-                                d_key=self._emb_size // self._n_head,
-                                d_value=self._emb_size // self._n_head,
-                                d_model=self._emb_size,
-                                d_inner_hid=self._emb_size * 4,
-                                prepostprocess_dropout=self._prepostprocess_dropout,
-                                attention_dropout=self._attention_dropout,
-                                relu_dropout=0,
-                                hidden_act=self._hidden_act,
-                                preprocess_cmd="",
-                                postprocess_cmd="dan",
-                                param_initializer=self._param_initializer,
-                                name='encoder')
-
-    def get_sequence_output(self):
-        return self._enc_out
-
-    def get_pooled_output(self):
-        """Get the first feature of each sequence for classification"""
-        next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
-        if self._dtype == "float16":
-            next_sent_feat = fluid.layers.cast(x=next_sent_feat, dtype=self._emb_dtype)
-        next_sent_feat = fluid.layers.fc(input=next_sent_feat,
-                                         size=self._emb_size,
-                                         act="tanh",
-                                         param_attr=fluid.ParamAttr(name="pooled_fc.w_0",
-                                                                    initializer=self._param_initializer),
-                                         bias_attr="pooled_fc.b_0")
-        return next_sent_feat
-
-    def get_lm_output(self, mask_label, mask_pos):
-        """Get the loss & accuracy for pretraining"""
-
-        mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32')
-
-        # extract the first token feature in each sentence
-        self.next_sent_feat = self.get_pooled_output()
-        reshaped_emb_out = fluid.layers.reshape(x=self._enc_out, shape=[-1, self._emb_size])
-        # extract masked tokens' feature
-        mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
-        if self._dtype == "float16":
-            mask_feat = fluid.layers.cast(x=mask_feat, dtype=self._emb_dtype)
-
-        # transform: fc
-        mask_trans_feat = fluid.layers.fc(input=mask_feat,
-                                          size=self._emb_size,
-                                          act=self._hidden_act,
-                                          param_attr=fluid.ParamAttr(name='mask_lm_trans_fc.w_0',
-                                                                     initializer=self._param_initializer),
-                                          bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
-
-        # transform: layer norm
-        mask_trans_feat = fluid.layers.layer_norm(mask_trans_feat,
-                                                  begin_norm_axis=len(mask_trans_feat.shape) - 1,
-                                                  param_attr=fluid.ParamAttr(
-                                                      name='mask_lm_trans_layer_norm_scale',
-                                                      initializer=fluid.initializer.Constant(1.)),
-                                                  bias_attr=fluid.ParamAttr(name='mask_lm_trans_layer_norm_bias',
-                                                                            initializer=fluid.initializer.Constant(1.)))
-        # transform: layer norm
-        #mask_trans_feat = pre_process_layer(
-        #    mask_trans_feat, 'n', name='mask_lm_trans')
-
-        mask_lm_out_bias_attr = fluid.ParamAttr(name="mask_lm_out_fc.b_0",
-                                                initializer=fluid.initializer.Constant(value=0.0))
-        if self._weight_sharing:
-            fc_out = fluid.layers.matmul(x=mask_trans_feat,
-                                         y=fluid.default_main_program().global_block().var(self._word_emb_name),
-                                         transpose_y=True)
-            fc_out += fluid.layers.create_parameter(shape=[self._voc_size],
-                                                    dtype=self._emb_dtype,
-                                                    attr=mask_lm_out_bias_attr,
-                                                    is_bias=True)
-
-        else:
-            fc_out = fluid.layers.fc(input=mask_trans_feat,
-                                     size=self._voc_size,
-                                     param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0",
-                                                                initializer=self._param_initializer),
-                                     bias_attr=mask_lm_out_bias_attr)
-
-        mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
-        mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
-
-        return mean_mask_lm_loss
-
-    def get_task_output(self, task, task_labels):
-        task_fc_out = fluid.layers.fc(input=self.next_sent_feat,
-                                      size=task["num_labels"],
-                                      param_attr=fluid.ParamAttr(name=task["task_name"] + "_fc.w_0",
-                                                                 initializer=self._param_initializer),
-                                      bias_attr=task["task_name"] + "_fc.b_0")
-        task_loss, task_softmax = fluid.layers.softmax_with_cross_entropy(logits=task_fc_out,
-                                                                          label=task_labels,
-                                                                          return_softmax=True)
-        task_acc = fluid.layers.accuracy(input=task_softmax, label=task_labels)
-        mean_task_loss = fluid.layers.mean(task_loss)
-        return mean_task_loss, task_acc
--- a/modules/text/language_model/ernie_v2_eng_large/model/transformer_encoder.py
+++ b/modules/text/language_model/ernie_v2_eng_large/model/transformer_encoder.py
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Transformer encoder."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from functools import partial
-
-import paddle.fluid as fluid
-import paddle.fluid.layers as layers
-
-
-def multi_head_attention(queries,
-                         keys,
-                         values,
-                         attn_bias,
-                         d_key,
-                         d_value,
-                         d_model,
-                         n_head=1,
-                         dropout_rate=0.,
-                         cache=None,
-                         param_initializer=None,
-                         name='multi_head_att'):
-    """
-    Multi-Head Attention. Note that attn_bias is added to the logit before
-    computing softmax activation to mask certain selected positions so that
-    they will not be considered in attention weights.
-    """
-    keys = queries if keys is None else keys
-    values = keys if values is None else values
-
-    if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
-        raise ValueError("Inputs: queries, keys and values should all be 3-D tensors.")
-
-    def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
-        """
-        Add linear projection to queries, keys, and values.
-        """
-        q = layers.fc(input=queries,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_query_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_query_fc.b_0')
-        k = layers.fc(input=keys,
-                      size=d_key * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_key_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_key_fc.b_0')
-        v = layers.fc(input=values,
-                      size=d_value * n_head,
-                      num_flatten_dims=2,
-                      param_attr=fluid.ParamAttr(name=name + '_value_fc.w_0', initializer=param_initializer),
-                      bias_attr=name + '_value_fc.b_0')
-        return q, k, v
-
-    def __split_heads(x, n_head):
-        """
-        Reshape the last dimension of input tensor x so that it becomes two
-        dimensions and then transpose. Specifically, input a tensor with shape
-        [bs, max_sequence_length, n_head * hidden_dim] then output a tensor
-        with shape [bs, n_head, max_sequence_length, hidden_dim].
-        """
-        hidden_size = x.shape[-1]
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
-
-        # permute the dimensions into:
-        # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
-        return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
-
-    def __combine_heads(x):
-        """
-        Transpose and then reshape the last two dimensions of input tensor x
-        so that it becomes one dimension, which is reverse to __split_heads.
-        """
-        if len(x.shape) == 3: return x
-        if len(x.shape) != 4:
-            raise ValueError("Input(x) should be a 4-D Tensor.")
-
-        trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
-        # The value 0 in shape attr means copying the corresponding dimension
-        # size of the input as the output dimension size.
-        return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
-
-    def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
-        """
-        Scaled Dot-Product Attention
-        """
-        scaled_q = layers.scale(x=q, scale=d_key**-0.5)
-        product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
-        if attn_bias:
-            product += attn_bias
-        weights = layers.softmax(product)
-        if dropout_rate:
-            weights = layers.dropout(weights,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-        out = layers.matmul(weights, v)
-        return out
-
-    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
-
-    if cache is not None:  # use cache and concat time steps
-        # Since the inplace reshape in __split_heads changes the shape of k and
-        # v, which is the cache input for next time step, reshape the cache
-        # input from the previous time step first.
-        k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
-        v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
-
-    q = __split_heads(q, n_head)
-    k = __split_heads(k, n_head)
-    v = __split_heads(v, n_head)
-
-    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
-
-    out = __combine_heads(ctx_multiheads)
-
-    # Project back to the model size.
-    proj_out = layers.fc(input=out,
-                         size=d_model,
-                         num_flatten_dims=2,
-                         param_attr=fluid.ParamAttr(name=name + '_output_fc.w_0', initializer=param_initializer),
-                         bias_attr=name + '_output_fc.b_0')
-    return proj_out
-
-
-def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
-    """
-    Position-wise Feed-Forward Networks.
-    This module consists of two linear transformations with a ReLU activation
-    in between, which is applied to each position separately and identically.
-    """
-    hidden = layers.fc(input=x,
-                       size=d_inner_hid,
-                       num_flatten_dims=2,
-                       act=hidden_act,
-                       param_attr=fluid.ParamAttr(name=name + '_fc_0.w_0', initializer=param_initializer),
-                       bias_attr=name + '_fc_0.b_0')
-    if dropout_rate:
-        hidden = layers.dropout(hidden,
-                                dropout_prob=dropout_rate,
-                                dropout_implementation="upscale_in_train",
-                                is_test=False)
-    out = layers.fc(input=hidden,
-                    size=d_hid,
-                    num_flatten_dims=2,
-                    param_attr=fluid.ParamAttr(name=name + '_fc_1.w_0', initializer=param_initializer),
-                    bias_attr=name + '_fc_1.b_0')
-    return out
-
-
-def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
-    """
-    Add residual connection, layer normalization and dropout to the out tensor
-    optionally according to the value of process_cmd.
-    This will be used before or after multi-head attention and position-wise
-    feed-forward networks.
-    """
-    for cmd in process_cmd:
-        if cmd == "a":  # add residual connection
-            out = out + prev_out if prev_out else out
-        elif cmd == "n":  # add layer normalization
-            out_dtype = out.dtype
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float32")
-            out = layers.layer_norm(out,
-                                    begin_norm_axis=len(out.shape) - 1,
-                                    param_attr=fluid.ParamAttr(name=name + '_layer_norm_scale',
-                                                               initializer=fluid.initializer.Constant(1.)),
-                                    bias_attr=fluid.ParamAttr(name=name + '_layer_norm_bias',
-                                                              initializer=fluid.initializer.Constant(0.)))
-            if out_dtype == fluid.core.VarDesc.VarType.FP16:
-                out = layers.cast(x=out, dtype="float16")
-        elif cmd == "d":  # add dropout
-            if dropout_rate:
-                out = layers.dropout(out,
-                                     dropout_prob=dropout_rate,
-                                     dropout_implementation="upscale_in_train",
-                                     is_test=False)
-    return out
-
-
-pre_process_layer = partial(pre_post_process_layer, None)
-post_process_layer = pre_post_process_layer
-
-
-def encoder_layer(enc_input,
-                  attn_bias,
-                  n_head,
-                  d_key,
-                  d_value,
-                  d_model,
-                  d_inner_hid,
-                  prepostprocess_dropout,
-                  attention_dropout,
-                  relu_dropout,
-                  hidden_act,
-                  preprocess_cmd="n",
-                  postprocess_cmd="da",
-                  param_initializer=None,
-                  name=''):
-    """The encoder layers that can be stacked to form a deep encoder.
-    This module consists of a multi-head (self) attention followed by
-    position-wise feed-forward networks, with both components accompanied
-    by the post_process_layer to add residual connection, layer normalization
-    and dropout.
-    """
-    attn_output = multi_head_attention(pre_process_layer(enc_input,
-                                                         preprocess_cmd,
-                                                         prepostprocess_dropout,
-                                                         name=name + '_pre_att'),
-                                       None,
-                                       None,
-                                       attn_bias,
-                                       d_key,
-                                       d_value,
-                                       d_model,
-                                       n_head,
-                                       attention_dropout,
-                                       param_initializer=param_initializer,
-                                       name=name + '_multi_head_att')
-    attn_output = post_process_layer(enc_input,
-                                     attn_output,
-                                     postprocess_cmd,
-                                     prepostprocess_dropout,
-                                     name=name + '_post_att')
-    ffd_output = positionwise_feed_forward(pre_process_layer(attn_output,
-                                                             preprocess_cmd,
-                                                             prepostprocess_dropout,
-                                                             name=name + '_pre_ffn'),
-                                           d_inner_hid,
-                                           d_model,
-                                           relu_dropout,
-                                           hidden_act,
-                                           param_initializer=param_initializer,
-                                           name=name + '_ffn')
-    return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
-
-
-def encoder(enc_input,
-            attn_bias,
-            n_layer,
-            n_head,
-            d_key,
-            d_value,
-            d_model,
-            d_inner_hid,
-            prepostprocess_dropout,
-            attention_dropout,
-            relu_dropout,
-            hidden_act,
-            preprocess_cmd="n",
-            postprocess_cmd="da",
-            param_initializer=None,
-            name=''):
-    """
-    The encoder is composed of a stack of identical layers returned by calling
-    encoder_layer.
-    """
-    for i in range(n_layer):
-        enc_output = encoder_layer(enc_input,
-                                   attn_bias,
-                                   n_head,
-                                   d_key,
-                                   d_value,
-                                   d_model,
-                                   d_inner_hid,
-                                   prepostprocess_dropout,
-                                   attention_dropout,
-                                   relu_dropout,
-                                   hidden_act,
-                                   preprocess_cmd,
-                                   postprocess_cmd,
-                                   param_initializer=param_initializer,
-                                   name=name + '_layer_' + str(i))
-        enc_input = enc_output
-    enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
-
-    return enc_output
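
The removed static-graph code masks padding by adding a bias to the attention logits before softmax: `ernie.py` takes the outer product of `input_mask` with itself and applies `scale(x, scale=10000.0, bias=-1.0, bias_after_scale=False)`, that is `(x - 1) * 10000`. The following is a minimal sketch of that idea in plain Python for a single sequence and a single head; the function names are illustrative and none of this is the Paddle implementation.

```python
import math


def attention_bias(input_mask):
    """Additive bias for one sequence: 0.0 where both positions are real
    tokens, -10000.0 where either position is padding."""
    return [[(mi * mj - 1.0) * 10000.0 for mj in input_mask] for mi in input_mask]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_weights(scores, input_mask):
    """Add the bias to raw attention scores, then normalize each row."""
    bias = attention_bias(input_mask)
    return [
        softmax([s + b for s, b in zip(score_row, bias_row)])
        for score_row, bias_row in zip(scores, bias)
    ]


# Two real tokens followed by one padding position, with all-zero raw scores.
weights = masked_weights([[0.0] * 3 for _ in range(3)], [1.0, 1.0, 0.0])
print(weights[0])
```

Rows for real tokens place essentially no weight on the padded position. The row for the padded query itself comes out uniform, because every entry in it receives the same bias.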
--- a/modules/text/language_model/ernie_v2_eng_large/module.py
+++ b/modules/text/language_model/ernie_v2_eng_large/module.py
-# coding:utf-8
-# Copyright (c) 2019  PaddlePaddle Authors. All Rights Reserved.
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
 #
-# Licensed under the Apache License, Version 2.0 (the "License"
+# Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
@@ -12,64 +11,211 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
+from typing import Dict, List, Optional, Union, Tuple
 import os

-from paddlehub import TransformerModule
-from paddlehub.module.module import moduleinfo
+from paddle.dataset.common import DATA_HOME
+import paddle
+import paddle.nn as nn
+import paddle.nn.functional as F

-from ernie_v2_eng_large.model.ernie import ErnieModel, ErnieConfig
+from paddlehub import BertTokenizer
+from paddlehub.module.modeling_ernie import ErnieModel, ErnieForSequenceClassification
+from paddlehub.module.module import moduleinfo, serving
+from paddlehub.utils.log import logger
+from paddlehub.utils.utils import download


 @moduleinfo(
    name="ernie_v2_eng_large",
-    version="1.1.0",
+    version="2.0.0",
    summary=
-    "Baidu's ERNIE 2.0, Enhanced Representation through kNowledge IntEgration, A Continual Pre-training Framework for Language Understanding. 12-layer, 768-hidden, 12-heads, 110M parameters.",
-    author="baidu-nlp",
+    "Baidu's ERNIE 2.0, Enhanced Representation through kNowledge IntEgration, max_seq_len=512 when pretrained. The module is executed as paddle.dygraph.",
+    author="paddlepaddle",
    author_email="",
-    type="nlp/semantic_model",
-)
-class ErnieV2EngLarge(TransformerModule):
-    def _initialize(self):
-        ernie_config_path = os.path.join(self.directory, "assets", "ernie_config.json")
-        self.ernie_config = ErnieConfig(ernie_config_path)
-        self.MAX_SEQ_LEN = 512
-        self.params_path = os.path.join(self.directory, "assets", "params")
-        self.vocab_path = os.path.join(self.directory, "assets", "vocab.txt")\
-
-    def net(self, input_ids, position_ids, segment_ids, input_mask):
+    type="nlp/semantic_model")
+class ErnieV2(nn.Layer):
+    """
+    Ernie model
+    """
+
+    def __init__(
+            self,
+            task=None,
+            load_checkpoint=None,
+            label_map=None,
+    ):
+        super(ErnieV2, self).__init__()
+        # TODO(zhangxuefei): add token_classification task
+        if task == 'sequence_classification':
+            self.model = ErnieForSequenceClassification.from_pretrained(
+                pretrained_model_name_or_path='ernie_v2_eng_large')
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = paddle.metric.Accuracy(name='acc_accumulation')
+        elif task is None:
+            self.model = ErnieModel.from_pretrained(pretrained_model_name_or_path='ernie_v2_eng_large')
+        else:
+            raise RuntimeError("Unknown task %s, task should be sequence_classification" % task)
+
+        self.task = task
+        self.label_map = label_map
+
+        if load_checkpoint is not None and os.path.isfile(load_checkpoint):
+            state_dict = paddle.load(load_checkpoint)
+            self.set_state_dict(state_dict)
+            logger.info('Loaded parameters from %s' % os.path.abspath(load_checkpoint))
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, labels=None):
+        result = self.model(input_ids, token_type_ids, position_ids, attention_mask)
+        if self.task is not None:
+            logits = result
+            probs = F.softmax(logits, axis=1)
+            if labels is not None:
+                loss = self.criterion(logits, labels)
+                correct = self.metric.compute(probs, labels)
+                acc = self.metric.update(correct)
+                return probs, loss, acc
+            return probs
+        else:
+            sequence_output, pooled_output = result
+            return sequence_output, pooled_output
+
+    def get_vocab_path(self):
+        """
+        Gets the path of the module vocabulary file, downloading it first if it is missing.
+        """
+        # Cache the vocabulary in a module-specific directory so it cannot collide with other ERNIE modules.
+        save_path = os.path.join(DATA_HOME, 'ernie_v2_eng_large', 'vocab.txt')
+        if not os.path.exists(save_path) or not os.path.isfile(save_path):
+            url = "https://paddlenlp.bj.bcebos.com/models/transformers/ernie_v2_eng_large/vocab.txt"
+            download(url, os.path.join(DATA_HOME, 'ernie_v2_eng_large'))
+        return save_path
+
+    def get_tokenizer(self, tokenize_chinese_chars=True):
+        """
+        Gets the tokenizer that is customized for this module.
+        Args:
+            tokenize_chinese_chars (:obj: bool , defaults to :obj: True):
+                Whether to tokenize chinese characters or not.
+        Returns:
+            tokenizer (:obj:BertTokenizer) : The tokenizer which was customized for this module.
+        """
+        return BertTokenizer(tokenize_chinese_chars=tokenize_chinese_chars, vocab_file=self.get_vocab_path())
+
+    def training_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        Runs one training step: a forward pass that also computes the loss and the accuracy.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data holding the model inputs,
+                here input_ids, token_type_ids and labels, in that order.
+            batch_idx(int): The index of batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as loss and metrics.
        """
-        create neural network.
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'loss': avg_loss, 'metrics': {'acc': acc}}
+
+    def validation_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        Runs one validation step: a forward pass that computes the accuracy.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data holding the model inputs,
+                here input_ids, token_type_ids and labels, in that order.
+            batch_idx(int): The index of batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as metrics.
+        """
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'metrics': {'acc': acc}}
+
+    def predict(self, data, max_seq_len=128, batch_size=1, use_gpu=False):
+        """
+        Predicts the data labels.

        Args:
-            input_ids (tensor): the word ids.
-            position_ids (tensor): the position ids.
-            segment_ids (tensor): the segment ids.
-            input_mask (tensor): the padding mask.
+            data (obj:`List[List[str]]`): The data to predict; each element is a list holding one raw text or a pair of raw texts.
+            max_seq_len (:obj:`int`, `optional`, defaults to 128):
+                If set to a number, will limit the total sequence returned so that it has a maximum length.
+            batch_size(obj:`int`, defaults to 1): The batch size.
+            use_gpu(obj:`bool`, defaults to `False`): Whether to use gpu to run or not.

        Returns:
-            pooled_output (tensor):  sentence-level output for classification task.
-            sequence_output (tensor): token-level output for sequence task.
+            results(obj:`list`): The predicted labels, one per input sample.
        """
-        self.ernie_config._config_dict['use_task_id'] = False
-        ernie = ErnieModel(src_ids=input_ids,
-                           position_ids=position_ids,
-                           sentence_ids=segment_ids,
-                           task_ids=None,
-                           input_mask=input_mask,
-                           config=self.ernie_config,
-                           use_fp16=False)
-        pooled_output = ernie.get_pooled_output()
-        sequence_output = ernie.get_sequence_output()
-        return pooled_output, sequence_output
-
-
-if __name__ == '__main__':
-    test_module = ErnieV2EngLarge()
+        # TODO(zhangxuefei): add task token_classification task predict.
+        if self.task not in ['sequence_classification']:
+            raise RuntimeError("The predict method is for sequence_classification task, but got task %s." % self.task)
+
+        paddle.set_device('gpu' if use_gpu else 'cpu')
+        tokenizer = self.get_tokenizer()
+
+        examples = []
+        for text in data:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, max_seq_len=max_seq_len)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], max_seq_len=max_seq_len)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequences, but got %d. Please check your inputs.' % len(text))
+            examples.append((encoded_inputs['input_ids'], encoded_inputs['segment_ids']))
+
+        def _batchify_fn(batch):
+            input_ids = [entry[0] for entry in batch]
+            segment_ids = [entry[1] for entry in batch]
+            return input_ids, segment_ids
+
+        # Separates the data into batches.
+        batches = []
+        one_batch = []
+        for example in examples:
+            one_batch.append(example)
+            if len(one_batch) == batch_size:
+                batches.append(one_batch)
+                one_batch = []
+        if one_batch:
+            # The last batch whose size is less than the config batch_size setting.
+            batches.append(one_batch)
+
+        results = []
+        self.eval()
+        for batch in batches:
+            input_ids, segment_ids = _batchify_fn(batch)
+            input_ids = paddle.to_tensor(input_ids)
+            segment_ids = paddle.to_tensor(segment_ids)
+
+            # TODO(zhangxuefei): add task token_classification postprocess after prediction.
+            if self.task == 'sequence_classification':
+                probs = self(input_ids, segment_ids)
+                idx = paddle.argmax(probs, axis=1).numpy()
+                idx = idx.tolist()
+                labels = [self.label_map[i] for i in idx]
+                results.extend(labels)
+
+        return results
+
+    @serving
+    def get_embedding(self, texts, use_gpu=False):
+        if self.task is not None:
+            raise RuntimeError("The get_embedding method is only valid when task is None, but got task %s" % self.task)
+
+        paddle.set_device('gpu' if use_gpu else 'cpu')
+
+        tokenizer = self.get_tokenizer()
+        results = []
+        for text in texts:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, pad_to_max_seq_len=False)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], pad_to_max_seq_len=False)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequences, but got %d. Please check your inputs.' % len(text))
+
+            input_ids = paddle.to_tensor(encoded_inputs['input_ids']).unsqueeze(0)
+            segment_ids = paddle.to_tensor(encoded_inputs['segment_ids']).unsqueeze(0)
+            sequence_output, pooled_output = self(input_ids, segment_ids)
+
+            sequence_output = sequence_output.squeeze(0)
+            pooled_output = pooled_output.squeeze(0)
+            results.append((sequence_output.numpy().tolist(), pooled_output.numpy().tolist()))
+        return results
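
The batching loop in `predict` above groups the encoded examples into fixed-size batches and keeps a smaller final batch when the data does not divide evenly. The following is a minimal standalone sketch of that logic, not code from this commit; the helper name `split_into_batches` and the `batch_size` validation are assumptions added for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(examples: Sequence[T], batch_size: int) -> List[List[T]]:
    """Group examples into consecutive batches; the last batch may be smaller."""
    # Hypothetical guard: the module code does not validate batch_size.
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    batches: List[List[T]] = []
    one_batch: List[T] = []
    for example in examples:
        one_batch.append(example)
        if len(one_batch) == batch_size:
            batches.append(one_batch)
            one_batch = []
    if one_batch:
        # Keep the trailing batch that holds fewer than batch_size examples.
        batches.append(one_batch)
    return batches
```

With five examples and `batch_size=2`, the sketch yields two full batches and one batch of a single example, which is why the prediction loop has to tolerate batches of differing size.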
--- a/modules/text/language_model/roberta-wwm-ext-large/README.md
+++ b/modules/text/language_model/roberta-wwm-ext-large/README.md
+```shell
+$ hub install roberta-wwm-ext-large==2.0.0
+```
+<p align="center">
+<img src="https://bj.bcebos.com/paddlehub/paddlehub-img/bert_network.png"  hspace='10'/> <br />
+</p>
+
+For more details, see the [RoBERTa paper](https://arxiv.org/abs/1907.11692) and the [Chinese-BERT-wwm technical report](https://arxiv.org/abs/1906.08101).
+
+## API
+
+```python
+def __init__(
+    task=None,
+    load_checkpoint=None,
+    label_map=None)
+```
+
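+Creates the Module object (dynamic graph version).
+
+**Parameters**
+
+* `task`: the task name. `sequence_classification` is supported; `None` (the default) loads the pretrained model without a task head.
+* `load_checkpoint`: path to a model parameter file saved by training with the PaddleHub Fine-tune API.
+* `label_map`: the mapping from class index to label, used at prediction time.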
+
+```python
+def predict(
+    data,
+    max_seq_len=128,
+    batch_size=1,
+    use_gpu=False)
+```
+
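+**Parameters**
+
+* `data`: the data to predict, formatted as \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample
+    and each sample may contain text\_a and text\_b. The number of texts per sample (1 or 2) must match what was used in training.
+* `max_seq_len`: the maximum sequence length the model processes.
+* `batch_size`: the batch size used for prediction.
+* `use_gpu`: whether to use the GPU; defaults to False. GPU users are advised to set use_gpu to True.
+
+**Returns**
+
+* `results`: list type; the predicted label of each sample, looked up in `label_map`.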
+
+```python
+def get_embedding(
+    texts,
+    use_gpu=False
+)
+```
+
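+Gets the sentence-level and token-level features of the input texts.
+
+**Parameters**
+
+* `texts`: the list of input texts, formatted as \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b.
+* `use_gpu`: whether to use the GPU; defaults to False. GPU users are advised to set use_gpu to True.
+
+**Returns**
+
+* `results`: list type, formatted as \[\[sample\_a\_seq\_feature, sample\_a\_pooled\_feature\], \[sample\_b\_seq\_feature, sample\_b\_pooled\_feature\],…,\], where each element is the feature output of the corresponding sample: the token-level feature seq\_feature followed by the sentence-level feature pooled\_feature.
+
+
+**Code example**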
+
+```python
+import paddlehub as hub
+
+# Each sample is a list holding one text, or two texts for sentence-pair tasks.
+data = [
+    ['这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般'],
+    ['怀着十分激动的心情放映，可是看着看着发现，在放映完毕后，出现一集米老鼠的动画片'],
+    ['作为老的四星酒店，房间依然很整洁，相当不错。机场接机服务很好，可以在车上办理入住手续，节省时间。'],
+]
+label_map = {0: 'negative', 1: 'positive'}
+
+model = hub.Module(
+    name='roberta-wwm-ext-large',
+    version='2.0.0',
+    task='sequence_classification',
+    load_checkpoint='/path/to/parameters',
+    label_map=label_map)
+results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False)
+for idx, text in enumerate(data):
+    print('Data: {} \t Label: {}'.format(text, results[idx]))
+```
+
+See the PaddleHub text classification demo: https://github.com/PaddlePaddle/PaddleHub/tree/release/v2.0.0-beta/demo/text_classification
+
+## Serving deployment
+
+PaddleHub Serving can deploy an online service for obtaining pretrained word embeddings.
+
+### Step1: Start PaddleHub Serving
+
+Run the start command:
+
+```shell
+$ hub serving start -m roberta-wwm-ext-large
+```
+
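+This deploys a service API for obtaining pretrained word embeddings; the default port is 8866.
+
+**NOTE:** To predict on a GPU, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise it does not need to be set.
+
+### Step2: Send a prediction request
+
+Once the server is configured, the few lines of code below send a prediction request and fetch the result.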
+
+```python
+import requests
+import json
+
+# Specify the texts to predict and build the dict {"texts": [text_1, text_2, ... ]}
+text = [["今天是个好日子", "天气预报说今天要下雨"], ["这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般"]]
+# The key names the parameter of the prediction method that receives the texts; here it is "texts"
+# The equivalent local call is module.get_embedding(texts=text)
+data = {"texts": text}
+# Send a POST request; the content type should be JSON
+url = "http://10.12.121.132:8866/predict/roberta-wwm-ext-large"
+# Set the POST request headers to application/json
+headers = {"Content-Type": "application/json"}
+
+r = requests.post(url=url, headers=headers, data=json.dumps(data))
+print(r.json())
+```
+
+## View the code
+
+https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/pretrain_langauge_models/BERT
+
+
+## Dependencies
+
+paddlepaddle >= 2.0.0
+
+paddlehub >= 2.0.0
+
+## Release notes
+
+* 1.0.0
+
+  Initial release.
+
+* 2.0.0
+
+  Fully upgraded to dynamic graph; the interfaces have changed.
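
The serving request in the README above posts a JSON body whose `"texts"` key carries a list of samples, each holding one or two texts. A minimal sketch of building and checking that body before sending it is shown here; the helper name `build_embedding_request` is hypothetical, and the length check only mirrors the one-or-two-texts rule enforced inside `get_embedding`.

```python
import json
from typing import List


def build_embedding_request(samples: List[List[str]]) -> str:
    """Serialize samples into a JSON body under the "texts" key."""
    for sample in samples:
        # Mirrors the module-side rule: each sample holds one or two texts.
        if len(sample) not in (1, 2):
            raise ValueError(
                "each sample must contain one or two texts, got %d" % len(sample))
    # ensure_ascii=False keeps Chinese text readable in the serialized body.
    return json.dumps({"texts": samples}, ensure_ascii=False)
```

The returned string can be passed as the `data` argument of `requests.post`, in place of the inline `json.dumps(data)` call; failing early on a malformed sample avoids a round trip that would end in a server-side `RuntimeError`.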
--- a/modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/model/__init__.py
+++ b/modules/text/language_model/bert_multi_cased_L_12_H_768_A_12/model/__init__.py
--- a/modules/text/language_model/roberta-wwm-ext-large/module.py
+++ b/modules/text/language_model/roberta-wwm-ext-large/module.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import Dict, List, Optional, Union, Tuple
+import os
+
+from paddle.dataset.common import DATA_HOME
+import paddle
+import paddle.nn as nn
+import paddle.nn.functional as F
+
+from paddlehub import BertTokenizer
+from paddlehub.module.modeling_roberta import RobertaModel, RobertaForSequenceClassification
+from paddlehub.module.module import moduleinfo, serving
+from paddlehub.utils.log import logger
+from paddlehub.utils.utils import download
+
+
+@moduleinfo(
+    name="roberta-wwm-ext-large",
+    version="2.0.0",
+    summary=
+    "chinese-roberta-wwm-ext-large, 24-layer, 1024-hidden, 16-heads, 340M parameters. The module is executed as paddle.dygraph.",
+    author="ymcui",
+    author_email="ymcui@ir.hit.edu.cn",
+    type="nlp/semantic_model",
+)
+class Roberta(nn.Layer):
+    """
+    RoBERTa model
+    """
+
+    def __init__(
+            self,
+            task=None,
+            load_checkpoint=None,
+            label_map=None,
+    ):
+        super(Roberta, self).__init__()
+        # TODO(zhangxuefei): add token_classification task
+        if task == 'sequence_classification':
+            self.model = RobertaForSequenceClassification.from_pretrained(
+                pretrained_model_name_or_path='roberta-wwm-ext-large')
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = paddle.metric.Accuracy(name='acc_accumulation')
+        elif task is None:
+            self.model = RobertaModel.from_pretrained(pretrained_model_name_or_path='roberta-wwm-ext-large')
+        else:
+            raise RuntimeError("Unknown task %s, task should be sequence_classification" % task)
+
+        self.task = task
+        self.label_map = label_map
+
+        if load_checkpoint is not None and os.path.isfile(load_checkpoint):
+            state_dict = paddle.load(load_checkpoint)
+            self.set_state_dict(state_dict)
+            logger.info('Loaded parameters from %s' % os.path.abspath(load_checkpoint))
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, labels=None):
+        result = self.model(input_ids, token_type_ids, position_ids, attention_mask)
+        if self.task is not None:
+            logits = result
+            probs = F.softmax(logits, axis=1)
+            if labels is not None:
+                loss = self.criterion(logits, labels)
+                correct = self.metric.compute(probs, labels)
+                acc = self.metric.update(correct)
+                return probs, loss, acc
+            return probs
+        else:
+            sequence_output, pooled_output = result
+            return sequence_output, pooled_output
+
+    def get_vocab_path(self):
+        """
+        Gets the path of the module vocabulary file, downloading it first if it is missing.
+        """
+        save_path = os.path.join(DATA_HOME, 'roberta-wwm-ext-large', 'vocab.txt')
+        if not os.path.exists(save_path) or not os.path.isfile(save_path):
+            url = "https://paddlenlp.bj.bcebos.com/models/transformers/roberta_large/vocab.txt"
+            download(url, os.path.join(DATA_HOME, 'roberta-wwm-ext-large'))
+        return save_path
+
+    def get_tokenizer(self, tokenize_chinese_chars=True):
+        """
+        Gets the tokenizer that is customized for this module.
+        Args:
+            tokenize_chinese_chars (:obj: bool , defaults to :obj: True):
+                Whether to tokenize chinese characters or not.
+        Returns:
+            tokenizer (:obj:BertTokenizer) : The tokenizer which was customized for this module.
+        """
+        return BertTokenizer(tokenize_chinese_chars=tokenize_chinese_chars, vocab_file=self.get_vocab_path())
+
+    def training_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        Runs one training step: a forward pass that also computes the loss and the accuracy.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data holding the model inputs,
+                here input_ids, token_type_ids and labels, in that order.
+            batch_idx(int): The index of batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as loss and metrics.
+        """
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'loss': avg_loss, 'metrics': {'acc': acc}}
+
+    def validation_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        Runs one validation step: a forward pass that computes the accuracy.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data holding the model inputs,
+                here input_ids, token_type_ids and labels, in that order.
+            batch_idx(int): The index of batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as metrics.
+        """
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'metrics': {'acc': acc}}
+
+    def predict(self, data, max_seq_len=128, batch_size=1, use_gpu=False):
+        """
+        Predicts the data labels.
+
+        Args:
+            data (obj:`List[List[str]]`): The data to predict; each element is a list holding one raw text or a pair of raw texts.
+            max_seq_len (:obj:`int`, `optional`, defaults to 128):
+                If set to a number, will limit the total sequence returned so that it has a maximum length.
+            batch_size(obj:`int`, defaults to 1): The batch size.
+            use_gpu(obj:`bool`, defaults to `False`): Whether to use gpu to run or not.
+
+        Returns:
+            results(obj:`list`): The predicted labels, one per input sample.
+        """
+        # TODO(zhangxuefei): add task token_classification task predict.
+        if self.task not in ['sequence_classification']:
+            raise RuntimeError("The predict method is for sequence_classification task, but got task %s." % self.task)
+
+        paddle.set_device('gpu' if use_gpu else 'cpu')
+        tokenizer = self.get_tokenizer()
+
+        examples = []
+        for text in data:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, max_seq_len=max_seq_len)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], max_seq_len=max_seq_len)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequences, but got %d. Please check your inputs.' % len(text))
+            examples.append((encoded_inputs['input_ids'], encoded_inputs['segment_ids']))
+
+        def _batchify_fn(batch):
+            input_ids = [entry[0] for entry in batch]
+            segment_ids = [entry[1] for entry in batch]
+            return input_ids, segment_ids
+
+        # Separates the data into batches.
+        batches = []
+        one_batch = []
+        for example in examples:
+            one_batch.append(example)
+            if len(one_batch) == batch_size:
+                batches.append(one_batch)
+                one_batch = []
+        if one_batch:
+            # The last batch whose size is less than the config batch_size setting.
+            batches.append(one_batch)
+
+        results = []
+        self.eval()
+        for batch in batches:
+            input_ids, segment_ids = _batchify_fn(batch)
+            input_ids = paddle.to_tensor(input_ids)
+            segment_ids = paddle.to_tensor(segment_ids)
+
+            # TODO(zhangxuefei): add task token_classification postprocess after prediction.
+            if self.task == 'sequence_classification':
+                probs = self(input_ids, segment_ids)
+                idx = paddle.argmax(probs, axis=1).numpy()
+                idx = idx.tolist()
+                labels = [self.label_map[i] for i in idx]
+                results.extend(labels)
+
+        return results
+
+    @serving
+    def get_embedding(self, texts, use_gpu=False):
+        if self.task is not None:
+            raise RuntimeError("The get_embedding method is only valid when task is None, but got task %s" % self.task)
+
+        paddle.set_device('gpu' if use_gpu else 'cpu')
+
+        tokenizer = self.get_tokenizer()
+        results = []
+        for text in texts:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, pad_to_max_seq_len=False)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], pad_to_max_seq_len=False)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequences, but got %d. Please check your inputs.' % len(text))
+
+            input_ids = paddle.to_tensor(encoded_inputs['input_ids']).unsqueeze(0)
+            segment_ids = paddle.to_tensor(encoded_inputs['segment_ids']).unsqueeze(0)
+            sequence_output, pooled_output = self(input_ids, segment_ids)
+
+            sequence_output = sequence_output.squeeze(0)
+            pooled_output = pooled_output.squeeze(0)
+            results.append((sequence_output.numpy().tolist(), pooled_output.numpy().tolist()))
+        return results
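
In `predict` above, the classifier output is turned into labels in three steps: softmax over the logits, argmax over the class axis, then a lookup in `label_map`. The sketch below reproduces those steps in plain Python so they can be followed without Paddle installed; the function names `softmax` and `logits_to_labels` are illustrative assumptions, and the module itself uses `F.softmax` and `paddle.argmax` on tensors.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_labels(batch_logits: List[List[float]],
                     label_map: Dict[int, str]) -> List[str]:
    """Map each row of logits to the label of its most probable class."""
    labels = []
    for logits in batch_logits:
        probs = softmax(logits)
        # argmax: index of the largest probability (first one on ties).
        idx = max(range(len(probs)), key=probs.__getitem__)
        labels.append(label_map[idx])
    return labels
```

Because softmax preserves the ordering of the logits, the argmax of the probabilities equals the argmax of the raw logits; the softmax matters only when the probabilities themselves are reported. The lookup also shows why `label_map` must be supplied for prediction: with the default `label_map=None`, indexing would fail.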
--- a/modules/text/language_model/roberta-wwm-ext/README.md
+++ b/modules/text/language_model/roberta-wwm-ext/README.md
+```shell
+$ hub install roberta-wwm-ext==2.0.0
+```
+<p align="center">
+<img src="https://bj.bcebos.com/paddlehub/paddlehub-img/bert_network.png"  hspace='10'/> <br />
+</p>
+
+For more details, see the [RoBERTa paper](https://arxiv.org/abs/1907.11692) and the [Chinese-BERT-wwm technical report](https://arxiv.org/abs/1906.08101).
+
+## API
+
+```python
+def __init__(
+    task=None,
+    load_checkpoint=None,
+    label_map=None)
+```
+
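+Creates the Module object (dynamic graph version).
+
+**Parameters**
+
+* `task`: the task name. `sequence_classification` is supported; `None` (the default) loads the pretrained model without a task head.
+* `load_checkpoint`: path to a model parameter file saved by training with the PaddleHub Fine-tune API.
+* `label_map`: the mapping from class index to label, used at prediction time.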
+
+```python
+def predict(
+    data,
+    max_seq_len=128,
+    batch_size=1,
+    use_gpu=False)
+```
+
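+**Parameters**
+
+* `data`: the data to predict, formatted as \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample
+    and each sample may contain text\_a and text\_b. The number of texts per sample (1 or 2) must match what was used in training.
+* `max_seq_len`: the maximum sequence length the model processes.
+* `batch_size`: the batch size used for prediction.
+* `use_gpu`: whether to use the GPU; defaults to False. GPU users are advised to set use_gpu to True.
+
+**Returns**
+
+* `results`: list type; the predicted label of each sample, looked up in `label_map`.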
+
+```python
+def get_embedding(
+    texts,
+    use_gpu=False
+)
+```
+
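+Gets the sentence-level and token-level features of the input texts.
+
+**Parameters**
+
+* `texts`: the list of input texts, formatted as \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b.
+* `use_gpu`: whether to use the GPU; defaults to False. GPU users are advised to set use_gpu to True.
+
+**Returns**
+
+* `results`: list type, formatted as \[\[sample\_a\_seq\_feature, sample\_a\_pooled\_feature\], \[sample\_b\_seq\_feature, sample\_b\_pooled\_feature\],…,\], where each element is the feature output of the corresponding sample: the token-level feature seq\_feature followed by the sentence-level feature pooled\_feature.
+
+
+**Code example**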
+
+```python
+import paddlehub as hub
+
+# Each sample is a list holding one text, or two texts for sentence-pair tasks.
+data = [
+    ['这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般'],
+    ['怀着十分激动的心情放映，可是看着看着发现，在放映完毕后，出现一集米老鼠的动画片'],
+    ['作为老的四星酒店，房间依然很整洁，相当不错。机场接机服务很好，可以在车上办理入住手续，节省时间。'],
+]
+label_map = {0: 'negative', 1: 'positive'}
+
+model = hub.Module(
+    name='roberta-wwm-ext',
+    version='2.0.0',
+    task='sequence_classification',
+    load_checkpoint='/path/to/parameters',
+    label_map=label_map)
+results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False)
+for idx, text in enumerate(data):
+    print('Data: {} \t Label: {}'.format(text, results[idx]))
+```
+
+See the PaddleHub text classification demo: https://github.com/PaddlePaddle/PaddleHub/tree/release/v2.0.0-beta/demo/text_classification
+
+## Serving deployment
+
+PaddleHub Serving can deploy an online service for obtaining pretrained word embeddings.
+
+### Step1: Start PaddleHub Serving
+
+Run the start command:
+
+```shell
+$ hub serving start -m roberta-wwm-ext
+```
+
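+This deploys a service API for obtaining pretrained word embeddings; the default port is 8866.
+
+**NOTE:** To predict on a GPU, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise it does not need to be set.
+
+### Step2: Send a prediction request
+
+Once the server is configured, the few lines of code below send a prediction request and fetch the result.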
+
+```python
+import requests
+import json
+
+# Specify the texts to predict and build the dict {"texts": [text_1, text_2, ... ]}
+text = [["今天是个好日子", "天气预报说今天要下雨"], ["这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般"]]
+# The key names the parameter of the prediction method that receives the texts; here it is "texts"
+# The equivalent local call is module.get_embedding(texts=text)
+data = {"texts": text}
+# Send a POST request; the content type should be JSON
+url = "http://10.12.121.132:8866/predict/roberta-wwm-ext"
+# Set the POST request headers to application/json
+headers = {"Content-Type": "application/json"}
+
+r = requests.post(url=url, headers=headers, data=json.dumps(data))
+print(r.json())
+```
+
+## View the code
+
+https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/pretrain_langauge_models/BERT
+
+
+## Dependencies
+
+paddlepaddle >= 2.0.0
+
+paddlehub >= 2.0.0
+
+## Release notes
+
+* 1.0.0
+
+  Initial release.
+
+* 2.0.0
+
+  Fully upgraded to dynamic graph; the interfaces have changed.
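
The README above documents `predict` as taking a list of samples, each a list of one or two texts, so a flat list of strings has to be wrapped first. A minimal sketch of that normalization follows; the helper name `normalize_samples` is hypothetical and is not provided by PaddleHub.

```python
from typing import List, Sequence, Union


def normalize_samples(data: Sequence[Union[str, Sequence[str]]]) -> List[List[str]]:
    """Wrap bare strings as single-text samples and validate sample sizes."""
    samples: List[List[str]] = []
    for item in data:
        # A bare string becomes a one-text sample; lists and tuples are copied.
        sample = [item] if isinstance(item, str) else list(item)
        if len(sample) not in (1, 2):
            raise ValueError(
                "each sample must contain one or two texts, got %d" % len(sample))
        samples.append(sample)
    return samples
```

Passing `normalize_samples(data)` to `model.predict` instead of the raw list keeps single-sentence and sentence-pair inputs in the same shape; without the wrapping, `len(text)` on a bare string counts characters rather than texts.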
--- a/modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/__init__.py
+++ b/modules/text/language_model/bert_multi_uncased_L_12_H_768_A_12/__init__.py
--- a/modules/text/language_model/roberta-wwm-ext/module.py
+++ b/modules/text/language_model/roberta-wwm-ext/module.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import Dict, List, Optional, Union, Tuple
+import os
+
+from paddle.dataset.common import DATA_HOME
+import paddle
+import paddle.nn as nn
+import paddle.nn.functional as F
+
+from paddlehub import BertTokenizer
+from paddlehub.module.modeling_roberta import RobertaModel, RobertaForSequenceClassification
+from paddlehub.module.module import moduleinfo, serving
+from paddlehub.utils.log import logger
+from paddlehub.utils.utils import download
+
+
+@moduleinfo(
+    name="roberta-wwm-ext",
+    version="2.0.0",
+    summary=
+    "chinese-roberta-wwm-ext, 12-layer, 768-hidden, 12-heads, 110M parameters.  The module is executed as paddle.dygraph.",
+    author="ymcui",
+    author_email="ymcui@ir.hit.edu.cn",
+    type="nlp/semantic_model",
+)
+class Roberta(nn.Layer):
+    """
+    RoBERTa model
+    """
+
+    def __init__(
+            self,
+            task=None,
+            load_checkpoint=None,
+            label_map=None,
+    ):
+        super(Roberta, self).__init__()
+        # TODO(zhangxuefei): add token_classification task
+        if task == 'sequence_classification':
+            self.model = RobertaForSequenceClassification.from_pretrained(
+                pretrained_model_name_or_path='roberta-wwm-ext')
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = paddle.metric.Accuracy(name='acc_accumulation')
+        elif task is None:
+            self.model = RobertaModel.from_pretrained(pretrained_model_name_or_path='roberta-wwm-ext')
+        else:
+            raise RuntimeError("Unknown task %s, task should be sequence_classification" % task)
+
+        self.task = task
+        self.label_map = label_map
+
+        if load_checkpoint is not None and os.path.isfile(load_checkpoint):
+            state_dict = paddle.load(load_checkpoint)
+            self.set_state_dict(state_dict)
+            logger.info('Loaded parameters from %s' % os.path.abspath(load_checkpoint))
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, labels=None):
+        result = self.model(input_ids, token_type_ids, position_ids, attention_mask)
+        if self.task is not None:
+            logits = result
+            probs = F.softmax(logits, axis=1)
+            if labels is not None:
+                loss = self.criterion(logits, labels)
+                correct = self.metric.compute(probs, labels)
+                acc = self.metric.update(correct)
+                return probs, loss, acc
+            return probs
+        else:
+            sequence_output, pooled_output = result
+            return sequence_output, pooled_output
+
+    def get_vocab_path(self):
+        """
+        Gets the path of the module vocabulary file, downloading it first if it is missing.
+        """
+        save_path = os.path.join(DATA_HOME, 'roberta-wwm-ext', 'vocab.txt')
+        if not os.path.exists(save_path) or not os.path.isfile(save_path):
+            url = "https://paddlenlp.bj.bcebos.com/models/transformers/roberta_base/vocab.txt"
+            download(url, os.path.join(DATA_HOME, 'roberta-wwm-ext'))
+        return save_path
+
+    def get_tokenizer(self, tokenize_chinese_chars=True):
+        """
+        Gets the tokenizer that is customized for this module.
+        Args:
+            tokenize_chinese_chars (:obj: bool , defaults to :obj: True):
+                Whether to tokenize chinese characters or not.
+        Returns:
+            tokenizer (:obj:BertTokenizer) : The tokenizer which was customized for this module.
+        """
+        return BertTokenizer(tokenize_chinese_chars=tokenize_chinese_chars, vocab_file=self.get_vocab_path())
+
+    def training_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        Runs one training step, i.e. one forward computation.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data containing the inputs the model needs,
+                here input_ids, token_type_ids and labels, in that order.
+            batch_idx(int): The index of batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as loss and metrics.
+        """
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'loss': avg_loss, 'metrics': {'acc': acc}}
+
+    def validation_step(self, batch: List[paddle.Tensor], batch_idx: int):
+        """
+        Runs one validation step, i.e. one forward computation.
+        Args:
+            batch(:obj:List[paddle.Tensor]): One batch of data containing the inputs the model needs,
+                here input_ids, token_type_ids and labels, in that order.
+            batch_idx(int): The index of batch.
+        Returns:
+            results(:obj: Dict) : The model outputs, such as metrics.
+        """
+        predictions, avg_loss, acc = self(input_ids=batch[0], token_type_ids=batch[1], labels=batch[2])
+        return {'metrics': {'acc': acc}}
+
+    def predict(self, data, max_seq_len=128, batch_size=1, use_gpu=False):
+        """
+        Predicts the data labels.
+
+        Args:
+            data (obj:`List(Union(str))`): The data to predict. Each element is a list holding one raw text
+                (single sequence) or two raw texts (sequence pair).
+            max_seq_len (:obj:`int`, `optional`, defaults to 128):
+                If set to a number, will limit the total sequence returned so that it has a maximum length.
+            batch_size(obj:`int`, defaults to 1): The batch size.
+            use_gpu(obj:`bool`, defaults to `False`): Whether to use gpu to run or not.
+
+        Returns:
+            results(obj:`list`): The predicted labels.
+        """
+        # TODO(zhangxuefei): add task token_classification task predict.
+        if self.task not in ['sequence_classification']:
+            raise RuntimeError("The predict method is for sequence_classification task, but got task %s." % self.task)
+
+        paddle.set_device('gpu') if use_gpu else paddle.set_device('cpu')
+        tokenizer = self.get_tokenizer()
+
+        examples = []
+        for text in data:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, max_seq_len=max_seq_len)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], max_seq_len=max_seq_len)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequences, but got %d. Please check your inputs.' % len(text))
+            examples.append((encoded_inputs['input_ids'], encoded_inputs['segment_ids']))
+
+        def _batchify_fn(batch):
+            input_ids = [entry[0] for entry in batch]
+            segment_ids = [entry[1] for entry in batch]
+            return input_ids, segment_ids
+
+        # Separates data into batches.
+        batches = []
+        one_batch = []
+        for example in examples:
+            one_batch.append(example)
+            if len(one_batch) == batch_size:
+                batches.append(one_batch)
+                one_batch = []
+        if one_batch:
+            # The last batch whose size is less than the config batch_size setting.
+            batches.append(one_batch)
+
+        results = []
+        self.eval()
+        for batch in batches:
+            input_ids, segment_ids = _batchify_fn(batch)
+            input_ids = paddle.to_tensor(input_ids)
+            segment_ids = paddle.to_tensor(segment_ids)
+
+            # TODO(zhangxuefei): add task token_classification postprocess after prediction.
+            if self.task == 'sequence_classification':
+                probs = self(input_ids, segment_ids)
+                idx = paddle.argmax(probs, axis=1).numpy()
+                idx = idx.tolist()
+                labels = [self.label_map[i] for i in idx]
+                results.extend(labels)
+
+        return results
+
+    @serving
+    def get_embedding(self, texts, use_gpu=False):
+        if self.task is not None:
+            raise RuntimeError("The get_embedding method is only valid when task is None, but got task %s" % self.task)
+
+        paddle.set_device('gpu') if use_gpu else paddle.set_device('cpu')
+
+        tokenizer = self.get_tokenizer()
+        results = []
+        for text in texts:
+            if len(text) == 1:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=None, pad_to_max_seq_len=False)
+            elif len(text) == 2:
+                encoded_inputs = tokenizer.encode(text[0], text_pair=text[1], pad_to_max_seq_len=False)
+            else:
+                raise RuntimeError(
+                    'The input text must have one or two sequences, but got %d. Please check your inputs.' % len(text))
+
+            input_ids = paddle.to_tensor(encoded_inputs['input_ids']).unsqueeze(0)
+            segment_ids = paddle.to_tensor(encoded_inputs['segment_ids']).unsqueeze(0)
+            sequence_output, pooled_output = self(input_ids, segment_ids)
+
+            sequence_output = sequence_output.squeeze(0)
+            pooled_output = pooled_output.squeeze(0)
+            results.append((sequence_output.numpy().tolist(), pooled_output.numpy().tolist()))
+        return results
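The batching loop inside `predict` above can be looked at in isolation. The following is a minimal sketch, not part of the PR: `split_into_batches` is a hypothetical helper name, and the sample examples are made-up `(input_ids, segment_ids)` pairs. It groups items into consecutive batches of `batch_size` and keeps a smaller final batch, as the loop in `predict` does.

```python
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def split_into_batches(examples: Iterable[T], batch_size: int) -> List[List[T]]:
    """Group examples into consecutive batches; the last batch may be smaller."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    batches, one_batch = [], []
    for example in examples:
        one_batch.append(example)
        if len(one_batch) == batch_size:
            batches.append(one_batch)
            one_batch = []
    if one_batch:
        # Leftover examples form a final, smaller batch.
        batches.append(one_batch)
    return batches


# Hypothetical encoded examples: (input_ids, segment_ids) pairs.
examples = [([1, 2], [0, 0]), ([3, 4], [0, 0]), ([5, 6], [0, 0])]
batches = split_into_batches(examples, batch_size=2)
```

With three examples and `batch_size=2`, the first batch holds two examples and the second holds the remaining one.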
--- a/paddlehub/__init__.py
+++ b/paddlehub/__init__.py
@@ -21,13 +21,16 @@ __version__ = '2.0.0-beta0'

 from paddlehub import env
 from paddlehub.config import config
+from paddlehub import datasets
+from paddlehub.finetune import Trainer
 from paddlehub.utils import log, parser, utils
 from paddlehub.utils import download as _download
 from paddlehub.utils.paddlex import download, ResourceNotFoundError
 from paddlehub.server import server_check
 from paddlehub.server.server_source import ServerConnectionError
 from paddlehub.module import Module
-from paddlehub.text.bert_tokenizer import BertTokenizer
+from paddlehub.text.bert_tokenizer import BertTokenizer, ErnieTinyTokenizer
+from paddlehub.text.tokenizer import CustomTokenizer

 # In order to maintain the compatibility of the old version, we put the relevant
 # compatible code in the paddlehub.compat package, and mapped some modules referenced

--- a/paddlehub/datasets/__init__.py
+++ b/paddlehub/datasets/__init__.py
-# coding:utf-8
 # Copyright (c) 2020  PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License"
@@ -16,3 +15,4 @@
 from paddlehub.datasets.canvas import Canvas
 from paddlehub.datasets.flowers import Flowers
 from paddlehub.datasets.minicoco import MiniCOCO
+from paddlehub.datasets.chnsenticorp import ChnSentiCorp
--- a/paddlehub/datasets/base_nlp_dataset.py
+++ b/paddlehub/datasets/base_nlp_dataset.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import Dict, List, Optional, Union, Tuple
+import csv
+import io
+import os
+
+import numpy as np
+import paddle.fluid as fluid
+
+from paddlehub.env import DATA_HOME
+from paddlehub.text.bert_tokenizer import BertTokenizer
+from paddlehub.text.tokenizer import CustomTokenizer
+from paddlehub.utils.log import logger
+from paddlehub.utils.utils import download
+from paddlehub.utils.xarfile import is_xarfile, unarchive
+
+
+class InputExample(object):
+    """
+    The input data structure of Transformer modules (BERT, ERNIE and so on).
+    """
+
+    def __init__(self, guid: int, text_a: str, text_b: Optional[str] = None, label: Optional[str] = None):
+        """
+        The input data structure.
+        Args:
+          guid (:obj:`int`):
+              Unique id for the input data.
+          text_a (:obj:`str`):
+              The first sequence. For single sequence tasks, only this sequence must be specified.
+          text_b (:obj:`str`, `optional`, defaults to :obj:`None`):
+              The second sequence if sentence-pair.
+          label (:obj:`str`, `optional`, defaults to :obj:`None`):
+              The label of the example.
+        Examples:
+            .. code-block:: python
+                from paddlehub.datasets.base_nlp_dataset import InputExample
+                example = InputExample(guid=0,
+                                text_a='15.4寸笔记本的键盘确实爽，基本跟台式机差不多了',
+                                text_b='蛮喜欢数字小键盘，输数字特方便，样子也很美观，做工也相当不错',
+                                label='1')
+        """
+        self.guid = guid
+        self.text_a = text_a
+        self.text_b = text_b
+        self.label = label
+
+    def __str__(self):
+        if self.text_b is None:
+            return "text={}\tlabel={}".format(self.text_a, self.label)
+        else:
+            return "text_a={}\ttext_b={},label={}".format(self.text_a, self.text_b, self.label)
+
+
+class BaseNLPDataset(object):
+    """
+    The virtual base class for NLP datasets, such as TextClassificationDataset, SeqLabelingDataset, and so on.
+    Subclasses must override the method _read_file.
+    """
+
+    def __init__(self,
+                 base_path: str,
+                 tokenizer: Union[BertTokenizer, CustomTokenizer],
+                 max_seq_len: Optional[int] = 128,
+                 mode: Optional[str] = "train",
+                 data_file: Optional[str] = None,
+                 label_file: Optional[str] = None,
+                 label_list: Optional[List[str]] = None):
+        """
+        Args:
+            base_path (:obj:`str`): The directory to the whole dataset.
+            tokenizer (:obj:`BertTokenizer` or :obj:`CustomTokenizer`):
+                It tokenizes the text and encodes the data as model needed.
+            max_seq_len (:obj:`int`, `optional`, defaults to :128):
+                If set to a number, will limit the total sequence returned so that it has a maximum length.
+            mode (:obj:`str`, `optional`, defaults to `train`):
+                It identifies the dataset mode (train, test or dev).
+            data_file(:obj:`str`, `optional`, defaults to :obj:`None`):
+                The data file name, which is relative to the base_path.
+            label_file(:obj:`str`, `optional`, defaults to :obj:`None`):
+                The label file name, which is relative to the base_path.
+                It lists all labels of the dataset, one label per line.
+            label_list(:obj:`List[str]`, `optional`, defaults to :obj:`None`):
+                The list of all labels of the dataset
+        """
+        self.data_file = os.path.join(base_path, data_file)
+        self.label_list = label_list
+
+        self.mode = mode
+        self.tokenizer = tokenizer
+        self.max_seq_len = max_seq_len
+
+        if label_file:
+            self.label_file = os.path.join(base_path, label_file)
+            if not self.label_list:
+                self.label_list = self._load_label_data()
+            else:
+                logger.warning("As label_list has been assigned, label_file is ignored")
+        if self.label_list:
+            self.label_map = {item: index for index, item in enumerate(self.label_list)}
+
+    def _load_label_data(self):
+        """
+        Loads labels from label file.
+        """
+        if os.path.exists(self.label_file):
+            with open(self.label_file, "r", encoding="utf8") as f:
+                return f.read().strip().split("\n")
+        else:
+            raise RuntimeError("The file {} is not found.".format(self.label_file))
+
+    def _download_and_uncompress_dataset(self, destination: str, url: str):
+        """
+        Downloads dataset and uncompresses it.
+        Args:
+           destination (:obj:`str`): The directory where the dataset is cached.
+           url (:obj: str): The URL to download the dataset from.
+        """
+        if not os.path.exists(destination):
+            dataset_package = download(url=url, path=DATA_HOME)
+            if is_xarfile(dataset_package):
+                unarchive(dataset_package, DATA_HOME)
+        else:
+            logger.info("Dataset {} already cached.".format(destination))
+
+    def _read_file(self, input_file: str, is_file_with_header: bool = False):
+        """
+        Reads the files.
+        Args:
+            input_file (:obj:str) : The file to be read.
+            is_file_with_header(:obj:bool, `optional`, default to :obj: False) :
+                Whether or not the file is with the header introduction.
+        """
+        raise NotImplementedError
+
+    def get_labels(self):
+        """
+        Gets all labels.
+        """
+        return self.label_list
+
+
+class TextClassificationDataset(BaseNLPDataset, fluid.io.Dataset):
+    """
+    The dataset class that fits all text classification datasets.
+    """
+
+    def __init__(self,
+                 base_path: str,
+                 tokenizer: Union[BertTokenizer, CustomTokenizer],
+                 max_seq_len: int = 128,
+                 mode: str = "train",
+                 data_file: str = None,
+                 label_file: str = None,
+                 label_list: list = None,
+                 is_file_with_header: bool = False):
+        """
+        Args:
+            base_path (:obj:`str`): The directory to the whole dataset.
+            tokenizer (:obj:`BertTokenizer` or :obj:`CustomTokenizer`):
+                It tokenizes the text and encodes the data as model needed.
+            max_seq_len (:obj:`int`, `optional`, defaults to :128):
+                If set to a number, will limit the total sequence returned so that it has a maximum length.
+            mode (:obj:`str`, `optional`, defaults to `train`):
+                It identifies the dataset mode (train, test or dev).
+            data_file(:obj:`str`, `optional`, defaults to :obj:`None`):
+                The data file name, which is relative to the base_path.
+            label_file(:obj:`str`, `optional`, defaults to :obj:`None`):
+                The label file name, which is relative to the base_path.
+                It lists all labels of the dataset, one label per line.
+            label_list(:obj:`List[str]`, `optional`, defaults to :obj:`None`):
+                The list of all labels of the dataset
+            is_file_with_header(:obj:bool, `optional`, default to :obj: False) :
+                Whether or not the file is with the header introduction.
+        """
+        super(TextClassificationDataset, self).__init__(
+            base_path=base_path,
+            tokenizer=tokenizer,
+            max_seq_len=max_seq_len,
+            mode=mode,
+            data_file=data_file,
+            label_file=label_file,
+            label_list=label_list)
+        self.examples = self._read_file(self.data_file, is_file_with_header)
+
+        self.records = self._convert_examples_to_records(self.examples)
+
+    def _read_file(self, input_file, is_file_with_header: bool = False) -> List[InputExample]:
+        """
+        Reads a tab separated value file.
+        Args:
+            input_file (:obj:str) : The file to be read.
+            is_file_with_header(:obj:bool, `optional`, default to :obj: False) :
+                Whether or not the file is with the header introduction.
+        Returns:
+            examples (:obj:`List[InputExample]`): All the input data.
+        """
+        if not os.path.exists(input_file):
+            raise RuntimeError("The file {} is not found.".format(input_file))
+        else:
+            with io.open(input_file, "r", encoding="UTF-8") as f:
+                reader = csv.reader(f, delimiter="\t", quotechar=None)
+                examples = []
+                seq_id = 0
+                header = next(reader) if is_file_with_header else None
+                for line in reader:
+                    example = InputExample(guid=seq_id, label=line[0], text_a=line[1])
+                    seq_id += 1
+                    examples.append(example)
+                return examples
+
+    def _convert_examples_to_records(self, examples: List[InputExample]) -> List[dict]:
+        """
+        Converts all examples to records which the model needs.
+        Args:
+            examples(obj:`List[InputExample]`): All data examples returned by _read_file.
+        Returns:
+            records(:obj:`List[dict]`): All records which the model needs.
+        """
+        records = []
+        for example in examples:
+            record = self.tokenizer.encode(text=example.text_a, text_pair=example.text_b, max_seq_len=self.max_seq_len)
+            # CustomTokenizer first tokenizes the text and then looks up the words in the vocab.
+            # When none of the words are found in the vocab, the text is dropped.
+            if not record:
+                logger.info(
+                    "The text %s has been dropped as it has no words in the vocab after tokenization." % example.text_a)
+                continue
+            if example.label:
+                record['label'] = self.label_map[example.label]
+            records.append(record)
+        return records
+
+    def __getitem__(self, idx):
+        record = self.records[idx]
+        if 'label' in record.keys():
+            if isinstance(self.tokenizer, BertTokenizer):
+                return np.array(record['input_ids']), np.array(record['segment_ids']), np.array(record['label'])
+            elif isinstance(self.tokenizer, CustomTokenizer):
+                return np.array(record['text']), np.array(record['seq_len']), np.array(record['label'])
+        else:
+            if isinstance(self.tokenizer, BertTokenizer):
+                return np.array(record['input_ids']), np.array(record['segment_ids'])
+            elif isinstance(self.tokenizer, CustomTokenizer):
+                return np.array(record['text']), np.array(record['seq_len'])
+
+    def __len__(self):
+        return len(self.records)
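The way `_read_file` and `label_map` fit together can be sketched without PaddleHub. This is an illustrative example under stated assumptions: the two-row sample is invented, the `records` dict layout is hypothetical, and `quoting=csv.QUOTE_NONE` is passed explicitly here instead of the `quotechar=None` used in the diff. It reads `label<TAB>text` rows after a header and maps each string label to its index in `label_list`.

```python
import csv
import io

# Hypothetical sample in the layout `_read_file` expects:
# a header row, then `label<TAB>text` on each line.
sample = "label\ttext_a\n1\tclean room\n0\tpoor service\n"

# Same construction as `label_map` in BaseNLPDataset.__init__.
label_list = ["0", "1"]
label_map = {item: index for index, item in enumerate(label_list)}

reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
next(reader)  # skip the header row (is_file_with_header=True)

# Column 0 is the label, column 1 is the text, as in `_read_file`.
records = [{"text": row[1], "label": label_map[row[0]]} for row in reader]
```

A label that is missing from `label_list` would raise `KeyError` at the lookup, which is also what `self.label_map[example.label]` would do in `_convert_examples_to_records`.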
--- a/paddlehub/datasets/chnsenticorp.py
+++ b/paddlehub/datasets/chnsenticorp.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import Dict, List, Optional, Union, Tuple
+import os
+
+from paddlehub.env import DATA_HOME
+from paddlehub.utils.download import download_data
+from paddlehub.datasets.base_nlp_dataset import TextClassificationDataset
+from paddlehub.text.bert_tokenizer import BertTokenizer
+from paddlehub.text.tokenizer import CustomTokenizer
+
+
+@download_data(url="https://bj.bcebos.com/paddlehub-dataset/chnsenticorp.tar.gz")
+class ChnSentiCorp(TextClassificationDataset):
+    """
+    ChnSentiCorp is a dataset for chinese sentiment classification,
+    which was published by Tan Songbo at ICT of Chinese Academy of Sciences.
+    """
+
+    # TODO(zhangxuefei): simplify dataset load, such as
+    # train_ds, dev_ds, test_ds = hub.datasets.ChnSentiCorp(tokenizer=xxx, max_seq_len=128, select='train', 'test', 'valid')
+    def __init__(self, tokenizer: Union[BertTokenizer, CustomTokenizer], max_seq_len: int = 128, mode: str = 'train'):
+        """
+        Args:
+            tokenizer (:obj:`BertTokenizer` or `CustomTokenizer`):
+                It tokenizes the text and encodes the data as model needed.
+            max_seq_len (:obj:`int`, `optional`, defaults to :128):
+                The maximum length (in number of tokens) for the inputs to the selected module,
+                such as ernie, bert and so on.
+            mode (:obj:`str`, `optional`, defaults to `train`):
+                It identifies the dataset mode (train, test or dev).
+        Examples:
+            .. code-block:: python
+                import paddlehub as hub
+
+                tokenizer = hub.BertTokenizer(vocab_file='./vocab.txt')
+                train_dataset = hub.datasets.ChnSentiCorp(tokenizer=tokenizer, max_seq_len=120, mode='train')
+                dev_dataset = hub.datasets.ChnSentiCorp(tokenizer=tokenizer, max_seq_len=120, mode='dev')
+                test_dataset = hub.datasets.ChnSentiCorp(tokenizer=tokenizer, max_seq_len=120, mode='test')
+
+        """
+        base_path = os.path.join(DATA_HOME, "chnsenticorp")
+        if mode == 'train':
+            data_file = 'train.tsv'
+        elif mode == 'test':
+            data_file = 'test.tsv'
+        else:
+            data_file = 'dev.tsv'
+        super(ChnSentiCorp, self).__init__(
+            base_path=base_path,
+            tokenizer=tokenizer,
+            max_seq_len=max_seq_len,
+            mode=mode,
+            data_file=data_file,
+            label_list=["0", "1"],
+            is_file_with_header=True)
--- a/paddlehub/datasets/flowers.py
+++ b/paddlehub/datasets/flowers.py
-# coding:utf-8
 # Copyright (c) 2020  PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License"

--- a/paddlehub/finetune/__init__.py
+++ b/paddlehub/finetune/__init__.py
+# Copyright (c) 2020  PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from paddlehub.finetune.trainer import Trainer
--- a/paddlehub/finetune/trainer.py
+++ b/paddlehub/finetune/trainer.py
-# coding:utf-8
 # Copyright (c) 2020  PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License"
@@ -33,6 +32,7 @@ class Trainer(object):
    Args:
        model(paddle.nn.Layer) : Model to train or evaluate.
        optimizer(paddle.optimizer.Optimizer) : Optimizer for loss.
+        use_gpu(bool) : Whether to use gpu to run.
        use_vdl(bool) : Whether to use visualdl to record training data.
        checkpoint_dir(str) : Directory where the checkpoint is saved, and the trainer will restore the
            state and model parameters from the checkpoint.
@@ -52,9 +52,12 @@ class Trainer(object):
    def __init__(self,
                 model: paddle.nn.Layer,
                 optimizer: paddle.optimizer.Optimizer,
+                 use_gpu: bool = False,
                 use_vdl: bool = True,
                 checkpoint_dir: str = None,
-                 compare_metrics: Callable = None):
+                 compare_metrics: Callable = None,
+                 **kwargs):
+        paddle.set_device('gpu') if use_gpu else paddle.set_device('cpu')
        self.nranks = paddle.distributed.get_world_size()
        self.local_rank = paddle.distributed.get_rank()
        self.model = model
@@ -154,7 +157,8 @@ class Trainer(object):
              num_workers: int = 0,
              eval_dataset: paddle.io.Dataset = None,
              log_interval: int = 10,
-              save_interval: int = 10):
+              save_interval: int = 10,
+              collate_fn: Callable = None):
        '''
        Train a model with specific config.

@@ -167,6 +171,8 @@ class Trainer(object):
                execute evaluate function every `save_interval` epochs.
            log_interval(int) : Log the train infomation every `log_interval` steps.
            save_interval(int) : Save the checkpoint every `save_interval` epochs.
+            collate_fn(callable) : Function that merges a list of samples into a mini-batch.
+                If None, each field of the samples is stacked along axis 0 (same as `np.stack(..., axis=0)`). Default is None.
        '''
        batch_sampler = paddle.io.DistributedBatchSampler(
            train_dataset, batch_size=batch_size, shuffle=True, drop_last=False)
@@ -175,7 +181,8 @@ class Trainer(object):
            batch_sampler=batch_sampler,
            num_workers=num_workers,
            return_list=True,
-            use_buffer_reader=True)
+            use_buffer_reader=True,
+            collate_fn=collate_fn)

        steps_per_epoch = len(batch_sampler)
        timer = Timer(steps_per_epoch * epochs)
@@ -195,7 +202,9 @@ class Trainer(object):
                # calculate metrics and loss
                avg_loss += loss.numpy()[0]
                for metric, value in metrics.items():
-                    avg_metrics[metric] += value.numpy()[0]
+                    if isinstance(value, paddle.Tensor):
+                        value = value.numpy()
+                    avg_metrics[metric] += value

                timer.count()

@@ -225,7 +234,7 @@ class Trainer(object):

                if self.current_epoch % save_interval == 0 and batch_idx + 1 == steps_per_epoch and self.local_rank == 0:
                    if eval_dataset:
-                        result = self.evaluate(eval_dataset, batch_size, num_workers)
+                        result = self.evaluate(eval_dataset, batch_size, num_workers, collate_fn=collate_fn)
                        eval_loss = result.get('loss', None)
                        eval_metrics = result.get('metrics', {})
                        if self.use_vdl:
@@ -250,7 +259,11 @@ class Trainer(object):

                    self._save_checkpoint()

-    def evaluate(self, eval_dataset: paddle.io.Dataset, batch_size: int = 1, num_workers: int = 0):
+    def evaluate(self,
+                 eval_dataset: paddle.io.Dataset,
+                 batch_size: int = 1,
+                 num_workers: int = 0,
+                 collate_fn: Callable = None):
        '''
        Run evaluation and returns metrics.

@@ -258,12 +271,18 @@ class Trainer(object):
            eval_dataset(paddle.io.Dataset) : The validation dataset
            batch_size(int) : Batch size of per step, default is 1.
            num_workers(int) : Number of subprocess to load data, default is 0.
+            collate_fn(callable) : Function that merges a list of samples into a mini-batch.
+                If None, each field of the samples is stacked along axis 0 (same as `np.stack(..., axis=0)`). Default is None.
        '''
-        batch_sampler = paddle.io.DistributedBatchSampler(
-            eval_dataset, batch_size=batch_size, shuffle=False, drop_last=False)
+        if self.local_rank == 0:
+            batch_sampler = paddle.io.BatchSampler(eval_dataset, batch_size=batch_size, shuffle=False, drop_last=False)

            loader = paddle.io.DataLoader(
-            eval_dataset, batch_sampler=batch_sampler, num_workers=num_workers, return_list=True)
+                eval_dataset,
+                batch_sampler=batch_sampler,
+                num_workers=num_workers,
+                return_list=True,
+                collate_fn=collate_fn)

            self.model.eval()
            avg_loss = num_samples = 0
@@ -282,7 +301,7 @@ class Trainer(object):
                        avg_loss += loss.numpy()[0] * bs

                    for metric, value in metrics.items():
-                    sum_metrics[metric] += value.numpy()[0] * bs
+                        sum_metrics[metric] += value * bs

            # print avg metrics and loss
            print_msg = '[Evaluation result]'
@@ -318,7 +337,7 @@ class Trainer(object):
            raise RuntimeError('The return value of `trainning_step` in {} is not a dict'.format(self.model.__class__))

        loss = result.get('loss', None)
-        if not loss:
+        if loss is None:
            raise RuntimeError('Cannot find loss attribute in the return value of `trainning_step` of {}'.format(
                self.model.__class__))
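The change from `if not loss` to `if loss is None` in the hunk above matters when a loss is present but falsy. The sketch below uses plain Python floats as a stand-in, so it only illustrates the general truthiness pitfall; `extract_loss` is a hypothetical helper, and how a `paddle.Tensor` behaves under `not` is not asserted here.

```python
def extract_loss(result: dict):
    """Return result['loss'], raising only when the key is truly absent."""
    loss = result.get("loss", None)
    if loss is None:
        raise RuntimeError("Cannot find loss in the return value")
    return loss


# A loss of exactly zero is a valid value and is returned unchanged.
zero_loss = extract_loss({"loss": 0.0})

# A truthiness check would have treated the same value as missing.
rejected_by_truthiness = not 0.0

# A result without a loss is still rejected.
try:
    extract_loss({"metrics": {}})
    missing_detected = False
except RuntimeError:
    missing_detected = True
```

The identity check separates "no loss returned" from "loss returned with a falsy value", which the truthiness check cannot do.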


--- a/paddlehub/module/modeling_bert.py
+++ b/paddlehub/module/modeling_bert.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# FIXME(zhangxuefei): remove this file after paddlenlp is released.
+
+import paddle
+import paddle.nn as nn
+
+from paddlehub.module.nlp_module import PretrainedModel, register_base_model
+
+
+class BertEmbeddings(nn.Layer):
+    """
+    Combines word, position and token_type embeddings.
+    """
+
+    def __init__(self,
+                 vocab_size,
+                 hidden_size=768,
+                 hidden_dropout_prob=0.1,
+                 max_position_embeddings=512,
+                 type_vocab_size=16):
+        super(BertEmbeddings, self).__init__()
+        self.word_embeddings = nn.Embedding(vocab_size, hidden_size, padding_idx=0)
+        self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size)
+        self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size)
+        self.layer_norm = nn.LayerNorm(hidden_size)
+        self.dropout = nn.Dropout(hidden_dropout_prob)
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None):
+        if position_ids is None:
+            # maybe need use shape op to unify static graph and dynamic graph
+            seq_length = input_ids.shape[1]
+            position_ids = paddle.arange(0, seq_length, dtype="int64")
+        if token_type_ids is None:
+            token_type_ids = paddle.zeros_like(input_ids, dtype="int64")
+
+        input_embeddings = self.word_embeddings(input_ids)
+        position_embeddings = self.position_embeddings(position_ids)
+        token_type_embeddings = self.token_type_embeddings(token_type_ids)
+
+        embeddings = input_embeddings + position_embeddings + token_type_embeddings
+        embeddings = self.layer_norm(embeddings)
+        embeddings = self.dropout(embeddings)
+        return embeddings
+
+
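+class BertPooler(nn.Layer):
+    """
+    Pools the sequence output by taking the hidden state of the first token
+    and passing it through a dense layer with tanh activation.
+    """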
+
+    def __init__(self, hidden_size):
+        super(BertPooler, self).__init__()
+        self.dense = nn.Linear(hidden_size, hidden_size)
+        self.activation = nn.Tanh()
+
+    def forward(self, hidden_states):
+        # We "pool" the model by simply taking the hidden state corresponding
+        # to the first token.
+        first_token_tensor = hidden_states[:, 0]
+        pooled_output = self.dense(first_token_tensor)
+        pooled_output = self.activation(pooled_output)
+        return pooled_output
+
+
+class BertPretrainedModel(PretrainedModel):
+    """
+    An abstract class for pretrained BERT models. It provides BERT related
+    `model_config_file`, `resource_files_names`, `pretrained_resource_files_map`,
+    `pretrained_init_configuration`, `base_model_prefix` for downloading and
+    loading pretrained models. See `PretrainedModel` for more details.
+    """
+
+    model_config_file = "model_config.json"
+    pretrained_init_configuration = {
+        "bert-base-uncased": {
+            "vocab_size": 30522,
+            "hidden_size": 768,
+            "num_hidden_layers": 12,
+            "num_attention_heads": 12,
+            "intermediate_size": 3072,
+            "hidden_act": "gelu",
+            "hidden_dropout_prob": 0.1,
+            "attention_probs_dropout_prob": 0.1,
+            "max_position_embeddings": 512,
+            "type_vocab_size": 2,
+            "initializer_range": 0.02,
+            "pad_token_id": 0,
+        },
+        "bert-large-uncased": {
+            "vocab_size": 30522,
+            "hidden_size": 1024,
+            "num_hidden_layers": 24,
+            "num_attention_heads": 16,
+            "intermediate_size": 4096,
+            "hidden_act": "gelu",
+            "hidden_dropout_prob": 0.1,
+            "attention_probs_dropout_prob": 0.1,
+            "max_position_embeddings": 512,
+            "type_vocab_size": 2,
+            "initializer_range": 0.02,
+            "pad_token_id": 0,
+        },
+        "bert-base-multilingual-uncased": {
+            "vocab_size": 105879,
+            "hidden_size": 768,
+            "num_hidden_layers": 12,
+            "num_attention_heads": 12,
+            "intermediate_size": 3072,
+            "hidden_act": "gelu",
+            "hidden_dropout_prob": 0.1,
+            "attention_probs_dropout_prob": 0.1,
+            "max_position_embeddings": 512,
+            "type_vocab_size": 2,
+            "initializer_range": 0.02,
+            "pad_token_id": 0,
+        },
+        "bert-base-cased": {
+            "vocab_size": 28996,
+            "hidden_size": 768,
+            "num_hidden_layers": 12,
+            "num_attention_heads": 12,
+            "intermediate_size": 3072,
+            "hidden_act": "gelu",
+            "hidden_dropout_prob": 0.1,
+            "attention_probs_dropout_prob": 0.1,
+            "max_position_embeddings": 512,
+            "type_vocab_size": 2,
+            "initializer_range": 0.02,
+            "pad_token_id": 0,
+        },
+        "bert-base-chinese": {
+            "vocab_size": 21128,
+            "hidden_size": 768,
+            "num_hidden_layers": 12,
+            "num_attention_heads": 12,
+            "intermediate_size": 3072,
+            "hidden_act": "gelu",
+            "hidden_dropout_prob": 0.1,
+            "attention_probs_dropout_prob": 0.1,
+            "max_position_embeddings": 512,
+            "type_vocab_size": 2,
+            "initializer_range": 0.02,
+            "pad_token_id": 0,
+        },
+        "bert-base-multilingual-cased": {
+            "vocab_size": 119547,
+            "hidden_size": 768,
+            "num_hidden_layers": 12,
+            "num_attention_heads": 12,
+            "intermediate_size": 3072,
+            "hidden_act": "gelu",
+            "hidden_dropout_prob": 0.1,
+            "attention_probs_dropout_prob": 0.1,
+            "max_position_embeddings": 512,
+            "type_vocab_size": 2,
+            "initializer_range": 0.02,
+            "pad_token_id": 0,
+        },
+        "bert-large-cased": {
+            "vocab_size": 28996,
+            "hidden_size": 1024,
+            "num_hidden_layers": 24,
+            "num_attention_heads": 16,
+            "intermediate_size": 4096,
+            "hidden_act": "gelu",
+            "hidden_dropout_prob": 0.1,
+            "attention_probs_dropout_prob": 0.1,
+            "max_position_embeddings": 512,
+            "type_vocab_size": 2,
+            "initializer_range": 0.02,
+            "pad_token_id": 0,
+        },
+    }
+    resource_files_names = {"model_state": "model_state.pdparams"}
+    pretrained_resource_files_map = {
+        "model_state": {
+            "bert-base-uncased": "https://paddlenlp.bj.bcebos.com/models/transformers/bert-base-uncased.pdparams",
+            "bert-large-uncased": "https://paddlenlp.bj.bcebos.com/models/transformers/bert-large-uncased.pdparams",
+            "bert-base-multilingual-uncased":
+            "http://paddlenlp.bj.bcebos.com/models/transformers/bert-base-multilingual-uncased.pdparams",
+            "bert-base-cased": "http://paddlenlp.bj.bcebos.com/models/transformers/bert/bert-base-cased.pdparams",
+            "bert-base-chinese": "http://paddlenlp.bj.bcebos.com/models/transformers/bert/bert-base-chinese.pdparams",
+            "bert-base-multilingual-cased":
+            "http://paddlenlp.bj.bcebos.com/models/transformers/bert/bert-base-multilingual-cased.pdparams",
+            "bert-large-cased": "http://paddlenlp.bj.bcebos.com/models/transformers/bert/bert-large-cased.pdparams"
+        }
+    }
+    base_model_prefix = "bert"
+
+    def init_weights(self, layer):
+        """ Initialization hook """
+        if isinstance(layer, (nn.Linear, nn.Embedding)):
+            # Only dygraph mode is supported for now.
+            # TODO: use truncated_normal, update in place, and make this configurable.
+            layer.weight.set_value(
+                paddle.tensor.normal(
+                    mean=0.0,
+                    std=self.initializer_range
+                    if hasattr(self, "initializer_range") else self.bert.config["initializer_range"],
+                    shape=layer.weight.shape))
+        elif isinstance(layer, nn.LayerNorm):
+            layer._epsilon = 1e-12
+
+
+@register_base_model
+class BertModel(BertPretrainedModel):
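+    """
+    The bare BERT model: embeddings, a stack of Transformer encoder layers and
+    a pooler. `forward` returns `(sequence_output, pooled_output)`.
+    """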
+
+    def __init__(self,
+                 vocab_size,
+                 hidden_size=768,
+                 num_hidden_layers=12,
+                 num_attention_heads=12,
+                 intermediate_size=3072,
+                 hidden_act="gelu",
+                 hidden_dropout_prob=0.1,
+                 attention_probs_dropout_prob=0.1,
+                 max_position_embeddings=512,
+                 type_vocab_size=16,
+                 initializer_range=0.02,
+                 pad_token_id=0):
+        super(BertModel, self).__init__()
+        self.pad_token_id = pad_token_id
+        self.initializer_range = initializer_range
+        self.embeddings = BertEmbeddings(vocab_size, hidden_size, hidden_dropout_prob, max_position_embeddings,
+                                         type_vocab_size)
+        encoder_layer = nn.TransformerEncoderLayer(
+            hidden_size,
+            num_attention_heads,
+            intermediate_size,
+            dropout=hidden_dropout_prob,
+            activation=hidden_act,
+            attn_dropout=attention_probs_dropout_prob,
+            act_dropout=0)
+        self.encoder = nn.TransformerEncoder(encoder_layer, num_hidden_layers)
+        self.pooler = BertPooler(hidden_size)
+        self.apply(self.init_weights)
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None):
+        if attention_mask is None:
+            attention_mask = paddle.unsqueeze(
+                (input_ids == self.pad_token_id).astype(self.pooler.dense.weight.dtype) * -1e9, axis=[1, 2])
+        embedding_output = self.embeddings(
+            input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids)
+        encoder_outputs = self.encoder(embedding_output, attention_mask)
+        sequence_output = encoder_outputs
+        pooled_output = self.pooler(sequence_output)
+        return sequence_output, pooled_output
+
+
+class BertForSequenceClassification(BertPretrainedModel):
+    """
+    Model for sentence (pair) classification task with BERT.
+    Args:
+        bert (BertModel): An instance of BertModel.
+        num_classes (int, optional): The number of classes. Default 2
+        dropout (float, optional): The dropout probability for output of BERT.
+            If None, use the same value as `hidden_dropout_prob` of `BertModel`
+            instance `bert`. Default None
+    """
+
+    def __init__(self, bert, num_classes=2, dropout=None):
+        super(BertForSequenceClassification, self).__init__()
+        self.num_classes = num_classes
+        self.bert = bert  # passed in as an instance so the caller can configure it
+        self.dropout = nn.Dropout(dropout if dropout is not None else self.bert.config["hidden_dropout_prob"])
+        self.classifier = nn.Linear(self.bert.config["hidden_size"], num_classes)
+        self.apply(self.init_weights)
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None):
+        _, pooled_output = self.bert(
+            input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask)
+
+        pooled_output = self.dropout(pooled_output)
+        logits = self.classifier(pooled_output)
+        return logits
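Review note: when `attention_mask` is `None`, `BertModel.forward` above builds an additive mask from the padding positions, `(input_ids == pad_token_id) * -1e9`, and unsqueezes it on axes 1 and 2 so that it broadcasts over heads and query positions. The following is a minimal pure-Python sketch of that computation, with nested lists standing in for tensors; the helper name `additive_padding_mask` is hypothetical and is not part of this commit.

```python
def additive_padding_mask(input_ids, pad_token_id=0, neg=-1e9):
    """Return a [batch, 1, 1, seq_len] nested list.

    Padding positions get `neg` (so softmax drives their attention weight
    towards zero); all other positions get 0.0.
    """
    return [[[[neg if tok == pad_token_id else 0.0 for tok in row]]] for row in input_ids]


# Two sequences padded to length 5 with pad_token_id=0 (token ids are illustrative).
mask = additive_padding_mask([[101, 2769, 102, 0, 0], [101, 102, 0, 0, 0]])
print(mask[0][0][0])
```

The mask is added to the attention scores rather than multiplied, which is why real tokens map to 0.0 and padding maps to a large negative value.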
--- a/paddlehub/module/modeling_ernie.py
+++ b/paddlehub/module/modeling_ernie.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# FIXME(zhangxuefei): remove this file after paddlenlp is released.
+
+import paddle
+import paddle.nn as nn
+
+from paddlehub.module.nlp_module import PretrainedModel, register_base_model
+
+
+class ErnieEmbeddings(nn.Layer):
+    """
+    Include embeddings from word, position and token_type embeddings
+    """
+
+    def __init__(self,
+                 vocab_size,
+                 hidden_size=768,
+                 hidden_dropout_prob=0.1,
+                 max_position_embeddings=512,
+                 type_vocab_size=2,
+                 pad_token_id=0):
+        super(ErnieEmbeddings, self).__init__()
+        self.word_embeddings = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_token_id)
+        self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size)
+        self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size)
+        self.layer_norm = nn.LayerNorm(hidden_size)
+        self.dropout = nn.Dropout(hidden_dropout_prob)
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None):
+        if position_ids is None:
+            # may need to use the shape op to unify static and dynamic graph behavior
+            seq_length = input_ids.shape[1]
+            position_ids = paddle.arange(0, seq_length, dtype="int64")
+        if token_type_ids is None:
+            token_type_ids = paddle.zeros_like(input_ids, dtype="int64")
+
+        input_embeddings = self.word_embeddings(input_ids)
+        position_embeddings = self.position_embeddings(position_ids)
+        token_type_embeddings = self.token_type_embeddings(token_type_ids)
+
+        embeddings = input_embeddings + position_embeddings + token_type_embeddings
+        embeddings = self.layer_norm(embeddings)
+        embeddings = self.dropout(embeddings)
+        return embeddings
+
+
+class ErniePooler(nn.Layer):
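+    """
+    Pool the encoder output by passing the hidden state of the first token
+    through a dense layer followed by tanh.
+    """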
+
+    def __init__(self, hidden_size):
+        super(ErniePooler, self).__init__()
+        self.dense = nn.Linear(hidden_size, hidden_size)
+        self.activation = nn.Tanh()
+
+    def forward(self, hidden_states):
+        # We "pool" the model by simply taking the hidden state corresponding
+        # to the first token.
+        first_token_tensor = hidden_states[:, 0]
+        pooled_output = self.dense(first_token_tensor)
+        pooled_output = self.activation(pooled_output)
+        return pooled_output
+
+
+class ErniePretrainedModel(PretrainedModel):
+    """
+    An abstract class for pretrained ERNIE models. It provides ERNIE related
+    `model_config_file`, `resource_files_names`, `pretrained_resource_files_map`,
+    `pretrained_init_configuration`, `base_model_prefix` for downloading and
+    loading pretrained models. See `PretrainedModel` for more details.
+    """
+
+    model_config_file = "model_config.json"
+    pretrained_init_configuration = {
+        "ernie": {
+            "attention_probs_dropout_prob": 0.1,
+            "hidden_act": "relu",
+            "hidden_dropout_prob": 0.1,
+            "hidden_size": 768,
+            "initializer_range": 0.02,
+            "max_position_embeddings": 513,
+            "num_attention_heads": 12,
+            "num_hidden_layers": 12,
+            "type_vocab_size": 2,
+            "vocab_size": 18000,
+            "pad_token_id": 0,
+        },
+        "ernie_tiny": {
+            "attention_probs_dropout_prob": 0.1,
+            "hidden_act": "relu",
+            "hidden_dropout_prob": 0.1,
+            "hidden_size": 1024,
+            "initializer_range": 0.02,
+            "intermediate_size": 4096,
+            "max_position_embeddings": 600,
+            "num_attention_heads": 16,
+            "num_hidden_layers": 3,
+            "type_vocab_size": 2,
+            "vocab_size": 50006,
+            "pad_token_id": 0,
+        },
+        "ernie_v2_eng_base": {
+            "attention_probs_dropout_prob": 0.1,
+            "hidden_act": "gelu",
+            "hidden_dropout_prob": 0.1,
+            "hidden_size": 768,
+            "initializer_range": 0.02,
+            "max_position_embeddings": 512,
+            "num_attention_heads": 12,
+            "num_hidden_layers": 12,
+            "type_vocab_size": 4,
+            "vocab_size": 30522,
+            "pad_token_id": 0,
+        },
+        "ernie_v2_eng_large": {
+            "attention_probs_dropout_prob": 0.1,
+            "intermediate_size": 4096,
+            "hidden_act": "gelu",
+            "hidden_dropout_prob": 0.1,
+            "hidden_size": 1024,
+            "initializer_range": 0.02,
+            "max_position_embeddings": 512,
+            "num_attention_heads": 16,
+            "num_hidden_layers": 24,
+            "type_vocab_size": 4,
+            "vocab_size": 30522,
+            "pad_token_id": 0,
+        },
+    }
+    resource_files_names = {"model_state": "model_state.pdparams"}
+    pretrained_resource_files_map = {
+        "model_state": {
+            "ernie":
+            "https://paddlenlp.bj.bcebos.com/models/transformers/ernie/ernie_v1_chn_base.pdparams",
+            "ernie_tiny":
+            "https://paddlenlp.bj.bcebos.com/models/transformers/ernie_tiny/ernie_tiny.pdparams",
+            "ernie_v2_eng_base":
+            "https://paddlenlp.bj.bcebos.com/models/transformers/ernie_v2_base/ernie_v2_eng_base.pdparams",
+            "ernie_v2_eng_large":
+            "https://paddlenlp.bj.bcebos.com/models/transformers/ernie_v2_large/ernie_v2_eng_large.pdparams",
+        }
+    }
+    base_model_prefix = "ernie"
+
+    def init_weights(self, layer):
+        """ Initialization hook """
+        if isinstance(layer, (nn.Linear, nn.Embedding)):
+            # Only dygraph mode is supported for now.
+            # TODO: use truncated_normal, update in place, and make this configurable.
+            layer.weight.set_value(
+                paddle.tensor.normal(
+                    mean=0.0,
+                    std=self.initializer_range
+                    if hasattr(self, "initializer_range") else self.ernie.config["initializer_range"],
+                    shape=layer.weight.shape))
+
+
+@register_base_model
+class ErnieModel(ErniePretrainedModel):
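+    """
+    The bare ERNIE model: embeddings, a stack of Transformer encoder layers and
+    a pooler. `forward` returns `(sequence_output, pooled_output)`.
+    """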
+
+    def __init__(self,
+                 vocab_size,
+                 hidden_size=768,
+                 num_hidden_layers=12,
+                 num_attention_heads=12,
+                 intermediate_size=3072,
+                 hidden_act="gelu",
+                 hidden_dropout_prob=0.1,
+                 attention_probs_dropout_prob=0.1,
+                 max_position_embeddings=512,
+                 type_vocab_size=16,
+                 initializer_range=0.02,
+                 pad_token_id=0):
+        super(ErnieModel, self).__init__()
+        self.pad_token_id = pad_token_id
+        self.initializer_range = initializer_range
+        self.embeddings = ErnieEmbeddings(vocab_size, hidden_size, hidden_dropout_prob, max_position_embeddings,
+                                          type_vocab_size, pad_token_id)
+        encoder_layer = nn.TransformerEncoderLayer(
+            hidden_size,
+            num_attention_heads,
+            intermediate_size,
+            dropout=hidden_dropout_prob,
+            activation=hidden_act,
+            attn_dropout=attention_probs_dropout_prob,
+            act_dropout=0)
+        self.encoder = nn.TransformerEncoder(encoder_layer, num_hidden_layers)
+        self.pooler = ErniePooler(hidden_size)
+        self.apply(self.init_weights)
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None):
+        if attention_mask is None:
+            attention_mask = paddle.unsqueeze(
+                (input_ids == self.pad_token_id).astype(self.pooler.dense.weight.dtype) * -1e9, axis=[1, 2])
+        embedding_output = self.embeddings(
+            input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids)
+        encoder_outputs = self.encoder(embedding_output, attention_mask)
+        sequence_output = encoder_outputs
+        pooled_output = self.pooler(sequence_output)
+        return sequence_output, pooled_output
+
+
+class ErnieForSequenceClassification(ErniePretrainedModel):
+    """
+    Model for sentence (pair) classification task with ERNIE.
+    Args:
+        ernie (ErnieModel): An instance of `ErnieModel`.
+        num_classes (int, optional): The number of classes. Default 2
+        dropout (float, optional): The dropout probability for output of ERNIE.
+            If None, use the same value as `hidden_dropout_prob` of `ErnieModel`
+            instance `ernie`. Default None
+    """
+
+    def __init__(self, ernie, num_classes=2, dropout=None):
+        super(ErnieForSequenceClassification, self).__init__()
+        self.num_classes = num_classes
+        self.ernie = ernie  # passed in as an instance so the caller can configure it
+        self.dropout = nn.Dropout(dropout if dropout is not None else self.ernie.config["hidden_dropout_prob"])
+        self.classifier = nn.Linear(self.ernie.config["hidden_size"], num_classes)
+        self.apply(self.init_weights)
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None):
+        _, pooled_output = self.ernie(
+            input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask)
+
+        pooled_output = self.dropout(pooled_output)
+        logits = self.classifier(pooled_output)
+        return logits
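Review note: each classification head in this commit consumes the pooled output, which the pooler computes by taking the hidden state of the first token and applying a dense layer followed by tanh. The following is a minimal pure-Python sketch of that step, assuming a weight laid out as `[in_features][out_features]`; the function name `pool_first_token` and the toy values are illustrative only.

```python
import math


def pool_first_token(hidden_states, weight, bias):
    """Apply tanh(first_token @ weight + bias) to every sequence in the batch.

    hidden_states: [batch][seq_len][hidden]
    weight:        [hidden][hidden_out]
    bias:          [hidden_out]
    """
    pooled = []
    for sequence in hidden_states:
        first = sequence[0]  # only the first token is used; the rest are ignored
        pooled.append([
            math.tanh(sum(first[i] * weight[i][j] for i in range(len(first))) + bias[j])
            for j in range(len(bias))
        ])
    return pooled


# One sequence of two tokens with hidden size 2, identity weight, zero bias.
pooled = pool_first_token(
    hidden_states=[[[0.0, 1.0], [5.0, 5.0]]],
    weight=[[1.0, 0.0], [0.0, 1.0]],
    bias=[0.0, 0.0],
)
print(pooled)
```

Because only index 0 along the sequence axis is read, the second token's values have no effect on the result.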
--- a/paddlehub/module/modeling_roberta.py
+++ b/paddlehub/module/modeling_roberta.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# FIXME(zhangxuefei): remove this file after paddlenlp is released.
+
+import paddle
+import paddle.nn as nn
+
+from paddlehub.module.nlp_module import PretrainedModel, register_base_model
+
+
+class RobertaEmbeddings(nn.Layer):
+    """
+    Include embeddings from word, position and token_type embeddings
+    """
+
+    def __init__(self,
+                 vocab_size,
+                 hidden_size=768,
+                 hidden_dropout_prob=0.1,
+                 max_position_embeddings=512,
+                 type_vocab_size=16,
+                 pad_token_id=0):
+        super(RobertaEmbeddings, self).__init__()
+        self.word_embeddings = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_token_id)
+        self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size)
+        self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size)
+        self.layer_norm = nn.LayerNorm(hidden_size)
+        self.dropout = nn.Dropout(hidden_dropout_prob)
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None):
+        if position_ids is None:
+            # may need to use the shape op to unify static and dynamic graph behavior
+            seq_length = input_ids.shape[1]
+            position_ids = paddle.arange(0, seq_length, dtype="int64")
+        if token_type_ids is None:
+            token_type_ids = paddle.zeros_like(input_ids, dtype="int64")
+
+        input_embeddings = self.word_embeddings(input_ids)
+        position_embeddings = self.position_embeddings(position_ids)
+        token_type_embeddings = self.token_type_embeddings(token_type_ids)
+
+        embeddings = input_embeddings + position_embeddings + token_type_embeddings
+        embeddings = self.layer_norm(embeddings)
+        embeddings = self.dropout(embeddings)
+        return embeddings
+
+
+class RobertaPooler(nn.Layer):
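+    """
+    Pool the encoder output by passing the hidden state of the first token
+    through a dense layer followed by tanh.
+    """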
+
+    def __init__(self, hidden_size):
+        super(RobertaPooler, self).__init__()
+        self.dense = nn.Linear(hidden_size, hidden_size)
+        self.activation = nn.Tanh()
+
+    def forward(self, hidden_states):
+        # We "pool" the model by simply taking the hidden state corresponding
+        # to the first token.
+        first_token_tensor = hidden_states[:, 0]
+        pooled_output = self.dense(first_token_tensor)
+        pooled_output = self.activation(pooled_output)
+        return pooled_output
+
+
+class RobertaPretrainedModel(PretrainedModel):
+    """
+    An abstract class for pretrained RoBERTa models. It provides RoBERTa related
+    `model_config_file`, `resource_files_names`, `pretrained_resource_files_map`,
+    `pretrained_init_configuration`, `base_model_prefix` for downloading and
+    loading pretrained models. See `PretrainedModel` for more details.
+    """
+
+    model_config_file = "model_config.json"
+    pretrained_init_configuration = {
+        "roberta-wwm-ext": {
+            "attention_probs_dropout_prob": 0.1,
+            "hidden_act": "gelu",
+            "hidden_dropout_prob": 0.1,
+            "hidden_size": 768,
+            "initializer_range": 0.02,
+            "intermediate_size": 3072,
+            "max_position_embeddings": 512,
+            "num_attention_heads": 12,
+            "num_hidden_layers": 12,
+            "type_vocab_size": 2,
+            "vocab_size": 21128,
+            "pad_token_id": 0
+        },
+        "roberta-wwm-ext-large": {
+            "attention_probs_dropout_prob": 0.1,
+            "hidden_act": "gelu",
+            "hidden_dropout_prob": 0.1,
+            "hidden_size": 1024,
+            "initializer_range": 0.02,
+            "intermediate_size": 4096,
+            "max_position_embeddings": 512,
+            "num_attention_heads": 16,
+            "num_hidden_layers": 24,
+            "type_vocab_size": 2,
+            "vocab_size": 21128,
+            "pad_token_id": 0
+        }
+    }
+    resource_files_names = {"model_state": "model_state.pdparams"}
+    pretrained_resource_files_map = {
+        "model_state": {
+            "roberta-wwm-ext":
+            "https://paddlenlp.bj.bcebos.com/models/transformers/roberta_base/roberta_chn_base.pdparams",
+            "roberta-wwm-ext-large":
+            "https://paddlenlp.bj.bcebos.com/models/transformers/roberta_large/roberta_chn_large.pdparams",
+        }
+    }
+    base_model_prefix = "roberta"
+
+    def init_weights(self, layer):
+        """ Initialization hook """
+        if isinstance(layer, (nn.Linear, nn.Embedding)):
+            # Only dygraph mode is supported for now.
+            # TODO: use truncated_normal, update in place, and make this configurable.
+            layer.weight.set_value(
+                paddle.tensor.normal(
+                    mean=0.0,
+                    std=self.initializer_range
+                    if hasattr(self, "initializer_range") else self.roberta.config["initializer_range"],
+                    shape=layer.weight.shape))
+        elif isinstance(layer, nn.LayerNorm):
+            layer._epsilon = 1e-12
+
+
+@register_base_model
+class RobertaModel(RobertaPretrainedModel):
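+    """
+    The bare RoBERTa model: embeddings, a stack of Transformer encoder layers
+    and a pooler. `forward` returns `(sequence_output, pooled_output)`.
+    """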
+
+    def __init__(self,
+                 vocab_size,
+                 hidden_size=768,
+                 num_hidden_layers=12,
+                 num_attention_heads=12,
+                 intermediate_size=3072,
+                 hidden_act="gelu",
+                 hidden_dropout_prob=0.1,
+                 attention_probs_dropout_prob=0.1,
+                 max_position_embeddings=512,
+                 type_vocab_size=16,
+                 initializer_range=0.02,
+                 pad_token_id=0):
+        super(RobertaModel, self).__init__()
+        self.pad_token_id = pad_token_id
+        self.initializer_range = initializer_range
+        self.embeddings = RobertaEmbeddings(vocab_size, hidden_size, hidden_dropout_prob, max_position_embeddings,
+                                            type_vocab_size, pad_token_id)
+        encoder_layer = nn.TransformerEncoderLayer(
+            hidden_size,
+            num_attention_heads,
+            intermediate_size,
+            dropout=hidden_dropout_prob,
+            activation=hidden_act,
+            attn_dropout=attention_probs_dropout_prob,
+            act_dropout=0)
+        self.encoder = nn.TransformerEncoder(encoder_layer, num_hidden_layers)
+        self.pooler = RobertaPooler(hidden_size)
+        self.apply(self.init_weights)
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None):
+        if attention_mask is None:
+            attention_mask = paddle.unsqueeze(
+                (input_ids == self.pad_token_id).astype(self.pooler.dense.weight.dtype) * -1e9, axis=[1, 2])
+        embedding_output = self.embeddings(
+            input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids)
+        encoder_outputs = self.encoder(embedding_output, attention_mask)
+        sequence_output = encoder_outputs
+        pooled_output = self.pooler(sequence_output)
+        return sequence_output, pooled_output
+
+
+class RobertaForSequenceClassification(RobertaPretrainedModel):
+    """
+    Model for sentence (pair) classification task with RoBERTa.
+    Args:
+        roberta (RobertaModel): An instance of `RobertaModel`.
+        num_classes (int, optional): The number of classes. Default 2
+        dropout (float, optional): The dropout probability for output of RoBERTa.
+            If None, use the same value as `hidden_dropout_prob` of `RobertaModel`
+            instance `roberta`. Default None
+    """
+
+    def __init__(self, roberta, num_classes=2, dropout=None):
+        super(RobertaForSequenceClassification, self).__init__()
+        self.num_classes = num_classes
+        self.roberta = roberta  # passed in as an instance so the caller can configure it
+        self.dropout = nn.Dropout(dropout if dropout is not None else self.roberta.config["hidden_dropout_prob"])
+        self.classifier = nn.Linear(self.roberta.config["hidden_size"], num_classes)
+        self.apply(self.init_weights)
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None):
+        _, pooled_output = self.roberta(
+            input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask)
+
+        pooled_output = self.dropout(pooled_output)
+        logits = self.classifier(pooled_output)
+        return logits
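Review note: the `config[...]` lookups in the classification heads depend on constructor arguments being recorded at construction time, which `InitTrackerMeta` in `paddlehub/module/nlp_module.py` does by wrapping `__init__`. The following is a simplified, framework-free sketch of that idea; `InitTracker` and `TinyModel` are hypothetical names, and the sketch copies `kwargs` instead of mutating it, which differs slightly from the code in this commit.

```python
import functools


class InitTracker(type):
    """Metaclass that records constructor arguments on each instance."""

    def __init__(cls, name, bases, attrs):
        init_func = cls.__init__

        @functools.wraps(init_func)
        def wrapped(self, *args, **kwargs):
            init_func(self, *args, **kwargs)
            config = dict(kwargs)  # keyword arguments are stored under their own names
            if args:
                config["init_args"] = args  # positional arguments are stored as a tuple
            config["init_class"] = self.__class__.__name__
            self.init_config = config

        cls.__init__ = wrapped
        super().__init__(name, bases, attrs)


class TinyModel(metaclass=InitTracker):
    def __init__(self, vocab_size, hidden_size=768):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size


model = TinyModel(21128, hidden_size=256)
print(model.init_config)
```

Note that positional arguments end up under `init_args` rather than under their parameter names, so a later lookup by name only works if something maps them back (which appears to be the role of `fn_args_to_dict` in `nlp_module.py`).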
--- a/paddlehub/module/nlp_module.py
+++ b/paddlehub/module/nlp_module.py
-# coding:utf-8
 # Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
 #
-# Licensed under the Apache License, Version 2.0 (the "License"
+# Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
@@ -13,7 +12,333 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

+# FIXME(zhangxuefei): remove this file after paddlenlp is released.

-class DataFormatError(Exception):
-    def __init__(self, *args):
-        self.args = args
+import copy
+import functools
+import inspect
+import io
+import json
+import os
+import six
+
+import paddle
+import paddle.nn as nn
+from paddle.dataset.common import DATA_HOME
+from paddle.utils.download import get_path_from_url
+
+from paddlehub.utils.log import logger
+
+__all__ = [
+    'PretrainedModel',
+    'register_base_model',
+]
+
+
+def fn_args_to_dict(func, *args, **kwargs):
+    """
+    Inspect function `func` and the arguments it is called with, and build a
+    dict mapping argument names to their values (including defaults).
+    """
+    if hasattr(inspect, 'getfullargspec'):
+        (spec_args, spec_varargs, spec_varkw, spec_defaults, _, _, _) = inspect.getfullargspec(func)
+    else:
+        (spec_args, spec_varargs, spec_varkw, spec_defaults) = inspect.getargspec(func)
+    # add positional argument values
+    init_dict = dict(zip(spec_args, args))
+    # add default argument values
+    kwargs_dict = dict(zip(spec_args[-len(spec_defaults):], spec_defaults)) if spec_defaults else {}
+    kwargs_dict.update(kwargs)
+    init_dict.update(kwargs_dict)
+    return init_dict
+
+
+class InitTrackerMeta(type(nn.Layer)):
+    """
+    This metaclass wraps the `__init__` method of a class to add an `init_config`
+    attribute to instances of that class; `init_config` is a dict that tracks
+    the initial configuration. If the class has a `_wrap_init` method, it is
+    hooked after `__init__` and called as `_wrap_init(self, init_fn, init_args)`.
+    InitTrackerMeta is used as the metaclass for pretrained model classes, which
+    are always Layers. Because `type(nn.Layer)` is not `type`, it derives from
+    `type(nn.Layer)` rather than `type` to avoid metaclass conflicts.
+    """
+
+    def __init__(cls, name, bases, attrs):
+        init_func = cls.__init__
+        # If attrs has `__init__`, wrap it using the accessible `_wrap_init`.
+        # Otherwise, there is no need to wrap again since the super cls has been wrapped.
+        # TODO: remove the duplicated tracker if using the super cls `__init__`
+        help_func = getattr(cls, '_wrap_init', None) if '__init__' in attrs else None
+        cls.__init__ = InitTrackerMeta.init_and_track_conf(init_func, help_func)
+        super(InitTrackerMeta, cls).__init__(name, bases, attrs)
+
+    @staticmethod
+    def init_and_track_conf(init_func, help_func=None):
+        """
+        wraps `init_func` which is `__init__` method of a class to add `init_config`
+        attribute for instances of that class.
+        Args:
+            init_func (callable): It should be the `__init__` method of a class.
+            help_func (callable, optional): If provided, it would be hooked after
+                `init_func` and called as `_wrap_init(self, init_func, *init_args, **init_kwargs)`.
+                Default None.
+
+        Returns:
+            function: the wrapped function
+        """
+
+        @functools.wraps(init_func)
+        def __impl__(self, *args, **kwargs):
+            # keep full configuration
+            init_func(self, *args, **kwargs)
+            # call the helper registered via `_wrap_init`
+            if help_func:
+                help_func(self, init_func, *args, **kwargs)
+            self.init_config = kwargs
+            if args:
+                kwargs['init_args'] = args
+            kwargs['init_class'] = self.__class__.__name__
+
+        return __impl__
+
+
+def register_base_model(cls):
+    """
+    Add a `base_model_class` attribute for the base class of decorated class,
+    representing the base model class in derived classes of the same architecture.
+    Args:
+        cls (class): the model class being decorated
+    """
+    base_cls = cls.__bases__[0]
+    assert issubclass(base_cls,
+                      PretrainedModel), "`register_base_model` should be used on subclasses of PretrainedModel."
+    base_cls.base_model_class = cls
+    return cls
+
+
+@six.add_metaclass(InitTrackerMeta)
+class PretrainedModel(nn.Layer):
+    """
+    The base class for all pretrained models. It provides some attributes and
+    common methods for all pretrained models, including attributes `init_config`,
+    `config` for initialized arguments and methods for saving, loading.
+    It also includes some class attributes (should be set by derived classes):
+    - `model_config_file` (str): represents the file name for saving and loading
+      model configuration, its value is `model_config.json`.
+    - `resource_files_names` (dict): use this to map resources to specific file
+      names for saving and loading.
+    - `pretrained_resource_files_map` (dict): The dict has the same keys as
+      `resource_files_names`, the values are also dict mapping specific pretrained
+      model name to URL linking to pretrained model.
+    - `pretrained_init_configuration` (dict): The dict has pretrained model names
+      as keys, and the values are also dict preserving corresponding configuration
+      for model initialization.
+
+    - `base_model_prefix` (str): represents the attribute associated with the
+      base model in derived classes of the same architecture adding layers on
+      top of the base model.
+    """
+    model_config_file = "model_config.json"
+    pretrained_init_configuration = {}
+    # TODO: more flexible resource handle, namedtuple with fields as:
+    # resource_name, saved_file, handle_name_for_load(None for used as __init__
+    # arguments), handle_name_for_save
+    resource_files_names = {"model_state": "model_state.pdparams"}
+    pretrained_resource_files_map = {}
+    base_model_prefix = ""
+
+    def _wrap_init(self, original_init, *args, **kwargs):
+        """
+        It would be hooked after `__init__` to add a dict including arguments of
+        `__init__` as an attribute named `config` of the pretrained model instance.
+        """
+        init_dict = fn_args_to_dict(original_init, *args, **kwargs)
+        self.config = init_dict
+
+    @property
+    def base_model(self):
+        return getattr(self, self.base_model_prefix, self)
+
+    @property
+    def model_name_list(self):
+        return list(self.pretrained_init_configuration.keys())
+
+    def get_input_embeddings(self):
+        base_model = getattr(self, self.base_model_prefix, self)
+        if base_model is not self:
+            return base_model.get_input_embeddings()
+        else:
+            raise NotImplementedError
+
+    def get_output_embeddings(self):
+        return None  # Overwrite for models with output embeddings
+
+    @classmethod
+    def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
+        """
+        Instantiate an instance of `PretrainedModel` from a predefined
+        model specified by name or path.
+        Args:
+            pretrained_model_name_or_path (str): A name of or a file path to a
+                pretrained model.
+            *args (tuple): positional arguments for `__init__`. If provided, use
+                these as positional argument values for model initialization.
+            **kwargs (dict): keyword arguments for `__init__`. If provided, use
+                these to update pre-defined keyword argument values for model
+                initialization.
+        Returns:
+            PretrainedModel: An instance of PretrainedModel.
+        """
+        pretrained_models = list(cls.pretrained_init_configuration.keys())
+        resource_files = {}
+        init_configuration = {}
+        if pretrained_model_name_or_path in pretrained_models:
+            for file_id, map_list in cls.pretrained_resource_files_map.items():
+                resource_files[file_id] = map_list[pretrained_model_name_or_path]
+            init_configuration = copy.deepcopy(cls.pretrained_init_configuration[pretrained_model_name_or_path])
+        else:
+            if os.path.isdir(pretrained_model_name_or_path):
+                for file_id, file_name in cls.resource_files_names.items():
+                    full_file_name = os.path.join(pretrained_model_name_or_path, file_name)
+                    resource_files[file_id] = full_file_name
+                resource_files["model_config_file"] = os.path.join(pretrained_model_name_or_path, cls.model_config_file)
+            else:
+                raise ValueError("{}.from_pretrained() must be called with a model identifier or the "
+                                 "path to a directory. The supported model "
+                                 "identifiers are as follows: {}".format(cls.__name__,
+                                                                         cls.pretrained_init_configuration.keys()))
+        # FIXME(chenzeyu01): We should use another data path for storing model
+        default_root = os.path.join(DATA_HOME, pretrained_model_name_or_path)
+        resolved_resource_files = {}
+        for file_id, file_path in resource_files.items():
+            path = os.path.join(default_root, file_path.split('/')[-1]) if file_path is not None else None
+            if file_path is None or os.path.isfile(file_path):
+                resolved_resource_files[file_id] = file_path
+            elif os.path.exists(path):
+                logger.info("Already cached %s" % path)
+                resolved_resource_files[file_id] = path
+            else:
+                logger.info("Downloading %s and saved to %s" % (file_path, default_root))
+                resolved_resource_files[file_id] = get_path_from_url(file_path, default_root)
+
+        # Prepare model initialization kwargs
+        # Did we save some inputs and kwargs to reload?
+        model_config_file = resolved_resource_files.pop("model_config_file", None)
+        if model_config_file is not None:
+            with io.open(model_config_file, encoding="utf-8") as f:
+                init_kwargs = json.load(f)
+        else:
+            init_kwargs = init_configuration
+        # positional args are stored in kwargs; it may be better not to include them
+        init_args = init_kwargs.pop("init_args", ())
+        # class name corresponds to this configuration
+        init_class = init_kwargs.pop("init_class", cls.base_model_class.__name__)
+
+        # Check if the loaded config matches the current model class's __init__
+        # arguments. If not match, the loaded config is for the base model class.
+        if init_class == cls.base_model_class.__name__:
+            base_args = init_args
+            base_kwargs = init_kwargs
+            derived_args = ()
+            derived_kwargs = {}
+            base_arg_index = None
+        else:  # extract config for base model
+            derived_args = list(init_args)
+            derived_kwargs = init_kwargs
+            for i, arg in enumerate(init_args):
+                if isinstance(arg, dict) and "init_class" in arg:
+                    assert arg.pop("init_class") == cls.base_model_class.__name__, (
+                        "pretrained base model should be {}").format(cls.base_model_class.__name__)
+                    base_arg_index = i
+                    break
+            for arg_name, arg in init_kwargs.items():
+                if isinstance(arg, dict) and "init_class" in arg:
+                    assert arg.pop("init_class") == cls.base_model_class.__name__, (
+                        "pretrained base model should be {}").format(cls.base_model_class.__name__)
+                    base_arg_index = arg_name
+                    break
+            base_args = arg.pop("init_args", ())
+            base_kwargs = arg
+        if cls == cls.base_model_class:
+            # Update with newly provided args and kwargs for base model
+            base_args = base_args if not args else args
+            base_kwargs.update(kwargs)
+            model = cls(*base_args, **base_kwargs)
+        else:
+            # Update with newly provided args and kwargs for derived model
+            base_model = cls.base_model_class(*base_args, **base_kwargs)
+            if base_arg_index is not None:
+                derived_args[base_arg_index] = base_model
+            else:
+                derived_args = (base_model, )  # assume at the first position
+            derived_args = derived_args if not args else args
+            derived_kwargs.update(kwargs)
+            model = cls(*derived_args, **derived_kwargs)
+
+        # Maybe need more ways to load resources.
+        weight_path = list(resolved_resource_files.values())[0]
+        assert weight_path.endswith(".pdparams"), "suffix of weight must be .pdparams"
+        state_dict = paddle.load(weight_path)
+
+        # Make sure we are able to load base models as well as derived models
+        # (with heads)
+        start_prefix = ""
+        model_to_load = model
+        state_to_load = state_dict
+        unexpected_keys = []
+        missing_keys = []
+        if not hasattr(model, cls.base_model_prefix) and any(
+                s.startswith(cls.base_model_prefix) for s in state_dict.keys()):
+            # base model
+            state_to_load = {}
+            start_prefix = cls.base_model_prefix + "."
+            for k, v in state_dict.items():
+                if k.startswith(cls.base_model_prefix):
+                    state_to_load[k[len(start_prefix):]] = v
+                else:
+                    unexpected_keys.append(k)
+        if hasattr(model,
+                   cls.base_model_prefix) and not any(s.startswith(cls.base_model_prefix) for s in state_dict.keys()):
+            # derived model (base model with heads)
+            model_to_load = getattr(model, cls.base_model_prefix)
+            for k in model.state_dict().keys():
+                if not k.startswith(cls.base_model_prefix):
+                    missing_keys.append(k)
+        if len(missing_keys) > 0:
+            logger.info("Weights of {} not initialized from pretrained model: {}".format(
+                model.__class__.__name__, missing_keys))
+        if len(unexpected_keys) > 0:
+            logger.info("Weights from pretrained model not used in {}: {}".format(model.__class__.__name__,
+                                                                                  unexpected_keys))
+        model_to_load.set_state_dict(state_to_load)
+        if paddle.in_dynamic_mode():
+            return model
+        return model, state_to_load
+
+    def save_pretrained(self, save_directory):
+        """
+        Save model configuration and related resources (model state) to files
+        under `save_directory`.
+        Args:
+            save_directory (str): Directory to save files into.
+        """
+        assert os.path.isdir(save_directory), "Saving directory ({}) should be a directory".format(save_directory)
+        # save model config
+        model_config_file = os.path.join(save_directory, self.model_config_file)
+        model_config = self.init_config
+        # If init_config contains a Layer, use the layer's init_config to save
+        for key, value in model_config.items():
+            if key == "init_args":
+                args = []
+                for arg in value:
+                    args.append(arg.init_config if isinstance(arg, PretrainedModel) else arg)
+                model_config[key] = tuple(args)
+            elif isinstance(value, PretrainedModel):
+                model_config[key] = value.init_config
+        with io.open(model_config_file, "w", encoding="utf-8") as f:
+            f.write(json.dumps(model_config, ensure_ascii=False))
+        # save model
+        file_name = os.path.join(save_directory, list(self.resource_files_names.values())[0])
+        paddle.save(self.state_dict(), file_name)
--- a/paddlehub/text/bert_tokenizer.py
+++ b/paddlehub/text/bert_tokenizer.py
@@ -20,7 +20,7 @@ import pickle
 import unicodedata
 from typing import Dict, List, Optional, Union, Tuple

-import sentencepiece as spm
+from paddle.utils import try_import

 from paddlehub.text.utils import load_vocab, is_whitespace, is_control, is_punctuation, whitespace_tokenize, is_chinese_char

@@ -509,9 +509,9 @@ class BertTokenizer(object):
               max_seq_len: Optional[int] = None,
               pad_to_max_seq_len: bool = True,
               truncation_strategy: str = 'longest_first',
-               return_position_ids: bool = True,
+               return_position_ids: bool = False,
               return_segment_ids: bool = True,
-               return_input_mask: bool = True,
+               return_input_mask: bool = False,
               return_length: bool = True,
               return_overflowing_tokens: bool = False,
               return_special_tokens_mask: bool = False):
@@ -541,11 +541,11 @@ class BertTokenizer(object):
                - 'only_first': Only truncate the first sequence
                - 'only_second': Only truncate the second sequence
                - 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_seq_len)
-            return_position_ids (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            return_position_ids (:obj:`bool`, `optional`, defaults to :obj:`False`):
-                Set to True to return tokens position ids (default True).
+                Set to True to return tokens position ids (default False).
            return_segment_ids (:obj:`bool`, `optional`, defaults to :obj:`True`):
                Whether to return token type IDs.
-            return_input_mask (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            return_input_mask (:obj:`bool`, `optional`, defaults to :obj:`False`):
                Whether to return the attention mask.
            return_length (:obj:`int`, defaults to :obj:`True`):
                If set the resulting dictionary will include the length of each encoded inputs
@@ -612,7 +612,6 @@ class BertTokenizer(object):
                encoded_inputs['num_truncated_tokens'] = total_len - max_seq_len

        # Add special tokens
-
        sequence = self.build_inputs_with_special_tokens(ids, pair_ids)
        segment_ids = self.create_segment_ids_from_sequences(ids, pair_ids)

@@ -704,6 +703,7 @@ class ErnieTinyTokenizer(BertTokenizer):
            cls_token: str = '[CLS]',
            mask_token: str = '[MASK]',
    ):
+        mod = try_import('sentencepiece')
        self.unk_token = unk_token
        self.sep_token = sep_token
        self.pad_token = pad_token
@@ -719,7 +719,7 @@ class ErnieTinyTokenizer(BertTokenizer):

        # Here is the difference with BertTokenizer.
        self.dict = pickle.load(open(word_dict_path, 'rb'))
-        self.sp_model = spm.SentencePieceProcessor()
+        self.sp_model = mod.SentencePieceProcessor()
        self.window_size = 5
        self.sp_model.Load(spm_path)