Unverified commit 63e1893b authored by Guo Sheng, committed by GitHub

Add BERT. (#5125)

Parent 4d87afd6
# BERT

## Model Introduction

[BERT](https://arxiv.org/abs/1810.04805) (Bidirectional Encoder Representations from Transformers) uses the [Transformer](https://arxiv.org/abs/1706.03762) encoder as its basic building block. It is pre-trained on large-scale unlabeled text with two objectives, Masked Language Model and Next Sentence Prediction, producing a general-purpose semantic representation model that incorporates bidirectional context. Starting from this pre-trained representation and adding a simple task-specific output layer, fine-tuning is usually enough for downstream NLP tasks, and typically works better than training a model on the downstream task directly. BERT achieved SOTA results on the [GLUE benchmark](https://gluebenchmark.com/tasks) at the time of its release.

This project is an open-source implementation of BERT on Paddle 2.0. It includes the pre-training code and the fine-tuning code for the [GLUE benchmark](https://gluebenchmark.com/tasks).
## Quick Start

### Installation

* Install PaddlePaddle

This project requires PaddlePaddle 2.0rc1 or later. Please refer to the [installation guide](http://www.paddlepaddle.org/#quick-start).

* Install PaddleNLP

```shell
pip install "paddlenlp>=2.0.0b"
```
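To confirm the installation, a quick check such as the following can be used (a minimal sketch; it assumes `paddlenlp` exposes a `__version__` attribute, as recent releases do):

```python
import paddle
import paddlenlp

# Both imports should succeed and report the expected versions.
print(paddle.__version__)     # 2.0.0rc1 or later
print(paddlenlp.__version__)  # 2.0.0b or later
```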
### Data Preparation

#### Pre-training data

`create_pretraining_data.py` builds the data needed by the pre-training program. It takes plain text files as input (one sentence per line, with blank lines separating documents; some sample data is provided under the data directory), tokenizes them with the BERT tokenizer, generates positive/negative sentence-pair samples and masked tokens, and finally writes the result to an HDF5 file. Usage:
```shell
python create_pretraining_data.py \
--input_file=data/sample_text.txt \
--output_file=data/training_data.hdf5 \
--bert_model=bert-base-uncased \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--masked_lm_prob=0.15 \
--random_seed=12345 \
--dupe_factor=5
```
The parameters are:

- `input_file` the input file. A directory may also be given, in which case all `.txt` files in that directory are included.
- `output_file` the output file.
- `bert_model` tokenize with the tokenizer of the specified BERT model.
- `max_seq_length` the maximum sequence length; longer sequences are truncated and shorter ones are padded.
- `max_predictions_per_seq` the maximum number of masked tokens per sequence.
- `masked_lm_prob` the probability that each token is masked.
- `random_seed` the random seed.
- `dupe_factor` how many times the input data is duplicated, each time with newly generated random masks.

This data generation script can also be used to process in-domain data for continued pre-training. To reproduce the English Wikipedia and BookCorpus data used for pre-training in the BERT paper, follow the processing steps described [here](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT); the resulting data can be fed directly into the pre-training program in this project.
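To sanity-check the generated file, it can be inspected with `h5py`. This is a minimal sketch assuming the output path from the example above; the dataset names are the ones written by `write_instance_to_example_file` in `create_pretraining_data.py`.

```python
import h5py

# Open the file produced by create_pretraining_data.py and print the shape
# and dtype of each dataset it contains.
with h5py.File("data/training_data.hdf5", "r") as f:
    for key in ("input_ids", "input_mask", "segment_ids",
                "masked_lm_positions", "masked_lm_ids",
                "next_sentence_labels"):
        print(key, f[key].shape, f[key].dtype)
    # First example's token ids, zero-padded to max_seq_length.
    print(f["input_ids"][0])
```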
#### Fine-tuning data

##### GLUE benchmark data

The datasets for the GLUE benchmark tasks are provided through the paddlenlp API, so no manual preparation is needed; they are downloaded automatically when fine-tuning is launched with `run_glue.py`. A minimal loading sketch follows.
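The sketch below loads SST-2 through the same `paddlenlp.datasets` classes that `run_glue.py` uses (API as of the paddlenlp version targeted by this project; the splits download automatically on first access):

```python
from paddlenlp.datasets import GlueSST2

# Same dataset API that run_glue.py uses.
train_ds, dev_ds = GlueSST2.get_datasets(["train", "dev"])
print(len(train_ds), len(dev_ds))
print(train_ds.get_labels())  # label list for SST-2
```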
### Run Pre-training
```shell
python -u ./run_pretrain.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--max_predictions_per_seq 20 \
--batch_size 32 \
--learning_rate 1e-4 \
--weight_decay 1e-2 \
--adam_epsilon 1e-6 \
--warmup_steps 10000 \
--num_train_epochs 3 \
--input_dir data/ \
--output_dir pretrained_models/ \
--logging_steps 1 \
--save_steps 20000 \
--max_steps 1000000 \
--n_gpu 1
```
The parameters are:

- `model_type` the model type; set it to bert when using the BERT model.
- `model_name_or_path` a model of a particular configuration, with its corresponding pre-trained weights and tokenizer. A local directory containing the model files can also be given here.
- `max_predictions_per_seq` the maximum number of masked tokens per sequence; keep it consistent with the setting used when creating the pre-training data.
- `batch_size` the number of samples per iteration **on each card**.
- `learning_rate` the base learning rate; it is multiplied by the value produced by the learning rate scheduler to obtain the current learning rate (a sketch of the resulting schedule follows this list).
- `weight_decay` the weight decay coefficient used by the AdamW optimizer.
- `adam_epsilon` the epsilon value used by the AdamW optimizer.
- `warmup_steps` the number of warmup steps, during which the learning rate increases linearly from 0 to the base learning rate.
- `num_train_epochs` the number of training epochs.
- `input_dir` the input data directory; every file in it whose name contains "training" is used as training data.
- `output_dir` the directory where models are saved.
- `logging_steps` the logging interval.
- `save_steps` the interval for saving and evaluating the model.
- `max_steps` the maximum number of training steps. If training for `num_train_epochs` epochs would take more steps than this, training stops early once `max_steps` is reached.
- `n_gpu` the number of GPUs to use. Set it to the desired number for multi-GPU training; set it to 0 to use the CPU.
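For reference, `learning_rate` and `warmup_steps` interact through the `LambdaDecay` schedule in `run_pretrain.py`: the learning rate rises linearly from 0 to the base value over the warmup steps, then decays linearly to 0 by the last training step. A standalone sketch of that multiplier (plain Python, mirroring the lambda in the training script):

```python
def lr_multiplier(current_step, num_warmup_steps, num_training_steps):
    # Linear warmup from 0 to 1, then linear decay back to 0.
    if current_step < num_warmup_steps:
        return float(current_step) / float(max(1, num_warmup_steps))
    return max(0.0, float(num_training_steps - current_step) /
               float(max(1, num_training_steps - num_warmup_steps)))

# With the command above: base lr 1e-4, 10000 warmup steps, 1e6 total steps.
for step in (0, 5000, 10000, 500000, 1000000):
    print(step, 1e-4 * lr_multiplier(step, 10000, 1000000))
```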
### Run Fine-tuning

Taking the SST-2 task from GLUE as an example, fine-tuning can be started as follows:
```shell
python -u ./run_glue.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--task_name SST-2 \
--max_seq_length 128 \
--batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3 \
--logging_steps 1 \
--save_steps 500 \
--output_dir ./tmp/ \
--n_gpu 1
```
The parameters are:

- `model_type` the model type; set it to bert when using the BERT model.
- `model_name_or_path` a model of a particular configuration, with its corresponding pre-trained weights and tokenizer. A local directory containing the model files can also be given here.
- `task_name` the task to fine-tune on.
- `max_seq_length` the maximum sequence length; longer sequences are truncated.
- `batch_size` the number of samples per iteration **on each card**.
- `learning_rate` the base learning rate; it is multiplied by the value produced by the learning rate scheduler to obtain the current learning rate.
- `num_train_epochs` the number of training epochs.
- `logging_steps` the logging interval.
- `save_steps` the interval for saving and evaluating the model.
- `output_dir` the directory where models are saved (see the loading sketch after this list).
- `n_gpu` the number of GPUs to use. Set it to the desired number for multi-GPU training; set it to 0 to use the CPU.
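After fine-tuning, the checkpoints written by `save_pretrained` under `output_dir` can be reloaded for inference. The sketch below is illustrative only: the checkpoint directory name is hypothetical (the script names directories `<task>_ft_model_<step>.pdparams`), and it assumes the tokenizer's `__call__` returns a token list, as in the paddlenlp version targeted here.

```python
import paddle
from paddlenlp.transformers import BertForSequenceClassification, BertTokenizer

# Hypothetical checkpoint directory produced by run_glue.py; adjust the path
# to whatever was actually written under --output_dir.
ckpt_dir = "./tmp/sst-2_ft_model_500.pdparams"

tokenizer = BertTokenizer.from_pretrained(ckpt_dir)
model = BertForSequenceClassification.from_pretrained(ckpt_dir)
model.eval()

# Tokenize a single sentence and add the special tokens, mirroring
# convert_example in run_glue.py.
tokens = [tokenizer.cls_token] + tokenizer("a fantastic and moving film") + [tokenizer.sep_token]
input_ids = paddle.to_tensor([tokenizer.convert_tokens_to_ids(tokens)])

with paddle.no_grad():
    logits = model(input_ids)
print(paddle.argmax(logits, axis=-1).numpy())  # predicted label index
```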
After fine-tuning `bert-base-uncased` on each GLUE task, the results on the dev sets are as follows:
| Task | Metric | Result |
|:-----:|:----------------------------:|:-----------------:|
| SST-2 | Accuracy | 0.92660 |
| QNLI | Accuracy | 0.91781 |
| CoLA | Matthews corr | 0.59557 |
| MRPC | F1/Accuracy | 0.91667/0.88235 |
| STS-B | Pearson/Spearman corr | 0.88847/0.88350 |
| QQP | Accuracy/F1 | 0.90581/0.87347 |
| MNLI | Matched acc/MisMatched acc | 0.84422/0.84825 |
| RTE | Accuracy | 0.711191 |
# coding=utf-8
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Create masked LM/next sentence masked_lm TF examples for BERT."""
from __future__ import absolute_import, division, print_function, unicode_literals
import argparse
import logging
import os
import random
from io import open
import h5py
import numpy as np
from tqdm import tqdm
from paddlenlp.transformers import BertTokenizer
from paddlenlp.transformers.tokenizer_utils import convert_to_unicode
import random
import collections
class TrainingInstance(object):
"""A single training instance (sentence pair)."""
def __init__(self, tokens, segment_ids, masked_lm_positions,
masked_lm_labels, is_random_next):
self.tokens = tokens
self.segment_ids = segment_ids
self.is_random_next = is_random_next
self.masked_lm_positions = masked_lm_positions
self.masked_lm_labels = masked_lm_labels
def write_instance_to_example_file(instances, tokenizer, max_seq_length,
max_predictions_per_seq, output_file):
"""Create TF example files from `TrainingInstance`s."""
total_written = 0
features = collections.OrderedDict()
num_instances = len(instances)
features["input_ids"] = np.zeros([num_instances, max_seq_length],
dtype="int32")
features["input_mask"] = np.zeros([num_instances, max_seq_length],
dtype="int32")
features["segment_ids"] = np.zeros([num_instances, max_seq_length],
dtype="int32")
features["masked_lm_positions"] = np.zeros(
[num_instances, max_predictions_per_seq], dtype="int32")
features["masked_lm_ids"] = np.zeros(
[num_instances, max_predictions_per_seq], dtype="int32")
features["next_sentence_labels"] = np.zeros(num_instances, dtype="int32")
for inst_index, instance in enumerate(tqdm(instances)):
input_ids = tokenizer.convert_tokens_to_ids(instance.tokens)
input_mask = [1] * len(input_ids)
segment_ids = list(instance.segment_ids)
assert len(input_ids) <= max_seq_length
while len(input_ids) < max_seq_length:
input_ids.append(0)
input_mask.append(0)
segment_ids.append(0)
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
masked_lm_positions = list(instance.masked_lm_positions)
masked_lm_ids = tokenizer.convert_tokens_to_ids(
instance.masked_lm_labels)
masked_lm_weights = [1.0] * len(masked_lm_ids)
while len(masked_lm_positions) < max_predictions_per_seq:
masked_lm_positions.append(0)
masked_lm_ids.append(0)
masked_lm_weights.append(0.0)
next_sentence_label = 1 if instance.is_random_next else 0
features["input_ids"][inst_index] = input_ids
features["input_mask"][inst_index] = input_mask
features["segment_ids"][inst_index] = segment_ids
features["masked_lm_positions"][inst_index] = masked_lm_positions
features["masked_lm_ids"][inst_index] = masked_lm_ids
features["next_sentence_labels"][inst_index] = next_sentence_label
total_written += 1
print("saving data")
f = h5py.File(output_file, 'w')
f.create_dataset("input_ids",
data=features["input_ids"],
dtype='i4',
compression='gzip')
f.create_dataset("input_mask",
data=features["input_mask"],
dtype='i1',
compression='gzip')
f.create_dataset("segment_ids",
data=features["segment_ids"],
dtype='i1',
compression='gzip')
f.create_dataset("masked_lm_positions",
data=features["masked_lm_positions"],
dtype='i4',
compression='gzip')
f.create_dataset("masked_lm_ids",
data=features["masked_lm_ids"],
dtype='i4',
compression='gzip')
f.create_dataset("next_sentence_labels",
data=features["next_sentence_labels"],
dtype='i1',
compression='gzip')
f.flush()
f.close()
def create_training_instances(input_files, tokenizer, max_seq_length,
dupe_factor, short_seq_prob, masked_lm_prob,
max_predictions_per_seq, rng):
"""Create `TrainingInstance`s from raw text."""
all_documents = [[]]
# Input file format:
# (1) One sentence per line. These should ideally be actual sentences, not
# entire paragraphs or arbitrary spans of text. (Because we use the
# sentence boundaries for the "next sentence prediction" task).
# (2) Blank lines between documents. Document boundaries are needed so
# that the "next sentence prediction" task doesn't span between documents.
for input_file in input_files:
print("creating instance from {}".format(input_file))
with open(input_file, "r") as reader:
while True:
line = convert_to_unicode(reader.readline())
if not line:
break
line = line.strip()
# Empty lines are used as document delimiters
if not line:
all_documents.append([])
# tokens = tokenizer.tokenize(line)
tokens = tokenizer(line)
if tokens:
all_documents[-1].append(tokens)
# Remove empty documents
all_documents = [x for x in all_documents if x]
rng.shuffle(all_documents)
# vocab_words = list(tokenizer.vocab.keys())
vocab_words = list(tokenizer.vocab.token_to_idx.keys())
instances = []
for _ in range(dupe_factor):
for document_index in range(len(all_documents)):
instances.extend(
create_instances_from_document(all_documents, document_index,
max_seq_length, short_seq_prob,
masked_lm_prob,
max_predictions_per_seq,
vocab_words, rng))
rng.shuffle(instances)
return instances
def create_instances_from_document(all_documents, document_index,
max_seq_length, short_seq_prob,
masked_lm_prob, max_predictions_per_seq,
vocab_words, rng):
"""Creates `TrainingInstance`s for a single document."""
document = all_documents[document_index]
# Account for [CLS], [SEP], [SEP]
max_num_tokens = max_seq_length - 3
# We *usually* want to fill up the entire sequence since we are padding
# to `max_seq_length` anyways, so short sequences are generally wasted
# computation. However, we *sometimes*
# (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
# sequences to minimize the mismatch between pre-training and fine-tuning.
# The `target_seq_length` is just a rough target however, whereas
# `max_seq_length` is a hard limit.
target_seq_length = max_num_tokens
if rng.random() < short_seq_prob:
target_seq_length = rng.randint(2, max_num_tokens)
# We DON'T just concatenate all of the tokens from a document into a long
# sequence and choose an arbitrary split point because this would make the
# next sentence prediction task too easy. Instead, we split the input into
# segments "A" and "B" based on the actual "sentences" provided by the user
# input.
instances = []
current_chunk = []
current_length = 0
i = 0
while i < len(document):
segment = document[i]
current_chunk.append(segment)
current_length += len(segment)
if i == len(document) - 1 or current_length >= target_seq_length:
if current_chunk:
# `a_end` is how many segments from `current_chunk` go into the `A`
# (first) sentence.
a_end = 1
if len(current_chunk) >= 2:
a_end = rng.randint(1, len(current_chunk) - 1)
tokens_a = []
for j in range(a_end):
tokens_a.extend(current_chunk[j])
tokens_b = []
# Random next
is_random_next = False
if len(current_chunk) == 1 or rng.random() < 0.5:
is_random_next = True
target_b_length = target_seq_length - len(tokens_a)
# This should rarely go for more than one iteration for large
# corpora. However, just to be careful, we try to make sure that
# the random document is not the same as the document
# we're processing.
for _ in range(10):
random_document_index = rng.randint(
0,
len(all_documents) - 1)
if random_document_index != document_index:
break
# If the randomly picked document is the same as the current document
if random_document_index == document_index:
is_random_next = False
random_document = all_documents[random_document_index]
random_start = rng.randint(0, len(random_document) - 1)
for j in range(random_start, len(random_document)):
tokens_b.extend(random_document[j])
if len(tokens_b) >= target_b_length:
break
# We didn't actually use these segments so we "put them back" so
# they don't go to waste.
num_unused_segments = len(current_chunk) - a_end
i -= num_unused_segments
# Actual next
else:
is_random_next = False
for j in range(a_end, len(current_chunk)):
tokens_b.extend(current_chunk[j])
truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng)
assert len(tokens_a) >= 1
assert len(tokens_b) >= 1
tokens = []
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for token in tokens_a:
tokens.append(token)
segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)
for token in tokens_b:
tokens.append(token)
segment_ids.append(1)
tokens.append("[SEP]")
segment_ids.append(1)
(tokens, masked_lm_positions,
masked_lm_labels) = create_masked_lm_predictions(
tokens, masked_lm_prob, max_predictions_per_seq,
vocab_words, rng)
instance = TrainingInstance(
tokens=tokens,
segment_ids=segment_ids,
is_random_next=is_random_next,
masked_lm_positions=masked_lm_positions,
masked_lm_labels=masked_lm_labels)
instances.append(instance)
current_chunk = []
current_length = 0
i += 1
return instances
MaskedLmInstance = collections.namedtuple("MaskedLmInstance",
["index", "label"])
def create_masked_lm_predictions(tokens, masked_lm_prob,
max_predictions_per_seq, vocab_words, rng):
"""Creates the predictions for the masked LM objective."""
cand_indexes = []
for (i, token) in enumerate(tokens):
if token == "[CLS]" or token == "[SEP]":
continue
cand_indexes.append(i)
rng.shuffle(cand_indexes)
output_tokens = list(tokens)
num_to_predict = min(max_predictions_per_seq,
max(1, int(round(len(tokens) * masked_lm_prob))))
masked_lms = []
covered_indexes = set()
for index in cand_indexes:
if len(masked_lms) >= num_to_predict:
break
if index in covered_indexes:
continue
covered_indexes.add(index)
masked_token = None
# 80% of the time, replace with [MASK]
if rng.random() < 0.8:
masked_token = "[MASK]"
else:
# 10% of the time, keep original
if rng.random() < 0.5:
masked_token = tokens[index]
# 10% of the time, replace with random word
else:
masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]
output_tokens[index] = masked_token
masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
masked_lms = sorted(masked_lms, key=lambda x: x.index)
masked_lm_positions = []
masked_lm_labels = []
for p in masked_lms:
masked_lm_positions.append(p.index)
masked_lm_labels.append(p.label)
return (output_tokens, masked_lm_positions, masked_lm_labels)
def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng):
"""Truncates a pair of sequences to a maximum sequence length."""
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_num_tokens:
break
trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
assert len(trunc_tokens) >= 1
# We want to sometimes truncate from the front and sometimes from the
# back to add more randomness and avoid biases.
if rng.random() < 0.5:
del trunc_tokens[0]
else:
trunc_tokens.pop()
def main():
parser = argparse.ArgumentParser()
parser.add_argument(
"--input_file",
default=None,
type=str,
required=True,
help=
"The input train corpus. can be directory with .txt files or a path to a single file"
)
parser.add_argument(
"--output_file",
default=None,
type=str,
required=True,
help="The output file where created hdf5 formatted data will be written.")
parser.add_argument("--vocab_file",
default=None,
type=str,
required=False,
help="The vocabulary the BERT model will train on. "
"Use bert_model argument would ignore this. "
"The bert_model argument is recommended.")
parser.add_argument(
"--do_lower_case",
action='store_true',
default=True,
help=
"Whether to lower case the input text. True for uncased models, False for cased models. "
"Use bert_model argument would ignore this. The bert_model argument is recommended."
)
parser.add_argument(
"--bert_model",
default="bert-base-uncased",
type=str,
required=False,
help="Bert pre-trained model selected in the list: bert-base-uncased, "
"bert-large-uncased, bert-base-cased, bert-base-multilingual, bert-base-chinese."
"If provided, use the pre-trained model used tokenizer to create data "
"and ignore vocab_file and do_lower_case.")
## Other parameters
#int
parser.add_argument(
"--max_seq_length",
default=128,
type=int,
help=
"The maximum total input sequence length after WordPiece tokenization. \n"
"Sequences longer than this will be truncated, and sequences shorter \n"
"than this will be padded.")
parser.add_argument(
"--dupe_factor",
default=10,
type=int,
help=
"Number of times to duplicate the input data (with different masks).")
parser.add_argument(
"--max_predictions_per_seq",
default=20,
type=int,
help="Maximum number of masked LM predictions per sequence.")
# floats
parser.add_argument("--masked_lm_prob",
default=0.15,
type=float,
help="Masked LM probability.")
parser.add_argument(
"--short_seq_prob",
default=0.1,
type=float,
help=
"Probability to create a sequence shorter than maximum sequence length")
parser.add_argument('--random_seed',
type=int,
default=12345,
help="random seed for initialization")
args = parser.parse_args()
print(args)
if args.bert_model:
tokenizer = BertTokenizer.from_pretrained(args.bert_model)
else:
assert args.vocab_file, (
"vocab_file must be set If bert_model is not provided.")
tokenizer = BertTokenizer(args.vocab_file,
do_lower_case=args.do_lower_case)
input_files = []
if os.path.isfile(args.input_file):
input_files.append(args.input_file)
elif os.path.isdir(args.input_file):
input_files = [
os.path.join(args.input_file, f)
for f in os.listdir(args.input_file)
if (os.path.isfile(os.path.join(args.input_file, f))
and f.endswith('.txt'))
]
else:
raise ValueError("{} is not a valid path".format(args.input_file))
rng = random.Random(args.random_seed)
instances = create_training_instances(input_files, tokenizer,
args.max_seq_length, args.dupe_factor,
args.short_seq_prob,
args.masked_lm_prob,
args.max_predictions_per_seq, rng)
output_file = args.output_file
write_instance_to_example_file(instances, tokenizer, args.max_seq_length,
args.max_predictions_per_seq, output_file)
if __name__ == "__main__":
main()
\ No newline at end of file
Zulfiqar A. Bhutta trained as a physician in Pakistan in the early stages of his career.
He holds titles across various organizations in diverse geographies.
Professor Bhutta is the Founding Director of the Center of Excellence in Women and Child Health & Institute for Global Child Health & Development, at the Aga Khan University South-Central Asia, East Africa & United Kingdom.
He is currently the Co-Director at the Centre for Global Child Health, at the Hospital for Sick Children and leads many projects as a Senior Scientist at the Research Institute in the Centre for Global Child Health at Sick Kids.
He holds a Professorship at the University of Toronto in the Department of Nutritional Sciences and the Division of Epidemiology, Dalla Lana School of Public Health.
Additionally, he holds concurrent professorship at the Department of Paediatrics, Aga Khan University in Karachi, Pakistan and at the Schools of Public Health of Johns Hopkins University, Tufts University, Boston University, University of Alberta and the London School of Hygiene & Tropical Medicine.
He is a designated Distinguished National Professor of the Government of Pakistan and was the Founding Chair of the National Research Ethics Committee of the Government of Pakistan from 2003-2014.
Dr. Bhutta received his MBBS from Khyber Medical College in Peshawar, Pakistan in 1977 at which time he was named "Best Graduate of the Year" and awarded the University Gold Medal for overall distinction.
His PhD work was completed at Karolinska Institute in Stockholm, Sweden in 1996.
He is a Fellow of the Royal College of Physicians (Edinburgh & London), the Royal College of Paediatrics and Child Health (London), American Academy of Paediatrics and the Pakistan Academy of Sciences.
Following the completion of his PhD Dr. Bhutta began working as House Surgeon in Obstetrics & Gynecology at the Khyber Teaching Hospital, Peshawar (April-November 1978).
He began work in paediatrics as a physician in November of 1978 in the Professorial Unit at the Institute of Child Health, Jinnah Postgraduate Medical Centre, Karachi (Pakistan).
Through 1980's he continued his work as a surgeon and paediatrician.
He undertook his first professor position in the Department of Paediatrics, The Aga Khan University Hospital, Karachi (Pakistan), from November 1987 to June 1992.
In 2005, Dr. Bhutta became the Chairman of the Department of Paediatrics & Child Health at the Aga Khan University & Medical Center, a position held until 2008.
Following his term as Chairman he became The Noordin Noormahomed Sheriff Professor & Founding Chair, Division of Women & Child Health, The Aga Khan University, a position he held for four years.
Dr. Bhutta currently holds the titles of co-director of the Centre for Global Child Health at the Hospital for Sick Children in Toronto, and founding director of the Centre of Excellence in Women and Child Health at the Aga Khan University.
In 2020, he was appointed founding director of the Institute for Global child Health & Development at the Aga Khan University and elected Fellow to the Royal Society, United Kingdom.
Outside of his professional responsibilities Dr. Bhutta serves on various local and international boards and committees, including a series of editorial boards.
In his various capacities Dr. Bhutta has produced a large collection of publications working with his teams at Sick Kids, AKU and international partners.
These include book reviews, chapters, 1.
"Haematological disorders" "Neonatal Jaundice" in Neonatal Vade‑Mecum, Fleming PJ, Speidel BD, Dunn PM Eds, Lloyd‑Luke Publishers, UK, 1986.
Revised 2nd Edition 1991.
2.
"Nutritional management of acute and persistent diarrhoea".
A M Molla, Bhutta Z A and  A Molla.
In McNeish A S, Mittal S K and Walker-Smith J A (eds).
Recent trends in diarrhoea and malnutrition, MAMC, Delhi, 1991, pp 37-51.
3.
"Paediatric Prescribing” in "Text book of Paediatrics for developing countries"            Arif MA, Hanif SM, Wasti SMK Eds, 1989, 2nd Edition 1996,  PPA, Karachi.
& Lahore 4.
"Innovations in neonatal care : Impact on neonatal survival in the developing world:.
Bhutta Z A  Zaidi S (Editor) 1992.
TWEL Publisher.
Karachi pp 121-131 5.
"Short course therapy in Pediatrics" Bhutta Z A& Teele D.  In Tice A D, Waldvogel F (Eds), Contemporary issues in Infectious Disease Epidemiology and Management, 1993 Gardiner Caldwell, Cheshire, pp 52 - 60.
6.
"Dietary management of persistent diarrhoea".
Bhutta Z A, Molla A M, Issani Z.
In Reflections on  Diarrhoeal Disease & Nutrition  of Children".
1993 Karachi, pp 97 - 103.
7.
"Prescribing practices amongst general practitioners (GPs) and consultant paediatricians in childhood diarrhoea.”  S.Q.
Nizami, I.A.
Khan, Bhutta Z A.
In "Reflections on Diarrhoeal Disease and Nutrition of Children".
1993 Karachi, pp  88-90.
8.
"The challenge of multidrug-resistant typhoid".
Bhutta Z A.
In Puri R K, Sachdev H P S, Choudhry P, Verma I C (Eds), Current concepts in Paediatrics, 1994.
Jaypee Publishers, New Delhi, pp 403.8.
9.
"Perinatal Care in Pakistan: Current status and trends".
In Proceedings of the Workshop in Reproductive Health.
College of Physicians and Surgeons, Pakistan, Karachi, 1995, pp 95-103.
10.
“A study of whole body protein kinetics in malnourished children with persistent diarrhoea” Bhutta Z A, Nizami SQ, Isani Z, Hardy S, Hendricks K, Young V.   Report of the second RCM coordinated Research Programme for application of stable isotope tracer methods to studies of energy metabolism in malnourished populations of developing countries.
NAHRES-30 1996 IAEA Vienna.
11.
"Pneumococcal infections in Pakistan: a country report".
In Adult Immunization in Asia, Fondation Mercel Merieux, Lyon, 1998. pp 79-82.
12.
“Factors affecting protein and aminoacid metabolism in childhood from developing countries".
In Child Nutrition: an international perspective.
Editors Solomons NW, Caballero B, Brown KH.
CRC Press 1998.
13.
"Protein Digestion and Bioavailability".
In Encyclopedia of Human Nutrition.
Editors: Sadler M, Strain JJ, Caballero B.
Academic Press (London), 1998 pp.1646-54.
14.
"Perinatal Care in Pakistan.
Reproductive Health: A manual for family practice and primary health care.
Bhutta Z A, Maqbool S.  College of Physicians and Surgeons, Pakistan, Karachi, 1999, pp 69-78.
15.
“Effective interventions to reduce neonatal mortality and morbidity from perinatal infection.
Bhutta ZA.
In Costello A, Manandhar D (eds).
"Improving Newborn Infant Health in Developing Countries’ 1999.
Imperial College Press, London pp.289-308.
16.
“Ambulatory management of typhoid fever”            “Risk factors and management of micronutrient deficiencies”            “Management of persistent diarrhoea in developing countries”.
In Manual of International Child Health, British Medical Journal, 2000 (in press).
17.
“The role of Cefixime in typhoid fever during childhood” in Cefixime, Adam D, Quintiliani R (Eds), Torre-Lazur-McCann, Tokyo, 2000; pp.107-112.
18.
"Micronutrients and Child Health in the Commonwealth”, Commonwealth Foundation" (UK) (2001).
19.
"Isotopic evaluation of breast milk intake, energy metabolism growth and body composition of exclusively breastfed infants in Pakistan".
Bhutta ZA, Nizami SQ, Weaver LT, Preston T. In Application of Stable Isotopes to evaluate Growth and Body Composition of Exclusively Breastfed infants, IAEA and WHO, NAHRES Report.
2000.
20.
“Typhoid Fever in Childhood: the south Asian experience”.
Ahmad K &Bhutta ZA.
In "Recent Advances in Paediatrics", Gupte S (Ed), 2000, India .
21.
“Neonatal Infections in developing countries” in  Carrera JM, Cabero L, Baraibar R (Eds).
The Perinatal Medicine of the new Millennium.
\ No newline at end of file
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
import os
import sys
import random
import time
import math
from functools import partial
import numpy as np
import paddle
from paddle.io import DataLoader
from paddle.metric import Metric, Accuracy, Precision, Recall
from paddlenlp.datasets import GlueCoLA, GlueSST2, GlueMRPC, GlueSTSB, GlueQQP, GlueMNLI, GlueQNLI, GlueRTE
from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.transformers import BertForSequenceClassification, BertTokenizer
from paddlenlp.transformers import ElectraForSequenceClassification, ElectraTokenizer
from paddlenlp.transformers import ErnieForSequenceClassification, ErnieTokenizer
from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman
FORMAT = '%(asctime)s-%(levelname)s: %(message)s'
logging.basicConfig(level=logging.INFO, format=FORMAT)
logger = logging.getLogger(__name__)
TASK_CLASSES = {
"cola": (GlueCoLA, Mcc),
"sst-2": (GlueSST2, Accuracy),
"mrpc": (GlueMRPC, AccuracyAndF1),
"sts-b": (GlueSTSB, PearsonAndSpearman),
"qqp": (GlueQQP, AccuracyAndF1),
"mnli": (GlueMNLI, Accuracy),
"qnli": (GlueQNLI, Accuracy),
"rte": (GlueRTE, Accuracy),
}
MODEL_CLASSES = {"bert": (BertForSequenceClassification, BertTokenizer)}
def parse_args():
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--task_name",
default=None,
type=str,
required=True,
help="The name of the task to train selected in the list: " +
", ".join(TASK_CLASSES.keys()), )
parser.add_argument(
"--model_type",
default=None,
type=str,
required=True,
help="Model type selected in the list: " +
", ".join(MODEL_CLASSES.keys()), )
parser.add_argument(
"--model_name_or_path",
default=None,
type=str,
required=True,
help="Path to pre-trained model or shortcut name selected in the list: "
+ ", ".join(
sum([
list(classes[-1].pretrained_init_configuration.keys())
for classes in MODEL_CLASSES.values()
], [])), )
parser.add_argument(
"--output_dir",
default=None,
type=str,
required=True,
help="The output directory where the model predictions and checkpoints will be written.",
)
parser.add_argument(
"--max_seq_length",
default=128,
type=int,
help="The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded.", )
parser.add_argument(
"--learning_rate",
default=1e-4,
type=float,
help="The initial learning rate for Adam.")
parser.add_argument(
"--num_train_epochs",
default=3,
type=int,
help="Total number of training epochs to perform.", )
parser.add_argument(
"--logging_steps",
type=int,
default=100,
help="Log every X updates steps.")
parser.add_argument(
"--save_steps",
type=int,
default=100,
help="Save checkpoint every X updates steps.")
parser.add_argument(
"--batch_size",
default=32,
type=int,
help="Batch size per GPU/CPU for training.", )
parser.add_argument(
"--weight_decay",
default=0.0,
type=float,
help="Weight decay if we apply some.")
parser.add_argument(
"--warmup_steps",
default=0,
type=int,
help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion"
)
parser.add_argument(
"--warmup_proportion",
default=0.,
type=float,
help="Linear warmup proportion over total steps.")
parser.add_argument(
"--adam_epsilon",
default=1e-6,
type=float,
help="Epsilon for Adam optimizer.")
parser.add_argument(
"--max_steps",
default=-1,
type=int,
help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
)
parser.add_argument(
"--seed", default=42, type=int, help="random seed for initialization")
parser.add_argument(
"--n_gpu",
default=1,
type=int,
help="number of gpus to use, 0 for cpu.")
args = parser.parse_args()
return args
def set_seed(args):
random.seed(args.seed + paddle.distributed.get_rank())
np.random.seed(args.seed + paddle.distributed.get_rank())
paddle.seed(args.seed + paddle.distributed.get_rank())
def evaluate(model, loss_fct, metric, data_loader):
model.eval()
metric.reset()
for batch in data_loader:
input_ids, segment_ids, labels = batch
logits = model(input_ids, segment_ids)
loss = loss_fct(logits, labels)
correct = metric.compute(logits, labels)
metric.update(correct)
res = metric.accumulate()
if isinstance(metric, AccuracyAndF1):
logger.info(
"eval loss: %f, acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s."
% (loss.numpy(), res[0], res[1], res[2], res[3], res[4]))
elif isinstance(metric, Mcc):
logger.info("eval loss: %f, mcc: %s." % (loss.numpy(), res[0]))
elif isinstance(metric, PearsonAndSpearman):
logger.info(
"eval loss: %f, pearson: %s, spearman: %s, pearson and spearman: %s."
% (loss.numpy(), res[0], res[1], res[2]))
else:
logger.info("eval loss: %f, acc: %s." % (loss.numpy(), res))
model.train()
def convert_example(example,
tokenizer,
label_list,
max_seq_length=512,
is_test=False):
"""convert a glue example into necessary features"""
def _truncate_seqs(seqs, max_seq_length):
if len(seqs) == 1: # single sentence
# Account for [CLS] and [SEP] with "- 2"
seqs[0] = seqs[0][0:(max_seq_length - 2)]
else: # Sentence pair
# Account for [CLS], [SEP], [SEP] with "- 3"
tokens_a, tokens_b = seqs
max_seq_length -= 3
while True: # Truncate with longest_first strategy
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_seq_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
return seqs
def _concat_seqs(seqs, separators, seq_mask=0, separator_mask=1):
concat = sum((seq + sep for sep, seq in zip(separators, seqs)), [])
segment_ids = sum(
([i] * (len(seq) + len(sep))
for i, (sep, seq) in enumerate(zip(separators, seqs))), [])
if isinstance(seq_mask, int):
seq_mask = [[seq_mask] * len(seq) for seq in seqs]
if isinstance(separator_mask, int):
separator_mask = [[separator_mask] * len(sep) for sep in separators]
p_mask = sum((s_mask + mask
for sep, seq, s_mask, mask in zip(
separators, seqs, seq_mask, separator_mask)), [])
return concat, segment_ids, p_mask
if not is_test:
# `label_list == None` is for regression task
label_dtype = "int64" if label_list else "float32"
# Get the label
label = example[-1]
example = example[:-1]
# Create label maps if classification task
if label_list:
label_map = {}
for (i, l) in enumerate(label_list):
label_map[l] = i
label = label_map[label]
label = np.array([label], dtype=label_dtype)
# Tokenize raw text
tokens_raw = [tokenizer(l) for l in example]
# Truncate to the truncate_length,
tokens_trun = _truncate_seqs(tokens_raw, max_seq_length)
# Concate the sequences with special tokens
tokens_trun[0] = [tokenizer.cls_token] + tokens_trun[0]
tokens, segment_ids, _ = _concat_seqs(tokens_trun, [[tokenizer.sep_token]] *
len(tokens_trun))
# Convert the token to ids
input_ids = tokenizer.convert_tokens_to_ids(tokens)
valid_length = len(input_ids)
# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
# input_mask = [1] * len(input_ids)
if not is_test:
return input_ids, segment_ids, valid_length, label
else:
return input_ids, segment_ids, valid_length
def do_train(args):
paddle.set_device("gpu" if args.n_gpu else "cpu")
if paddle.distributed.get_world_size() > 1:
paddle.distributed.init_parallel_env()
set_seed(args)
args.task_name = args.task_name.lower()
dataset_class, metric_class = TASK_CLASSES[args.task_name]
args.model_type = args.model_type.lower()
model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
train_dataset = dataset_class.get_datasets(["train"])
tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)
trans_func = partial(
convert_example,
tokenizer=tokenizer,
label_list=train_dataset.get_labels(),
max_seq_length=args.max_seq_length)
train_dataset = train_dataset.apply(trans_func, lazy=True)
train_batch_sampler = paddle.io.DistributedBatchSampler(
train_dataset, batch_size=args.batch_size, shuffle=True)
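# batchify_fn pads input ids and segment ids to the longest sample in the
# batch, stacks sequence lengths and labels, then drops the length field
# (index 2) since the model forward does not need it.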
batchify_fn = lambda samples, fn=Tuple(
Pad(axis=0, pad_val=tokenizer.pad_token_id), # input
Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment
Stack(), # length
Stack(dtype="int64" if train_dataset.get_labels() else "float32") # label
): [data for i, data in enumerate(fn(samples)) if i != 2]
train_data_loader = DataLoader(
dataset=train_dataset,
batch_sampler=train_batch_sampler,
collate_fn=batchify_fn,
num_workers=0,
return_list=True)
if args.task_name == "mnli":
dev_dataset_matched, dev_dataset_mismatched = dataset_class.get_datasets(
["dev_matched", "dev_mismatched"])
dev_dataset_matched = dev_dataset_matched.apply(trans_func, lazy=True)
dev_dataset_mismatched = dev_dataset_mismatched.apply(
trans_func, lazy=True)
dev_batch_sampler_matched = paddle.io.BatchSampler(
dev_dataset_matched, batch_size=args.batch_size, shuffle=False)
dev_data_loader_matched = DataLoader(
dataset=dev_dataset_matched,
batch_sampler=dev_batch_sampler_matched,
collate_fn=batchify_fn,
num_workers=0,
return_list=True)
dev_batch_sampler_mismatched = paddle.io.BatchSampler(
dev_dataset_mismatched, batch_size=args.batch_size, shuffle=False)
dev_data_loader_mismatched = DataLoader(
dataset=dev_dataset_mismatched,
batch_sampler=dev_batch_sampler_mismatched,
collate_fn=batchify_fn,
num_workers=0,
return_list=True)
else:
dev_dataset = dataset_class.get_datasets(["dev"])
dev_dataset = dev_dataset.apply(trans_func, lazy=True)
dev_batch_sampler = paddle.io.BatchSampler(
dev_dataset, batch_size=args.batch_size, shuffle=False)
dev_data_loader = DataLoader(
dataset=dev_dataset,
batch_sampler=dev_batch_sampler,
collate_fn=batchify_fn,
num_workers=0,
return_list=True)
num_classes = 1 if train_dataset.get_labels() == None else len(
train_dataset.get_labels())
model = model_class.from_pretrained(
args.model_name_or_path, num_classes=num_classes)
if paddle.distributed.get_world_size() > 1:
model = paddle.DataParallel(model)
num_training_steps = args.max_steps if args.max_steps > 0 else (
len(train_data_loader) * args.num_train_epochs)
warmup_steps = args.warmup_steps if args.warmup_steps > 0 else (
int(math.floor(num_training_steps * args.warmup_proportion)))
lr_scheduler = paddle.optimizer.lr.LambdaDecay(
args.learning_rate,
lambda current_step, num_warmup_steps=warmup_steps,
num_training_steps=num_training_steps : float(
current_step) / float(max(1, num_warmup_steps))
if current_step < num_warmup_steps else max(
0.0,
float(num_training_steps - current_step) / float(
max(1, num_training_steps - num_warmup_steps))))
optimizer = paddle.optimizer.AdamW(
learning_rate=lr_scheduler,
beta1=0.9,
beta2=0.999,
epsilon=args.adam_epsilon,
parameters=model.parameters(),
weight_decay=args.weight_decay,
apply_decay_param_fun=lambda x: x in [
p.name for n, p in model.named_parameters()
if not any(nd in n for nd in ["bias", "norm"])
])
loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_dataset.get_labels(
) else paddle.nn.loss.MSELoss()
metric = metric_class()
global_step = 0
tic_train = time.time()
for epoch in range(args.num_train_epochs):
for step, batch in enumerate(train_data_loader):
global_step += 1
input_ids, segment_ids, labels = batch
logits = model(input_ids, segment_ids)
loss = loss_fct(logits, labels)
loss.backward()
optimizer.step()
lr_scheduler.step()
optimizer.clear_gradients()
if global_step % args.logging_steps == 0:
logger.info(
"global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s"
% (global_step, num_training_steps, epoch, step,
paddle.distributed.get_rank(), loss, optimizer.get_lr(),
args.logging_steps / (time.time() - tic_train)))
tic_train = time.time()
if global_step % args.save_steps == 0:
tic_eval = time.time()
if args.task_name == "mnli":
evaluate(model, loss_fct, metric, dev_data_loader_matched)
evaluate(model, loss_fct, metric,
dev_data_loader_mismatched)
logger.info("eval done total : %s s" %
(time.time() - tic_eval))
else:
evaluate(model, loss_fct, metric, dev_data_loader)
logger.info("eval done total : %s s" %
(time.time() - tic_eval))
if (not args.n_gpu > 1) or paddle.distributed.get_rank() == 0:
output_dir = os.path.join(
args.output_dir, "%s_ft_model_%d.pdparams" %
(args.task_name, global_step))
if not os.path.exists(output_dir):
os.makedirs(output_dir)
# Need better way to get inner model of DataParallel
model_to_save = model._layers if isinstance(
model, paddle.DataParallel) else model
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
def print_arguments(args):
"""print arguments"""
print('----------- Configuration Arguments -----------')
for arg, value in sorted(vars(args).items()):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
if __name__ == "__main__":
args = parse_args()
print_arguments(args)
if args.n_gpu > 1:
paddle.distributed.spawn(do_train, args=(args, ), nprocs=args.n_gpu)
else:
do_train(args)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import collections
import itertools
import logging
import os
import random
import time
import h5py
from functools import partial
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import paddle
import paddle.distributed as dist
from paddle.io import DataLoader, Dataset
from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.transformers import BertForPretraining, BertModel, BertPretrainingCriterion
from paddlenlp.transformers import BertTokenizer
FORMAT = '%(asctime)s-%(levelname)s: %(message)s'
logging.basicConfig(level=logging.INFO, format=FORMAT)
logger = logging.getLogger(__name__)
MODEL_CLASSES = {
"bert": (BertForPretraining, BertTokenizer),
}
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
"--model_type",
default=None,
type=str,
required=True,
help="Model type selected in the list: " +
", ".join(MODEL_CLASSES.keys()),
)
parser.add_argument(
"--model_name_or_path",
default=None,
type=str,
required=True,
help="Path to pre-trained model or shortcut name selected in the list: "
+ ", ".join(
sum([
list(classes[-1].pretrained_init_configuration.keys())
for classes in MODEL_CLASSES.values()
], [])),
)
parser.add_argument(
"--input_dir",
default=None,
type=str,
required=True,
help="The input directory where the data will be read from.",
)
parser.add_argument(
"--output_dir",
default=None,
type=str,
required=True,
help=
"The output directory where the model predictions and checkpoints will be written.",
)
parser.add_argument(
"--max_predictions_per_seq",
default=80,
type=int,
help="The maximum total of masked tokens in input sequence")
parser.add_argument(
"--batch_size",
default=8,
type=int,
help="Batch size per GPU/CPU for training.",
)
parser.add_argument("--learning_rate",
default=5e-5,
type=float,
help="The initial learning rate for Adam.")
parser.add_argument("--weight_decay",
default=0.0,
type=float,
help="Weight decay if we apply some.")
parser.add_argument("--adam_epsilon",
default=1e-8,
type=float,
help="Epsilon for Adam optimizer.")
parser.add_argument("--max_grad_norm",
default=1.0,
type=float,
help="Max gradient norm.")
parser.add_argument(
"--num_train_epochs",
default=3,
type=int,
help="Total number of training epochs to perform.",
)
parser.add_argument(
"--max_steps",
default=-1,
type=int,
help=
"If > 0: set total number of training steps to perform. Override num_train_epochs.",
)
parser.add_argument("--warmup_steps",
default=0,
type=int,
help="Linear warmup over warmup_steps.")
parser.add_argument("--logging_steps",
type=int,
default=500,
help="Log every X updates steps.")
parser.add_argument("--save_steps",
type=int,
default=500,
help="Save checkpoint every X updates steps.")
parser.add_argument("--seed",
type=int,
default=42,
help="random seed for initialization")
parser.add_argument("--n_gpu",
type=int,
default=1,
help="number of gpus to use, 0 for cpu.")
args = parser.parse_args()
return args
def set_seed(args):
random.seed(args.seed + paddle.distributed.get_rank())
np.random.seed(args.seed + paddle.distributed.get_rank())
paddle.seed(args.seed + paddle.distributed.get_rank())
class WorkerInitObj(object):
def __init__(self, seed):
self.seed = seed
def __call__(self, id):
np.random.seed(seed=self.seed + id)
random.seed(self.seed + id)
def create_pretraining_dataset(input_file, max_pred_length, shared_list, args,
worker_init):
train_data = PretrainingDataset(input_file=input_file,
max_pred_length=max_pred_length)
# files have been sharded, no need to dispatch again
train_batch_sampler = paddle.io.BatchSampler(train_data,
batch_size=args.batch_size,
shuffle=True)
# DataLoader cannot be pickled because of its place.
# If it can be pickled, use global function instead of lambda and use
# ProcessPoolExecutor instead of ThreadPoolExecutor to prefetch.
def _collate_data(data, stack_fn=Stack()):
num_fields = len(data[0])
out = [None] * num_fields
# input_ids, segment_ids, input_mask, masked_lm_positions,
# masked_lm_labels, next_sentence_labels, mask_token_num
for i in (0, 1, 2, 5):
out[i] = stack_fn([x[i] for x in data])
batch_size, seq_length = out[0].shape
size = num_mask = sum(len(x[3]) for x in data)
# Padding for divisibility by 8 for fp16 or int8 usage
if size % 8 != 0:
size += 8 - (size % 8)
# masked_lm_positions
# Organize as a 1D tensor for gather or use gather_nd
out[3] = np.full(size, 0, dtype=np.int64)
# masked_lm_labels
out[4] = np.full([size, 1], -1, dtype=np.int64)
mask_token_num = 0
for i, x in enumerate(data):
for j, pos in enumerate(x[3]):
out[3][mask_token_num] = i * seq_length + pos
out[4][mask_token_num] = x[4][j]
mask_token_num += 1
# mask_token_num
out.append(np.asarray([mask_token_num], dtype=np.float32))
return out
train_data_loader = DataLoader(dataset=train_data,
batch_sampler=train_batch_sampler,
collate_fn=_collate_data,
num_workers=0,
worker_init_fn=worker_init,
return_list=True)
return train_data_loader, input_file
class PretrainingDataset(Dataset):
def __init__(self, input_file, max_pred_length):
self.input_file = input_file
self.max_pred_length = max_pred_length
f = h5py.File(input_file, "r")
keys = [
'input_ids', 'input_mask', 'segment_ids', 'masked_lm_positions',
'masked_lm_ids', 'next_sentence_labels'
]
self.inputs = [np.asarray(f[key][:]) for key in keys]
f.close()
def __len__(self):
'Denotes the total number of samples'
return len(self.inputs[0])
def __getitem__(self, index):
[
input_ids, input_mask, segment_ids, masked_lm_positions,
masked_lm_ids, next_sentence_labels
] = [
input[index].astype(np.int64)
if indice < 5 else np.asarray(input[index].astype(np.int64))
for indice, input in enumerate(self.inputs)
]
# TODO: whether to use reversed mask by changing 1s and 0s to be
# consistent with nv bert
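# Convert the 0/1 padding mask into an additive attention mask: real tokens
# become 0 and padding positions become -1e9, broadcastable over the
# attention score matrix.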
input_mask = (1 - np.reshape(input_mask.astype(np.float32),
[1, 1, input_mask.shape[0]])) * -1e9
index = self.max_pred_length
# store number of masked tokens in index
# the output of torch.nonzero differs from that of numpy.nonzero
padded_mask_indices = (masked_lm_positions == 0).nonzero()[0]
if len(padded_mask_indices) != 0:
index = padded_mask_indices[0].item()
mask_token_num = index
else:
index = 0
mask_token_num = 0
# masked_lm_labels = np.full(input_ids.shape, -1, dtype=np.int64)
# masked_lm_labels[masked_lm_positions[:index]] = masked_lm_ids[:index]
masked_lm_labels = masked_lm_ids[:index]
masked_lm_positions = masked_lm_positions[:index]
# softmax_with_cross_entropy enforce last dim size equal 1
masked_lm_labels = np.expand_dims(masked_lm_labels, axis=-1)
next_sentence_labels = np.expand_dims(next_sentence_labels, axis=-1)
return [
input_ids, segment_ids, input_mask, masked_lm_positions,
masked_lm_labels, next_sentence_labels
]
def do_train(args):
paddle.set_device("gpu" if args.n_gpu else "cpu")
if paddle.distributed.get_world_size() > 1:
paddle.distributed.init_parallel_env()
set_seed(args)
worker_init = WorkerInitObj(args.seed + paddle.distributed.get_rank())
args.model_type = args.model_type.lower()
model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)
model = BertForPretraining(
BertModel(**model_class.pretrained_init_configuration[
args.model_name_or_path]))
criterion = BertPretrainingCriterion(
getattr(model,
BertForPretraining.base_model_prefix).config["vocab_size"])
if paddle.distributed.get_world_size() > 1:
model = paddle.DataParallel(model)
# If using the default last_epoch, the lr of the first iteration is 0.
# Use `last_epoch = 0` to be consistent with nv bert.
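# Note: the default-argument expression below is evaluated when the lambda
# is created; its `else` branch would reference train_data_loader before it
# exists, so this schedule assumes max_steps > 0 (as in the README example).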
lr_scheduler = paddle.optimizer.lr.LambdaDecay(
args.learning_rate,
lambda current_step, num_warmup_steps=args.warmup_steps,
num_training_steps=args.max_steps if args.max_steps > 0 else
(len(train_data_loader) * args.num_train_epochs): float(
current_step) / float(max(1, num_warmup_steps))
if current_step < num_warmup_steps else max(
0.0,
float(num_training_steps - current_step) / float(
max(1, num_training_steps - num_warmup_steps))),
last_epoch=0)
optimizer = paddle.optimizer.AdamW(
learning_rate=lr_scheduler,
epsilon=args.adam_epsilon,
parameters=model.parameters(),
weight_decay=args.weight_decay,
apply_decay_param_fun=lambda x: x in [
p.name for n, p in model.named_parameters()
if not any(nd in n for nd in ["bias", "norm"])
])
pool = ThreadPoolExecutor(1)
global_step = 0
tic_train = time.time()
for epoch in range(args.num_train_epochs):
files = [
os.path.join(args.input_dir, f) for f in os.listdir(args.input_dir)
if os.path.isfile(os.path.join(args.input_dir, f))
and "training" in f
]
files.sort()
num_files = len(files)
random.Random(args.seed + epoch).shuffle(files)
f_start_id = 0
shared_file_list = {}
if paddle.distributed.get_world_size() > num_files:
remainder = paddle.distributed.get_world_size() % num_files
data_file = files[
(f_start_id * paddle.distributed.get_world_size() +
paddle.distributed.get_rank() + remainder * f_start_id) %
num_files]
else:
data_file = files[(f_start_id * paddle.distributed.get_world_size()
+ paddle.distributed.get_rank()) % num_files]
previous_file = data_file
train_data_loader, _ = create_pretraining_dataset(
data_file, args.max_predictions_per_seq, shared_file_list, args,
worker_init)
# TODO(guosheng): better way to process single file
single_file = True if f_start_id + 1 == len(files) else False
for f_id in range(f_start_id, len(files)):
# The data loader for the first file was already created above; skip this
# iteration unless it is the only file.
if not single_file and f_id == f_start_id:
continue
if paddle.distributed.get_world_size() > num_files:
data_file = files[(f_id * paddle.distributed.get_world_size() +
paddle.distributed.get_rank() +
remainder * f_id) % num_files]
else:
data_file = files[(f_id * paddle.distributed.get_world_size() +
paddle.distributed.get_rank()) % num_files]
previous_file = data_file
dataset_future = pool.submit(create_pretraining_dataset, data_file,
args.max_predictions_per_seq,
shared_file_list, args, worker_init)
for step, batch in enumerate(train_data_loader):
global_step += 1
(input_ids, segment_ids, input_mask, masked_lm_positions,
masked_lm_labels, next_sentence_labels,
masked_lm_scale) = batch
prediction_scores, seq_relationship_score = model(
input_ids=input_ids,
token_type_ids=segment_ids,
attention_mask=input_mask,
masked_positions=masked_lm_positions)
loss = criterion(prediction_scores, seq_relationship_score,
masked_lm_labels, next_sentence_labels,
masked_lm_scale)
if global_step % args.logging_steps == 0:
if (not args.n_gpu > 1
) or paddle.distributed.get_rank() == 0:
logger.info(
"global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s"
% (global_step, epoch, step, loss,
args.logging_steps / (time.time() - tic_train)))
tic_train = time.time()
loss.backward()
optimizer.step()
lr_scheduler.step()
optimizer.clear_gradients()
if global_step % args.save_steps == 0:
if (not args.n_gpu > 1
) or paddle.distributed.get_rank() == 0:
output_dir = os.path.join(args.output_dir,
"model_%d" % global_step)
if not os.path.exists(output_dir):
os.makedirs(output_dir)
# need better way to get inner model of DataParallel
model_to_save = model._layers if isinstance(
model, paddle.DataParallel) else model
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
paddle.save(
optimizer.state_dict(),
os.path.join(output_dir, "model_state.pdopt"))
if global_step >= args.max_steps:
del train_data_loader
return
del train_data_loader
train_data_loader, data_file = dataset_future.result(timeout=None)
if __name__ == "__main__":
args = parse_args()
if args.n_gpu > 1:
paddle.distributed.spawn(do_train, args=(args, ), nprocs=args.n_gpu)
else:
do_train(args)