Unverified commit 3ce83f24, authored by Steffy-zxf, committed by GitHub

Refine text classification and matching examples (#5251)

* update docs

* update codes

* update docs

* update codes

* update codes

* update codes

* add method tokenizer.get_special_tokens_mask()
Parent 60c50b4a
@@ -52,7 +52,7 @@
 - paddlepaddle >= 2.0.0-rc1
 ```
-pip install paddlenlp==2.0.0b
+pip install paddlenlp>=2.0.0rc
 ```
 ### Code structure
@@ -73,17 +73,12 @@ pretrained_models/
 ```shell
 # Set which GPU card to use
 CUDA_VISIBLE_DEVICES=0
-python train.py --model_type ernie --model_name ernie-tiny --n_gpu 1 --save_dir ./checkpoints
+python train.py --n_gpu 1 --save_dir ./checkpoints
 ```
 Supported configuration parameters:
-* `model_type`: required; the model type, one of bert, ernie, or roberta.
-* `model_name`: required; the short name of the specific model.
-  For `model_type=ernie`, model_name can be `ernie-1.0` or `ernie-tiny`.
-  For `model_type=bert`, model_name can be `bert-base-chinese`, `bert-wwm-chinese`, or `bert-wwm-ext-chinese`.
-  For `model_type=roberta`, model_name can be `roberta-wwm-ext-large`, `roberta-wwm-ext`, `rbt3`, or `rbtl3`.
-* `save_dir`: required; the directory where trained models are saved.
+* `save_dir`: optional; the directory where trained models are saved. Defaults to the checkpoints folder under the current directory.
 * `max_seq_length`: optional; the maximum sequence length used by the ERNIE/BERT model, at most 512. If GPU memory runs out, lower this value. Defaults to 128.
 * `batch_size`: optional; the batch size. Adjust it to the available GPU memory, and lower it if memory runs out. Defaults to 32.
 * `learning_rate`: optional; the maximum learning rate for fine-tuning. Defaults to 5e-5.
@@ -94,6 +89,45 @@ python train.py --model_type ernie --model_name ernie-tiny --n_gpu 1 --save_dir
 * `seed`: optional; the random seed. Defaults to 1000.
 * `n_gpu`: optional; the number of GPU cards used in training. Defaults to 1; if n_gpu=0, training runs on the CPU.
+The pretrained model used in the code example is ERNIE. To use another pretrained model such as BERT, RoBERTa, or Electra, simply swap the `model` and `tokenizer`.
+```python
+# Use the ernie pretrained model
+# ernie
+model = ppnlp.transformers.ErnieForSequenceClassification.from_pretrained('ernie', num_classes=2)
+tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie')
+
+# ernie-tiny
+# model = ppnlp.transformers.ErnieForSequenceClassification.from_pretrained('ernie-tiny', num_classes=2)
+# tokenizer = ppnlp.transformers.ErnieTinyTokenizer.from_pretrained('ernie-tiny')
+
+# Use the bert pretrained model
+# bert-base-chinese
+# model = ppnlp.transformers.BertForSequenceClassification.from_pretrained('bert-base-chinese', num_classes=2)
+# tokenizer = ppnlp.transformers.BertTokenizer.from_pretrained('bert-base-chinese')
+
+# bert-wwm-chinese
+# model = ppnlp.transformers.BertForSequenceClassification.from_pretrained('bert-wwm-chinese', num_classes=2)
+# tokenizer = ppnlp.transformers.BertTokenizer.from_pretrained('bert-wwm-chinese')
+
+# bert-wwm-ext-chinese
+# model = ppnlp.transformers.BertForSequenceClassification.from_pretrained('bert-wwm-ext-chinese', num_classes=2)
+# tokenizer = ppnlp.transformers.BertTokenizer.from_pretrained('bert-wwm-ext-chinese')
+
+# Use the roberta pretrained model
+# roberta-wwm-ext
+# model = ppnlp.transformers.RobertaForSequenceClassification.from_pretrained('roberta-wwm-ext', num_classes=2)
+# tokenizer = ppnlp.transformers.RobertaTokenizer.from_pretrained('roberta-wwm-ext')
+
+# roberta-wwm-ext-large
+# model = ppnlp.transformers.RobertaForSequenceClassification.from_pretrained('roberta-wwm-ext-large', num_classes=2)
+# tokenizer = ppnlp.transformers.RobertaTokenizer.from_pretrained('roberta-wwm-ext-large')
+```
+
+For more pretrained models, see [transformers](../../../docs/transformers.md).
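Whichever pair is chosen, the tokenizer is what turns raw text into the `input_ids` and `segment_ids` the model consumes. A minimal sketch of that step, assuming the 2.0rc-era `encode` API that returns a dict with `input_ids` and `segment_ids` keys (treat the call signature and keys as an assumption if your installed version differs):

```python
import paddlenlp as ppnlp

tokenizer = ppnlp.transformers.ErnieTinyTokenizer.from_pretrained('ernie-tiny')

# Assumed return format: a dict whose 'input_ids' are token ids with
# [CLS]/[SEP] added and whose 'segment_ids' mark each token's segment.
encoded = tokenizer.encode(text='这个宾馆比较陈旧了', max_seq_len=128)
input_ids = encoded['input_ids']
segment_ids = encoded['segment_ids']
```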
 While running, the program automatically performs training, evaluation, and testing, and saves the model to the specified `save_dir` during training.
 For example:
@@ -114,7 +148,7 @@ checkpoints/
 How to run:
 ```shell
-python export_model.py --model_type=roberta --model_name=roberta-wwm-ext --params_path=./checkpoint/model_200/model_state.pdparams --output_path=./static_graph_params
+python export_model.py --params_path=./checkpoint/model_900/model_state.pdparams --output_path=./static_graph_params
 ```
 Here `params_path` is the path of the parameters saved by dynamic-graph training, and `output_path` is the export path for the static-graph parameters.
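Once exported, the static-graph model can be served without any of the training code. A minimal sketch, assuming `export_model.py` saved the model via `paddle.jit.save` (the path and the dummy inputs are illustrative):

```python
import paddle

# Restore the exported static-graph program and parameters as a Layer.
model = paddle.jit.load('./static_graph_params')
model.eval()

# Illustrative inputs: a padded batch of token ids plus matching segment ids.
input_ids = paddle.to_tensor([[1, 75, 538, 2]], dtype='int64')
segment_ids = paddle.zeros_like(input_ids)
logits = model(input_ids, segment_ids)
```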
@@ -123,7 +157,7 @@ python export_model.py --model_type=roberta --model_name=roberta-wwm-ext --param
 Start prediction:
 ```shell
 export CUDA_VISIBLE_DEVICES=0
-python predict.py --model_type ernie --model_name ernie-tiny --params_path checkpoints/model_400/model_state.pdparams
+python predict.py --params_path checkpoints/model_900/model_state.pdparams
 ```
 Provide the data to be predicted, as in the following example:
...
@@ -25,46 +25,23 @@ import paddle.nn.functional as F
 from paddlenlp.data import Stack, Tuple, Pad
 import paddlenlp as ppnlp

-MODEL_CLASSES = {
-    "bert": (ppnlp.transformers.BertForSequenceClassification,
-             ppnlp.transformers.BertTokenizer),
-    'ernie': (ppnlp.transformers.ErnieForSequenceClassification,
-              ppnlp.transformers.ErnieTokenizer),
-    'roberta': (ppnlp.transformers.RobertaForSequenceClassification,
-                ppnlp.transformers.RobertaTokenizer),
-}
-
 # yapf: disable
-def parse_args():
-    parser = argparse.ArgumentParser()
-    # Required parameters
-    parser.add_argument("--model_type", default='roberta', required=True, type=str, help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()))
-    parser.add_argument("--model_name_or_path", default='roberta-wwm-ext', required=True, type=str, help="Path to pre-trained model or shortcut name selected in the list: " +
-        ", ".join(sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], [])))
-    parser.add_argument("--params_path", type=str, required=True, default='./checkpoint/model_200/model_state.pdparams', help="The path to model parameters to be loaded.")
-    parser.add_argument("--output_path", type=str, default='./static_graph_params', help="The path of model parameter in static graph to be saved.")
-    args = parser.parse_args()
-    return args
+parser = argparse.ArgumentParser()
+parser.add_argument("--params_path", type=str, required=True, default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.")
+parser.add_argument("--output_path", type=str, default='./static_graph_params', help="The path of model parameter in static graph to be saved.")
+args = parser.parse_args()
 # yapf: enable

 if __name__ == "__main__":
-    args = parse_args()
-    args.model_type = args.model_type.lower()
-    model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
-    if args.model_name_or_path == 'ernie-tiny':
-        # ErnieTinyTokenizer is special for ernie-tiny pretained model.
-        tokenizer = ppnlp.transformers.ErnieTinyTokenizer.from_pretrained(
-            args.model_name_or_path)
-    else:
-        tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)
+    # ErnieTinyTokenizer is specific to the ernie-tiny pretrained model.
+    tokenizer = ppnlp.transformers.ErnieTinyTokenizer.from_pretrained(
+        'ernie-tiny')

     # The number of labels should be in accordance with the training dataset.
     label_map = {0: 'negative', 1: 'positive'}
-    model = model_class.from_pretrained(
-        args.model_name_or_path, num_classes=len(label_map))
+    model = ppnlp.transformers.ErnieForSequenceClassification.from_pretrained(
+        "ernie-tiny", num_classes=len(label_map))

     if args.params_path and os.path.isfile(args.params_path):
         state_dict = paddle.load(args.params_path)
...
@@ -25,31 +25,14 @@ import paddle.nn.functional as F
 from paddlenlp.data import Stack, Tuple, Pad
 import paddlenlp as ppnlp

-MODEL_CLASSES = {
-    "bert": (ppnlp.transformers.BertForSequenceClassification,
-             ppnlp.transformers.BertTokenizer),
-    'ernie': (ppnlp.transformers.ErnieForSequenceClassification,
-              ppnlp.transformers.ErnieTokenizer),
-    'roberta': (ppnlp.transformers.RobertaForSequenceClassification,
-                ppnlp.transformers.RobertaTokenizer),
-}
-
 # yapf: disable
-def parse_args():
-    parser = argparse.ArgumentParser()
-    # Required parameters
-    parser.add_argument("--model_type", default='ernie', required=True, type=str, help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()))
-    parser.add_argument("--model_name_or_path", default='ernie-tiny', required=True, type=str, help="Path to pre-trained model or shortcut name selected in the list: " +
-        ", ".join(sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], [])))
-    parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.")
-    parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. "
-        "Sequences longer than this will be truncated, sequences shorter will be padded.")
-    parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.")
-    parser.add_argument("--n_gpu", type=int, default=1, help="Number of GPUs to use, 0 for CPU.")
-    args = parser.parse_args()
-    return args
+parser = argparse.ArgumentParser()
+parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.")
+parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. "
+    "Sequences longer than this will be truncated, sequences shorter will be padded.")
+parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.")
+parser.add_argument("--n_gpu", type=int, default=1, help="Number of GPUs to use, 0 for CPU.")
+args = parser.parse_args()
 # yapf: enable

@@ -134,23 +117,16 @@ def predict(model, data, tokenizer, label_map, batch_size=1):
             is_test=True)
         examples.append((input_ids, segment_ids))

+    # Separates data into some batches.
+    batches = [
+        examples[idx:idx + batch_size]
+        for idx in range(0, len(examples), batch_size)
+    ]
     batchify_fn = lambda samples, fn=Tuple(
         Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
         Pad(axis=0, pad_val=tokenizer.pad_token_id),  # segment
     ): fn(samples)
-
-    # Seperates data into some batches.
-    batches = []
-    one_batch = []
-    for example in examples:
-        one_batch.append(example)
-        if len(one_batch) == batch_size:
-            batches.append(one_batch)
-            one_batch = []
-    if one_batch:
-        # The last batch whose size is less than the config batch_size setting.
-        batches.append(one_batch)

     results = []
     model.eval()
     for batch in batches:
@@ -167,18 +143,11 @@ def predict(model, data, tokenizer, label_map, batch_size=1):

 if __name__ == "__main__":
-    args = parse_args()
     paddle.set_device("gpu" if args.n_gpu else "cpu")
-    args.model_type = args.model_type.lower()
-    model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
-    if args.model_name_or_path == 'ernie-tiny':
-        # ErnieTinyTokenizer is special for ernie-tiny pretained model.
-        tokenizer = ppnlp.transformers.ErnieTinyTokenizer.from_pretrained(
-            args.model_name_or_path)
-    else:
-        tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)
+    # ErnieTinyTokenizer is specific to the ernie-tiny pretrained model.
+    tokenizer = ppnlp.transformers.ErnieTinyTokenizer.from_pretrained(
+        'ernie-tiny')

     data = [
         '这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般',
@@ -187,8 +156,8 @@ if __name__ == "__main__":
     ]
     label_map = {0: 'negative', 1: 'positive'}

-    model = model_class.from_pretrained(
-        args.model_name_or_path, num_classes=len(label_map))
+    model = ppnlp.transformers.ErnieForSequenceClassification.from_pretrained(
+        'ernie-tiny', num_classes=len(label_map))

     if args.params_path and os.path.isfile(args.params_path):
         state_dict = paddle.load(args.params_path)
...
@@ -25,97 +25,28 @@ import paddle.nn.functional as F
 from paddlenlp.data import Stack, Tuple, Pad
 import paddlenlp as ppnlp

-MODEL_CLASSES = {
-    "bert": (ppnlp.transformers.BertForSequenceClassification,
-             ppnlp.transformers.BertTokenizer),
-    'ernie': (ppnlp.transformers.ErnieForSequenceClassification,
-              ppnlp.transformers.ErnieTokenizer),
-    'roberta': (ppnlp.transformers.RobertaForSequenceClassification,
-                ppnlp.transformers.RobertaTokenizer),
-}
-
-
-def parse_args():
-    parser = argparse.ArgumentParser()
-    # Required parameters
-    parser.add_argument(
-        "--model_type",
-        default='ernie',
-        required=True,
-        type=str,
-        help="Model type selected in the list: " +
-        ", ".join(MODEL_CLASSES.keys()))
-    parser.add_argument(
-        "--model_name",
-        default='ernie-tiny',
-        required=True,
-        type=str,
-        help="Path to pre-trained model or shortcut name selected in the list: "
-        + ", ".join(
-            sum([
-                list(classes[-1].pretrained_init_configuration.keys())
-                for classes in MODEL_CLASSES.values()
-            ], [])))
-    parser.add_argument(
-        "--save_dir",
-        default='./checkpoint',
-        required=True,
-        type=str,
-        help="The output directory where the model checkpoints will be written.")
-    parser.add_argument(
-        "--max_seq_length",
-        default=128,
-        type=int,
-        help="The maximum total input sequence length after tokenization. "
-        "Sequences longer than this will be truncated, sequences shorter will be padded."
-    )
-    parser.add_argument(
-        "--batch_size",
-        default=32,
-        type=int,
-        help="Batch size per GPU/CPU for training.")
-    parser.add_argument(
-        "--learning_rate",
-        default=5e-5,
-        type=float,
-        help="The initial learning rate for Adam.")
-    parser.add_argument(
-        "--weight_decay",
-        default=0.0,
-        type=float,
-        help="Weight decay if we apply some.")
-    parser.add_argument(
-        "--epochs",
-        default=3,
-        type=int,
-        help="Total number of training epochs to perform.")
-    parser.add_argument(
-        "--warmup_proption",
-        default=0.0,
-        type=float,
-        help="Linear warmup proption over the training process.")
-    parser.add_argument(
-        "--init_from_ckpt",
-        type=str,
-        default=None,
-        help="The path of checkpoint to be loaded.")
-    parser.add_argument(
-        "--seed", type=int, default=1000, help="random seed for initialization")
-    parser.add_argument(
-        "--n_gpu",
-        type=int,
-        default=1,
-        help="Number of GPUs to use, 0 for CPU.")
-    args = parser.parse_args()
-    return args
-
-
-def set_seed(args):
+# yapf: disable
+parser = argparse.ArgumentParser()
+parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.")
+parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. "
+    "Sequences longer than this will be truncated, sequences shorter will be padded.")
+parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.")
+parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
+parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.")
+parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs to perform.")
+parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.")
+parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.")
+parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization")
+parser.add_argument("--n_gpu", type=int, default=1, help="Number of GPUs to use, 0 for CPU.")
+args = parser.parse_args()
+# yapf: enable
+
+
+def set_seed(seed):
     """sets random seed"""
-    random.seed(args.seed)
-    np.random.seed(args.seed)
-    paddle.seed(args.seed)
+    random.seed(seed)
+    np.random.seed(seed)
+    paddle.seed(seed)
 @paddle.no_grad()
@@ -223,24 +154,30 @@ def create_dataloader(dataset,
         return_list=True)

-def do_train(args):
-    set_seed(args)
+def do_train():
+    set_seed(args.seed)
     paddle.set_device("gpu" if args.n_gpu else "cpu")
     world_size = paddle.distributed.get_world_size()
     if world_size > 1:
         paddle.distributed.init_parallel_env()

-    args.model_type = args.model_type.lower()
-    model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
-
     train_dataset, dev_dataset, test_dataset = ppnlp.datasets.ChnSentiCorp.get_datasets(
         ['train', 'dev', 'test'])

-    if args.model_name == 'ernie-tiny':
-        # ErnieTinyTokenizer is special for ernie-tiny pretained model.
-        tokenizer = ppnlp.transformers.ErnieTinyTokenizer.from_pretrained(
-            args.model_name)
-    else:
-        tokenizer = tokenizer_class.from_pretrained(args.model_name)
+    # If you want to use a bert/roberta/electra pretrained model,
+    # model = ppnlp.transformers.BertForSequenceClassification.from_pretrained('bert-base-chinese', num_classes=2)
+    # model = ppnlp.transformers.RobertaForSequenceClassification.from_pretrained('roberta-wwm-ext', num_classes=2)
+    # model = ppnlp.transformers.ElectraForSequenceClassification.from_pretrained('chinese-electra-small', num_classes=2)
+    model = ppnlp.transformers.ErnieForSequenceClassification.from_pretrained(
+        'ernie-tiny', num_classes=len(train_dataset.get_labels()))
+
+    # If you want to use a bert/roberta/electra pretrained tokenizer,
+    # tokenizer = ppnlp.transformers.BertTokenizer.from_pretrained('bert-base-chinese')
+    # tokenizer = ppnlp.transformers.RobertaTokenizer.from_pretrained('roberta-wwm-ext')
+    # tokenizer = ppnlp.transformers.ElectraTokenizer.from_pretrained('chinese-electra-small')
+    # ErnieTinyTokenizer is specific to the ernie-tiny pretrained model.
+    tokenizer = ppnlp.transformers.ErnieTinyTokenizer.from_pretrained(
+        'ernie-tiny')

     trans_func = partial(
         convert_example,
@@ -271,16 +208,13 @@ def do_train(args):
         batchify_fn=batchify_fn,
         trans_fn=trans_func)

-    model = model_class.from_pretrained(
-        args.model_name, num_classes=len(train_dataset.get_labels()))
-
     if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt):
         state_dict = paddle.load(args.init_from_ckpt)
         model.set_dict(state_dict)
     model = paddle.DataParallel(model)

     num_training_steps = len(train_data_loader) * args.epochs
-    num_warmup_steps = int(args.warmup_proption * num_training_steps)
+    num_warmup_steps = int(args.warmup_proportion * num_training_steps)

     def get_lr_factor(current_step):
         if current_step < num_warmup_steps:
@@ -342,8 +276,7 @@ def do_train(args):

 if __name__ == "__main__":
-    args = parse_args()
     if args.n_gpu > 1:
-        paddle.distributed.spawn(do_train, args=(args, ), nprocs=args.n_gpu)
+        paddle.distributed.spawn(do_train, nprocs=args.n_gpu)
     else:
-        do_train(args)
+        do_train()
@@ -115,7 +115,7 @@ PaddleNLP provides a series of text representation techniques, such as the `seq2vec` module.
 - paddlepaddle >= 2.0.0-rc1
 ```
-pip install paddlenlp==2.0.0b
+pip install paddlenlp>=2.0.0rc
 ```
 ### Code structure
...
@@ -16,8 +16,7 @@ import argparse
 import paddle
 import paddlenlp as ppnlp
-from utils import load_vocab
+from paddlenlp.data import Vocab

 # yapf: disable
 parser = argparse.ArgumentParser(__doc__)
@@ -31,7 +30,7 @@ args = parser.parse_args()
 def main():
     # Load vocab.
-    vocab = load_vocab(args.vocab_path)
+    vocab = Vocab.load_vocabulary(args.vocab_path)
     label_map = {0: 'negative', 1: 'positive'}

     # Construct the network.
...
@@ -14,10 +14,11 @@
 import argparse

 import paddle
-import paddlenlp as ppnlp
 import paddle.nn.functional as F
+import paddlenlp as ppnlp
+from paddlenlp.data import JiebaTokenizer, Stack, Tuple, Pad

-from utils import load_vocab, generate_batch, preprocess_prediction_data
+from utils import preprocess_prediction_data

 # yapf: disable
 parser = argparse.ArgumentParser(__doc__)
@@ -30,7 +31,7 @@ args = parser.parse_args()
 # yapf: enable

-def predict(model, data, label_map, collate_fn, batch_size=1, pad_token_id=0):
+def predict(model, data, label_map, batch_size=1, pad_token_id=0):
     """
     Predicts the data labels.
@@ -39,8 +40,6 @@ def predict(model, data, label_map, collate_fn, batch_size=1, pad_token_id=0):
         data (obj:`List(Example)`): The processed data, each element of which is an Example (namedtuple) object.
             An Example object contains `text` (word_ids) and `seq_len` (sequence length).
         label_map(obj:`dict`): The label id (key) to label str (value) map.
-        collate_fn(obj: `callable`): function to generate mini-batch data by merging
-            the sample list.
         batch_size(obj:`int`, defaults to 1): The batch size.
         pad_token_id(obj:`int`, optional, defaults to 0): The pad token index.
@@ -49,22 +48,18 @@ def predict(model, data, label_map, collate_fn, batch_size=1, pad_token_id=0):
     """
     # Separates data into some batches.
-    batches = []
-    one_batch = []
-    for example in data:
-        one_batch.append(example)
-        if len(one_batch) == batch_size:
-            batches.append(one_batch)
-            one_batch = []
-    if one_batch:
-        # The last batch whose size is less than the config batch_size setting.
-        batches.append(one_batch)
+    batches = [
+        data[idx:idx + batch_size] for idx in range(0, len(data), batch_size)
+    ]
+    batchify_fn = lambda samples, fn=Tuple(
+        Pad(axis=0, pad_val=pad_token_id),  # input_ids
+        Stack(dtype="int64"),  # seq len
+    ): [data for data in fn(samples)]

     results = []
     model.eval()
     for batch in batches:
-        texts, seq_lens = collate_fn(
-            batch, pad_token_id=pad_token_id, return_label=False)
+        texts, seq_lens = batchify_fn(batch)
         texts = paddle.to_tensor(texts)
         seq_lens = paddle.to_tensor(seq_lens)
         logits = model(texts, seq_lens)
@@ -78,8 +73,9 @@ def predict(model, data, label_map, collate_fn, batch_size=1, pad_token_id=0):

 if __name__ == "__main__":
     paddle.set_device("gpu") if args.use_gpu else paddle.set_device("cpu")
     # Loads vocab.
-    vocab = load_vocab(args.vocab_path)
+    vocab = ppnlp.data.Vocab.load_vocabulary(
+        args.vocab_path, unk_token='[UNK]', pad_token='[PAD]')
     label_map = {0: 'negative', 1: 'positive'}

     # Constructs the network.
@@ -97,14 +93,14 @@ if __name__ == "__main__":
         '怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片',
         '作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。',
     ]
-    examples = preprocess_prediction_data(data, vocab)
+    tokenizer = JiebaTokenizer(vocab)
+    examples = preprocess_prediction_data(data, tokenizer)

     results = predict(
         model,
         examples,
         label_map=label_map,
         batch_size=args.batch_size,
-        collate_fn=generate_batch)
+        pad_token_id=vocab.token_to_idx.get("[PAD]", 0))

     for idx, text in enumerate(data):
         print('Data: {} \t Label: {}'.format(text, results[idx]))
...
@@ -19,10 +19,10 @@ import random
 import numpy as np
 import paddle
 import paddlenlp as ppnlp
-from paddlenlp.data import Stack, Tuple, Pad
+from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab
 from paddlenlp.datasets import ChnSentiCorp

-from utils import load_vocab, convert_example
+from utils import convert_example

 # yapf: disable
 parser = argparse.ArgumentParser(__doc__)
@@ -50,7 +50,6 @@ def create_dataloader(dataset,
                       mode='train',
                       batch_size=1,
                       use_gpu=False,
-                      pad_token_id=0,
                       batchify_fn=None):
     """
     Creates a dataloader.
@@ -61,7 +60,6 @@ def create_dataloader(dataset,
         mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly.
         batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch.
         use_gpu(obj:`bool`, optional, defaults to obj:`False`): Whether to use gpu to run.
-        pad_token_id(obj:`int`, optional, defaults to 0): The pad token index.
         batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging
             the sample list, None for only stack each fields of sample in axis
             0(same as :attr::`np.stack(..., axis=0)`).
@@ -95,8 +93,9 @@ if __name__ == "__main__":
     if not os.path.exists(args.vocab_path):
         raise RuntimeError('The vocab_path can not be found in the path %s' %
                            args.vocab_path)
-    vocab = load_vocab(args.vocab_path)
+    vocab = Vocab.load_vocabulary(
+        args.vocab_path, unk_token='[UNK]', pad_token='[PAD]')

     # Loads dataset.
     train_ds, dev_ds, test_ds = ChnSentiCorp.get_datasets(
         ['train', 'dev', 'test'])
@@ -110,13 +109,10 @@ if __name__ == "__main__":
     model = paddle.Model(model)

     # Reads data and generates mini-batches.
-    trans_fn = partial(
-        convert_example,
-        vocab=vocab,
-        unk_token_id=vocab.get('[UNK]', 1),
-        is_test=False)
+    tokenizer = JiebaTokenizer(vocab)
+    trans_fn = partial(convert_example, tokenizer=tokenizer, is_test=False)
     batchify_fn = lambda samples, fn=Tuple(
-        Pad(axis=0, pad_val=vocab['[PAD]']),  # input_ids
+        Pad(axis=0, pad_val=vocab.token_to_idx.get('[PAD]', 0)),  # input_ids
         Stack(dtype="int64"),  # seq len
         Stack(dtype="int64")  # label
     ): [data for data in fn(samples)]
@@ -126,7 +122,6 @@ if __name__ == "__main__":
         batch_size=args.batch_size,
         mode='train',
         use_gpu=args.use_gpu,
-        pad_token_id=vocab.get('[PAD]', 0),
         batchify_fn=batchify_fn)
     dev_loader = create_dataloader(
         dev_ds,
@@ -134,7 +129,6 @@ if __name__ == "__main__":
         batch_size=args.batch_size,
         mode='validation',
         use_gpu=args.use_gpu,
-        pad_token_id=vocab.get('[PAD]', 0),
         batchify_fn=batchify_fn)
     test_loader = create_dataloader(
         test_ds,
@@ -142,7 +136,6 @@ if __name__ == "__main__":
         batch_size=args.batch_size,
         mode='test',
         use_gpu=args.use_gpu,
-        pad_token_id=vocab.get('[PAD]', 0),
         batchify_fn=batchify_fn)

     optimizer = paddle.optimizer.Adam(
...
@@ -11,113 +11,26 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-import jieba
 import numpy as np

-def load_vocab(vocab_file):
-    """Loads a vocabulary file into a dictionary."""
-    vocab = {}
-    with open(vocab_file, "r", encoding="utf-8") as reader:
-        tokens = reader.readlines()
-    for index, token in enumerate(tokens):
-        token = token.rstrip("\n").split("\t")[0]
-        vocab[token] = index
-    return vocab
-
-def convert_ids_to_tokens(wids, inversed_vocab):
-    """ Converts a token string (or a sequence of tokens) in a single integer id
-        (or a sequence of ids), using the vocabulary.
-    """
-    tokens = []
-    for wid in wids:
-        wstr = inversed_vocab.get(wid, None)
-        if wstr:
-            tokens.append(wstr)
-    return tokens
-
-def convert_tokens_to_ids(tokens, vocab):
-    """ Converts a token id (or a sequence of id) in a token string
-        (or a sequence of tokens), using the vocabulary.
-    """
-    ids = []
-    unk_id = vocab.get('[UNK]', None)
-    for token in tokens:
-        wid = vocab.get(token, unk_id)
-        if wid:
-            ids.append(wid)
-    return ids
-
-def pad_texts_to_max_seq_len(texts, max_seq_len, pad_token_id=0):
-    """
-    Padded the texts to the max sequence length if the length of text is lower than it.
-    Unless it truncates the text.
-    Args:
-        texts(obj:`list`): Texts which contrains a sequence of word ids.
-        max_seq_len(obj:`int`): Max sequence length.
-        pad_token_id(obj:`int`, optinal, defaults to 0) : The pad token index.
-    """
-    for index, text in enumerate(texts):
-        seq_len = len(text)
-        if seq_len < max_seq_len:
-            padded_tokens = [pad_token_id for _ in range(max_seq_len - seq_len)]
-            new_text = text + padded_tokens
-            texts[index] = new_text
-        elif seq_len > max_seq_len:
-            new_text = text[:max_seq_len]
-            texts[index] = new_text
-
-def generate_batch(batch, pad_token_id=0, return_label=True):
-    """
-    Generates a batch whose text will be padded to the max sequence length in the batch.
-    Args:
-        batch(obj:`List[Example]`) : One batch, which contains texts, labels and the true sequence lengths.
-        pad_token_id(obj:`int`, optinal, defaults to 0) : The pad token index.
-    Returns:
-        batch(:obj:`Tuple[list]`): The batch data which contains texts, seq_lens and labels.
-    """
-    seq_lens = [entry[1] for entry in batch]
-    batch_max_seq_len = max(seq_lens)
-    texts = [entry[0] for entry in batch]
-    pad_texts_to_max_seq_len(texts, batch_max_seq_len, pad_token_id)
-    if return_label:
-        labels = [[entry[-1]] for entry in batch]
-        return texts, seq_lens, labels
-    else:
-        return texts, seq_lens
-
-def convert_example(example, vocab, unk_token_id=1, is_test=False):
+def convert_example(example, tokenizer, is_test=False):
     """
     Builds model inputs from a sequence for sequence classification tasks.
     It uses `jieba.cut` to tokenize the text.
     Args:
         example(obj:`list[str]`): List of input data, containing the text and the label if it has one.
-        vocab(obj:`dict`): The vocabulary.
-        unk_token_id(obj:`int`, defaults to 1): The unknown token id.
+        tokenizer(obj: paddlenlp.data.JiebaTokenizer): It uses jieba to cut the Chinese string into segments.
         is_test(obj:`False`, defaults to `False`): Whether the example contains a label or not.
     Returns:
-        input_ids(obj:`list[int]`): The list of token ids.s
+        input_ids(obj:`list[int]`): The list of token ids.
         valid_length(obj:`int`): The valid length of the input sequence.
         label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test.
     """
-    input_ids = []
-    for token in jieba.cut(example[0]):
-        token_id = vocab.get(token, unk_token_id)
-        input_ids.append(token_id)
+    input_ids = tokenizer.encode(example[0])
     valid_length = np.array(len(input_ids), dtype='int64')
     input_ids = np.array(input_ids, dtype='int64')
@@ -128,12 +41,13 @@ def convert_example(example, vocab, unk_token_id=1, is_test=False):
     return input_ids, valid_length

-def preprocess_prediction_data(data, vocab):
+def preprocess_prediction_data(data, tokenizer):
     """
     It processes the prediction data into the format used in training.
     Args:
         data (obj:`List[str]`): The prediction data, each element of which is a tokenized text.
+        tokenizer(obj: paddlenlp.data.JiebaTokenizer): It uses jieba to cut the Chinese string into segments.
     Returns:
         examples (obj:`List(Example)`): The processed data, each element of which is an Example (namedtuple) object.
@@ -142,7 +56,6 @@ def preprocess_prediction_data(data, vocab):
     """
     examples = []
     for text in data:
-        tokens = " ".join(jieba.cut(text)).split(' ')
-        ids = convert_tokens_to_ids(tokens, vocab)
+        ids = tokenizer.encode(text)
         examples.append([ids, len(ids)])
     return examples
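Taken together, the new preprocessing path replaces the hand-rolled vocab helpers with `Vocab` and `JiebaTokenizer` from `paddlenlp.data`. A minimal end-to-end sketch of that path; both calls appear in the diff above, while the vocab file name is illustrative:

```python
from paddlenlp.data import JiebaTokenizer, Vocab

# Vocab.load_vocabulary reads a one-token-per-line vocab file (assumed to
# exist locally) and registers the special tokens.
vocab = Vocab.load_vocabulary(
    './simnet_word_dict.txt', unk_token='[UNK]', pad_token='[PAD]')
tokenizer = JiebaTokenizer(vocab)

# encode() segments the text with jieba, then maps tokens to ids, falling
# back to the [UNK] id for out-of-vocabulary words.
ids = tokenizer.encode('这个宾馆比较陈旧了')
print(ids, len(ids))
```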
@@ -89,17 +89,12 @@ sentence_transformers/
 ```shell
 # Set which GPU card to use
 CUDA_VISIBLE_DEVICES=0
-python train.py --model_type ernie --model_name ernie-1.0 --n_gpu 1 --save_dir ./checkpoints
+python train.py --save_dir ./checkpoints
 ```
 Supported configuration parameters:
-* `model_type`: required; the model type, one of bert, ernie, or roberta.
-* `model_name`: required; the short name of the specific model.
-  For `model_type=ernie`, model_name can be `ernie-1.0` or `ernie-tiny`.
-  For `model_type=bert`, model_name can be `bert-base-chinese`, `bert-wwm-chinese`, or `bert-wwm-ext-chinese`.
-  For `model_type=roberta`, model_name can be `roberta-wwm-ext`, `rbt3`, or `rbtl3`.
-* `save_dir`: required; the directory where trained models are saved.
+* `save_dir`: optional; the directory where trained models are saved. Defaults to the checkpoints folder under the current directory.
 * `max_seq_length`: optional; the maximum sequence length used by the ERNIE/BERT model, at most 512. If GPU memory runs out, lower this value. Defaults to 128.
 * `batch_size`: optional; the batch size. Adjust it to the available GPU memory, and lower it if memory runs out. Defaults to 32.
 * `learning_rate`: optional; the maximum learning rate for fine-tuning. Defaults to 5e-5.
@@ -110,6 +105,44 @@ python train.py --model_type ernie --model_name ernie-1.0 --n_gpu 1 --save_dir .
 * `seed`: optional; the random seed. Defaults to 1000.
 * `n_gpu`: optional; the number of GPU cards used in training. Defaults to 1; if n_gpu=0, training runs on the CPU.

+The pretrained model used in the code example is ERNIE. To use another pretrained model such as BERT, RoBERTa, or Electra, simply swap the `model` and `tokenizer`.
+```python
+# Use the ernie pretrained model
+# ernie
+model = ppnlp.transformers.ErnieModel.from_pretrained('ernie')
+tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie')
+
+# ernie-tiny
+# model = ppnlp.transformers.ErnieModel.from_pretrained('ernie-tiny')
+# tokenizer = ppnlp.transformers.ErnieTinyTokenizer.from_pretrained('ernie-tiny')
+
+# Use the bert pretrained model
+# bert-base-chinese
+# model = ppnlp.transformers.BertModel.from_pretrained('bert-base-chinese')
+# tokenizer = ppnlp.transformers.BertTokenizer.from_pretrained('bert-base-chinese')
+
+# bert-wwm-chinese
+# model = ppnlp.transformers.BertModel.from_pretrained('bert-wwm-chinese')
+# tokenizer = ppnlp.transformers.BertTokenizer.from_pretrained('bert-wwm-chinese')
+
+# bert-wwm-ext-chinese
+# model = ppnlp.transformers.BertModel.from_pretrained('bert-wwm-ext-chinese')
+# tokenizer = ppnlp.transformers.BertTokenizer.from_pretrained('bert-wwm-ext-chinese')
+
+# Use the roberta pretrained model
+# roberta-wwm-ext
+# model = ppnlp.transformers.RobertaModel.from_pretrained('roberta-wwm-ext')
+# tokenizer = ppnlp.transformers.RobertaTokenizer.from_pretrained('roberta-wwm-ext')
+
+# roberta-wwm-ext-large
+# model = ppnlp.transformers.RobertaModel.from_pretrained('roberta-wwm-ext-large')
+# tokenizer = ppnlp.transformers.RobertaTokenizer.from_pretrained('roberta-wwm-ext-large')
+```
+
+For more pretrained models, see [transformers](../../../docs/transformers.md).
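The two towers share one encoder; their pooled outputs are combined and classified by the `SentenceTransformer` wrapper in `model.py`, which this diff does not show. A rough sketch of the Sentence-BERT-style idea it follows, with illustrative names, a fixed hidden size, and plain mean pooling rather than the repo's exact code:

```python
import paddle
import paddle.nn as nn


class SentencePairClassifier(nn.Layer):
    """Encode query and title separately, then classify [u, v, |u - v|]."""

    def __init__(self, pretrained_model, hidden_size=768, num_classes=2):
        super().__init__()
        self.ptm = pretrained_model  # e.g. an ErnieModel instance
        self.classifier = nn.Linear(hidden_size * 3, num_classes)

    def forward(self, query_input_ids, title_input_ids,
                query_token_type_ids=None, title_token_type_ids=None):
        # PaddleNLP transformer models return (sequence_output, pooled_output).
        q_seq, _ = self.ptm(query_input_ids, token_type_ids=query_token_type_ids)
        t_seq, _ = self.ptm(title_input_ids, token_type_ids=title_token_type_ids)
        u = paddle.mean(q_seq, axis=1)  # mean-pool the token embeddings
        v = paddle.mean(t_seq, axis=1)
        features = paddle.concat([u, v, paddle.abs(u - v)], axis=-1)
        return self.classifier(features)  # logits; apply softmax downstream
```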
 While running, the program automatically performs training, evaluation, and testing, and saves the model to the specified `save_dir` during training.
 For example:
@@ -132,7 +165,7 @@ checkpoints/
 Start prediction:
 ```shell
 CUDA_VISIBLE_DEVICES=0
-python predict.py --model_type ernie --model_name ernie-tiny --params_path checkpoints/model_400/model_state.pdparams
+python predict.py --params_path checkpoints/model_400/model_state.pdparams
 ```
 Provide the data to be predicted, as in the following example:
...
@@ -26,29 +26,14 @@ from paddlenlp.data import Stack, Tuple, Pad
 from model import SentenceTransformer

-MODEL_CLASSES = {
-    "bert": (ppnlp.transformers.BertModel, ppnlp.transformers.BertTokenizer),
-    'ernie': (ppnlp.transformers.ErnieModel, ppnlp.transformers.ErnieTokenizer),
-    'roberta':
-    (ppnlp.transformers.RobertaModel, ppnlp.transformers.RobertaTokenizer)
-}
-
 # yapf: disable
-def parse_args():
-    parser = argparse.ArgumentParser()
-    # Required parameters
-    parser.add_argument("--model_type", default='ernie', type=str, help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()))
-    parser.add_argument("--model_name", default='ernie-1.0', type=str, help="Path to pre-trained model or shortcut name selected in the list: " +
-        ", ".join(sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], [])))
-    parser.add_argument("--params_path", type=str, default='./checkpoint/model_4900/model_state.pdparams', help="The path to model parameters to be loaded.")
-    parser.add_argument("--max_seq_length", default=50, type=int, help="The maximum total input sequence length after tokenization. "
-        "Sequences longer than this will be truncated, sequences shorter will be padded.")
-    parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.")
-    parser.add_argument("--n_gpu", type=int, default=0, help="Number of GPUs to use, 0 for CPU.")
-    args = parser.parse_args()
-    return args
+parser = argparse.ArgumentParser()
+parser.add_argument("--params_path", type=str, default='./checkpoint/model_2700/model_state.pdparams', help="The path to model parameters to be loaded.")
+parser.add_argument("--max_seq_length", default=50, type=int, help="The maximum total input sequence length after tokenization. "
+    "Sequences longer than this will be truncated, sequences shorter will be padded.")
+parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.")
+parser.add_argument("--n_gpu", type=int, default=0, help="Number of GPUs to use, 0 for CPU.")
+args = parser.parse_args()
 # yapf: enable

@@ -143,6 +128,11 @@ def predict(model, data, tokenizer, label_map, batch_size=1):
         examples.append((query_input_ids, query_segment_ids, title_input_ids,
                          title_segment_ids))

+    # Separates data into some batches.
+    batches = [
+        examples[idx:idx + batch_size]
+        for idx in range(0, len(examples), batch_size)
+    ]
     batchify_fn = lambda samples, fn=Tuple(
         Pad(axis=0, pad_val=tokenizer.pad_token_id),  # query_input
         Pad(axis=0, pad_val=tokenizer.pad_token_id),  # query_segment
@@ -150,18 +140,6 @@ def predict(model, data, tokenizer, label_map, batch_size=1):
         Pad(axis=0, pad_val=tokenizer.pad_token_id),  # title_segment
     ): [data for data in fn(samples)]

-    # Seperates data into some batches.
-    batches = []
-    one_batch = []
-    for example in examples:
-        one_batch.append(example)
-        if len(one_batch) == batch_size:
-            batches.append(one_batch)
-            one_batch = []
-    if one_batch:
-        # The last batch whose size is less than the config batch_size setting.
-        batches.append(one_batch)

     results = []
     model.eval()
     for batch in batches:
@@ -186,18 +164,11 @@ def predict(model, data, tokenizer, label_map, batch_size=1):

 if __name__ == "__main__":
-    args = parse_args()
     paddle.set_device("gpu" if args.n_gpu else "cpu")

-    args.model_type = args.model_type.lower()
-    model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
-    if args.model_name == 'ernie-tiny':
-        # ErnieTinyTokenizer is special for ernie_tiny pretained model.
-        tokenizer = ppnlp.transformers.ErnieTinyTokenizer.from_pretrained(
-            args.model_name)
-    else:
-        tokenizer = tokenizer_class.from_pretrained(args.model_name)
+    # ErnieTinyTokenizer is specific to the ernie-tiny pretrained model.
+    tokenizer = ppnlp.transformers.ErnieTinyTokenizer.from_pretrained(
+        'ernie-tiny')

     data = [
         ['世界上什么东西最小', '世界上什么东西最小?'],
@@ -206,7 +177,8 @@ if __name__ == "__main__":
     ]
     label_map = {0: 'dissimilar', 1: 'similar'}

-    pretrained_model = model_class.from_pretrained(args.model_name)
+    pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained(
+        "ernie-tiny")
     model = SentenceTransformer(pretrained_model)

     if args.params_path and os.path.isfile(args.params_path):
...
@@ -27,95 +27,28 @@ import paddlenlp as ppnlp
 from model import SentenceTransformer

-MODEL_CLASSES = {
-    "bert": (ppnlp.transformers.BertModel, ppnlp.transformers.BertTokenizer),
-    'ernie': (ppnlp.transformers.ErnieModel, ppnlp.transformers.ErnieTokenizer),
-    'roberta':
-    (ppnlp.transformers.RobertaModel, ppnlp.transformers.RobertaTokenizer),
-}
-
-
-def parse_args():
-    parser = argparse.ArgumentParser()
-    # Required parameters
-    parser.add_argument(
-        "--model_type",
-        default='ernie',
-        required=True,
-        type=str,
-        help="Model type selected in the list: " +
-        ", ".join(MODEL_CLASSES.keys()))
-    parser.add_argument(
-        "--model_name",
-        default='ernie-1.0',
-        required=True,
-        type=str,
-        help="Path to pre-trained model or shortcut name selected in the list: "
-        + ", ".join(
-            sum([
-                list(classes[-1].pretrained_init_configuration.keys())
-                for classes in MODEL_CLASSES.values()
-            ], [])))
-    parser.add_argument(
-        "--save_dir",
-        default='./checkpoint',
-        required=True,
-        type=str,
-        help="The output directory where the model checkpoints will be written.")
-    parser.add_argument(
-        "--max_seq_length",
-        default=128,
-        type=int,
-        help="The maximum total input sequence length after tokenization. "
-        "Sequences longer than this will be truncated, sequences shorter will be padded."
-    )
-    parser.add_argument(
-        "--batch_size",
-        default=32,
-        type=int,
-        help="Batch size per GPU/CPU for training.")
-    parser.add_argument(
-        "--learning_rate",
-        default=5e-5,
-        type=float,
-        help="The initial learning rate for Adam.")
-    parser.add_argument(
-        "--weight_decay",
-        default=0.0,
-        type=float,
-        help="Weight decay if we apply some.")
-    parser.add_argument(
-        "--epochs",
-        default=3,
-        type=int,
-        help="Total number of training epochs to perform.")
-    parser.add_argument(
-        "--warmup_proption",
-        default=0.0,
-        type=float,
-        help="Linear warmup proption over the training process.")
-    parser.add_argument(
-        "--init_from_ckpt",
-        type=str,
-        default=None,
-        help="The path of checkpoint to be loaded.")
-    parser.add_argument(
-        "--seed", type=int, default=1000, help="random seed for initialization")
-    parser.add_argument(
-        "--n_gpu",
-        type=int,
-        default=1,
-        help="Number of GPUs to use, 0 for CPU.")
-    args = parser.parse_args()
-    return args
-
-
-def set_seed(args):
+# yapf: disable
+parser = argparse.ArgumentParser()
+parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.")
+parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. "
+    "Sequences longer than this will be truncated, sequences shorter will be padded.")
+parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.")
+parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
+parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.")
+parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs to perform.")
+parser.add_argument("--warmup_proption", default=0.0, type=float, help="Linear warmup proportion over the training process.")
+parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.")
+parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization")
+parser.add_argument("--n_gpu", type=int, default=1, help="Number of GPUs to use, 0 for CPU.")
+args = parser.parse_args()
+# yapf: enable
+
+
+def set_seed(seed):
     """sets random seed"""
-    random.seed(args.seed)
-    np.random.seed(args.seed)
-    paddle.seed(args.seed)
+    random.seed(seed)
+    np.random.seed(seed)
+    paddle.seed(seed)
 @paddle.no_grad()
@@ -135,8 +68,8 @@ def evaluate(model, criterion, metric, data_loader):
     for batch in data_loader:
         query_input_ids, query_segment_ids, title_input_ids, title_segment_ids, labels = batch
         probs = model(
-            query_input_ids,
-            title_input_ids,
+            query_input_ids=query_input_ids,
+            title_input_ids=title_input_ids,
             query_token_type_ids=query_segment_ids,
             title_token_type_ids=title_segment_ids)
         loss = criterion(probs, labels)
@@ -236,24 +169,28 @@ def create_dataloader(dataset,
         return_list=True)

-def do_train(args):
-    set_seed(args)
+def do_train():
+    set_seed(args.seed)
     paddle.set_device("gpu" if args.n_gpu else "cpu")
     world_size = paddle.distributed.get_world_size()
     if world_size > 1:
         paddle.distributed.init_parallel_env()

-    args.model_type = args.model_type.lower()
-    model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
-
     train_dataset, dev_dataset, test_dataset = ppnlp.datasets.LCQMC.get_datasets(
         ['train', 'dev', 'test'])

-    if args.model_name == 'ernie-tiny':
-        # ErnieTinyTokenizer is special for ernie-tiny pretained model.
-        tokenizer = ppnlp.transformers.ErnieTinyTokenizer.from_pretrained(
-            args.model_name)
-    else:
-        tokenizer = tokenizer_class.from_pretrained(args.model_name)
+    # If you want to use a bert/roberta pretrained model,
+    # pretrained_model = ppnlp.transformers.BertModel.from_pretrained('bert-base-chinese')
+    # pretrained_model = ppnlp.transformers.RobertaModel.from_pretrained('roberta-wwm-ext')
+    pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained(
+        'ernie-tiny')
+
+    # If you want to use a bert/roberta pretrained tokenizer,
+    # tokenizer = ppnlp.transformers.BertTokenizer.from_pretrained('bert-base-chinese')
+    # tokenizer = ppnlp.transformers.RobertaTokenizer.from_pretrained('roberta-wwm-ext')
+    # ErnieTinyTokenizer is specific to the ernie-tiny pretrained model.
+    tokenizer = ppnlp.transformers.ErnieTinyTokenizer.from_pretrained(
+        'ernie-tiny')

     trans_func = partial(
         convert_example,
@@ -286,7 +223,6 @@ def do_train(args):
         batchify_fn=batchify_fn,
         trans_fn=trans_func)

-    pretrained_model = model_class.from_pretrained(args.model_name)
     model = SentenceTransformer(pretrained_model)

     if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt):
@@ -326,8 +262,8 @@ def do_train(args):
     for step, batch in enumerate(train_data_loader, start=1):
         query_input_ids, query_segment_ids, title_input_ids, title_segment_ids, labels = batch
         probs = model(
-            query_input_ids,
-            title_input_ids,
+            query_input_ids=query_input_ids,
+            title_input_ids=title_input_ids,
             query_token_type_ids=query_segment_ids,
             title_token_type_ids=title_segment_ids)
         loss = criterion(probs, labels)
@@ -361,8 +297,7 @@ def do_train(args):

 if __name__ == "__main__":
-    args = parse_args()
     if args.n_gpu > 1:
-        paddle.distributed.spawn(do_train, args=(args, ), nprocs=args.n_gpu)
+        paddle.distributed.spawn(do_train, nprocs=args.n_gpu)
     else:
-        do_train(args)
+        do_train()
@@ -38,7 +38,7 @@ The SimNet framework is widely used across Baidu products, mainly including BOW, CNN, RNN, MM
 * PaddleNLP installation
 ```shell
-pip install paddlenlp
+pip install paddlenlp>=2.0rc0
 ```
 * Environment dependencies
...
...@@ -16,23 +16,24 @@ from functools import partial ...@@ -16,23 +16,24 @@ from functools import partial
import argparse import argparse
import paddle import paddle
import paddlenlp as ppnlp
import paddle.nn.functional as F import paddle.nn.functional as F
import paddlenlp as ppnlp
from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab
from utils import load_vocab, generate_batch, preprocess_prediction_data from utils import preprocess_prediction_data
# yapf: disable # yapf: disable
parser = argparse.ArgumentParser(__doc__) parser = argparse.ArgumentParser(__doc__)
parser.add_argument("--use_gpu", type=eval, default=False, help="Whether use GPU for training, input should be True or False") parser.add_argument("--use_gpu", type=eval, default=False, help="Whether use GPU for training, input should be True or False")
parser.add_argument("--batch_size", type=int, default=64, help="Total examples' number of a batch for training.") parser.add_argument("--batch_size", type=int, default=64, help="Total examples' number of a batch for training.")
parser.add_argument("--vocab_path", type=str, default="./data/term2id.dict", help="The path to vocabulary.") parser.add_argument("--vocab_path", type=str, default="./simnet_word_dict.txt", help="The path to vocabulary.")
parser.add_argument('--network', type=str, default="lstm", help="Which network you would like to choose bow, cnn, lstm or gru ?") parser.add_argument('--network', type=str, default="lstm", help="Which network you would like to choose bow, cnn, lstm or gru ?")
parser.add_argument("--params_path", type=str, default='./chekpoints/final.pdparams', help="The path of model parameter to be loaded.") parser.add_argument("--params_path", type=str, default='./checkpoints/final.pdparams', help="The path of model parameter to be loaded.")
args = parser.parse_args() args = parser.parse_args()
# yapf: enable # yapf: enable
def predict(model, data, label_map, collate_fn, batch_size=1, pad_token_id=0): def predict(model, data, label_map, batch_size=1, pad_token_id=0):
""" """
Predicts the data labels. Predicts the data labels.
...@@ -41,37 +42,35 @@ def predict(model, data, label_map, collate_fn, batch_size=1, pad_token_id=0): ...@@ -41,37 +42,35 @@ def predict(model, data, label_map, collate_fn, batch_size=1, pad_token_id=0):
data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object.
A Example object contains `text`(word_ids) and `seq_len`(sequence length). A Example object contains `text`(word_ids) and `seq_len`(sequence length).
label_map(obj:`dict`): The label id (key) to label str (value) map. label_map(obj:`dict`): The label id (key) to label str (value) map.
collate_fn(obj: `callable`): function to generate mini-batch data by merging
the sample list.
batch_size(obj:`int`, defaults to 1): The number of batch. batch_size(obj:`int`, defaults to 1): The number of batch.
pad_token_id(obj:`int`, optinal, defaults to 0) : The pad token index. pad_token_id(obj:`int`, optional, defaults to 0): The pad token index.
Returns: Returns:
results(obj:`dict`): All the predictions labels. results(obj:`dict`): All the predictions labels.
""" """
# Seperates data into some batches. # Seperates data into some batches.
batches = [] batches = [
one_batch = [] data[idx:idx + batch_size] for idx in range(0, len(data), batch_size)
for example in data: ]
one_batch.append(example)
if len(one_batch) == batch_size: batchify_fn = lambda samples, fn=Tuple(
batches.append(one_batch) Pad(axis=0, pad_val=pad_token_id), # query_ids
one_batch = [] Pad(axis=0, pad_val=pad_token_id), # title_ids
if one_batch: Stack(dtype="int64"), # query_seq_lens
# The last batch whose size is less than the config batch_size setting. Stack(dtype="int64"), # title_seq_lens
batches.append(one_batch) ): [data for data in fn(samples)]
results = [] results = []
model.eval() model.eval()
for batch in batches: for batch in batches:
queries, titles, query_seq_lens, title_seq_lens = collate_fn( query_ids, title_ids, query_seq_lens, title_seq_lens = batchify_fn(
batch, pad_token_id=pad_token_id, return_label=False) batch)
queries = paddle.to_tensor(queries) query_ids = paddle.to_tensor(query_ids)
titles = paddle.to_tensor(titles) title_ids = paddle.to_tensor(title_ids)
query_seq_lens = paddle.to_tensor(query_seq_lens) query_seq_lens = paddle.to_tensor(query_seq_lens)
title_seq_lens = paddle.to_tensor(title_seq_lens) title_seq_lens = paddle.to_tensor(title_seq_lens)
logits = model(queries, titles, query_seq_lens, title_seq_lens) logits = model(query_ids, title_ids, query_seq_lens, title_seq_lens)
probs = F.softmax(logits, axis=1) probs = F.softmax(logits, axis=1)
idx = paddle.argmax(probs, axis=1).numpy() idx = paddle.argmax(probs, axis=1).numpy()
idx = idx.tolist() idx = idx.tolist()
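The `Tuple(Pad(...), Pad(...), Stack(...), Stack(...))` collation introduced above replaces the hand-written padding in `generate_batch`. A minimal sketch of what it does to a toy batch (the word ids below are made up for illustration):

```python
from paddlenlp.data import Pad, Stack, Tuple

# Each sample: (query_ids, title_ids, query_seq_len, title_seq_len)
samples = [
    ([2, 31, 8], [5, 9], 3, 2),
    ([4, 7], [6, 12, 13, 20], 2, 4),
]
batchify_fn = Tuple(
    Pad(axis=0, pad_val=0),  # pad query_ids to the max length in the batch
    Pad(axis=0, pad_val=0),  # pad title_ids to the max length in the batch
    Stack(dtype="int64"),    # stack query_seq_lens into one array
    Stack(dtype="int64"),    # stack title_seq_lens into one array
)
query_ids, title_ids, query_seq_lens, title_seq_lens = batchify_fn(samples)
print(query_ids)  # [[ 2 31  8] [ 4  7  0]] -- the shorter query is padded with 0
```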
...@@ -83,7 +82,9 @@ def predict(model, data, label_map, collate_fn, batch_size=1, pad_token_id=0): ...@@ -83,7 +82,9 @@ def predict(model, data, label_map, collate_fn, batch_size=1, pad_token_id=0):
if __name__ == "__main__": if __name__ == "__main__":
paddle.set_device("gpu") if args.use_gpu else paddle.set_device("cpu") paddle.set_device("gpu") if args.use_gpu else paddle.set_device("cpu")
# Loads vocab. # Loads vocab.
vocab = load_vocab(args.vocab_path) vocab = Vocab.load_vocabulary(
args.vocab_path, unk_token='[UNK]', pad_token='[PAD]')
tokenizer = JiebaTokenizer(vocab)
label_map = {0: 'dissimilar', 1: 'similar'} label_map = {0: 'dissimilar', 1: 'similar'}
# Constructs the network. # Constructs the network.
...@@ -101,13 +102,13 @@ if __name__ == "__main__": ...@@ -101,13 +102,13 @@ if __name__ == "__main__":
['光眼睛大就好看吗', '眼睛好看吗?'], ['光眼睛大就好看吗', '眼睛好看吗?'],
['小蝌蚪找妈妈怎么样', '小蝌蚪找妈妈是谁画的'], ['小蝌蚪找妈妈怎么样', '小蝌蚪找妈妈是谁画的'],
] ]
examples = preprocess_prediction_data(data, vocab) examples = preprocess_prediction_data(data, tokenizer)
results = predict( results = predict(
model, model,
examples, examples,
label_map=label_map, label_map=label_map,
batch_size=args.batch_size, batch_size=args.batch_size,
collate_fn=generate_batch) pad_token_id=vocab.token_to_idx.get('[PAD]', 0))
for idx, text in enumerate(data): for idx, text in enumerate(data):
print('Data: {} \t Label: {}'.format(text, results[idx])) print('Data: {} \t Label: {}'.format(text, results[idx]))
...@@ -20,16 +20,17 @@ import time ...@@ -20,16 +20,17 @@ import time
import paddle import paddle
import paddlenlp as ppnlp import paddlenlp as ppnlp
from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab
from paddlenlp.datasets import LCQMC from paddlenlp.datasets import LCQMC
from utils import load_vocab, generate_batch, convert_example from utils import convert_example
# yapf: disable # yapf: disable
parser = argparse.ArgumentParser(__doc__) parser = argparse.ArgumentParser(__doc__)
parser.add_argument("--epochs", type=int, default=10, help="Number of epoches for training.") parser.add_argument("--epochs", type=int, default=10, help="Number of epoches for training.")
parser.add_argument('--use_gpu', type=eval, default=False, help="Whether use GPU for training, input should be True or False") parser.add_argument('--use_gpu', type=eval, default=False, help="Whether use GPU for training, input should be True or False")
parser.add_argument("--lr", type=float, default=5e-4, help="Learning rate used to train.") parser.add_argument("--lr", type=float, default=5e-4, help="Learning rate used to train.")
parser.add_argument("--save_dir", type=str, default='chekpoints/', help="Directory to save model checkpoint") parser.add_argument("--save_dir", type=str, default='checkpoints/', help="Directory to save model checkpoint")
parser.add_argument("--batch_size", type=int, default=64, help="Total examples' number of a batch for training.") parser.add_argument("--batch_size", type=int, default=64, help="Total examples' number of a batch for training.")
parser.add_argument("--vocab_path", type=str, default="./simnet_word_dict.txt", help="The directory to dataset.") parser.add_argument("--vocab_path", type=str, default="./simnet_word_dict.txt", help="The directory to dataset.")
parser.add_argument('--network', type=str, default="lstm", help="Which network would you like to choose: bow, cnn, lstm or gru?") parser.add_argument('--network', type=str, default="lstm", help="Which network would you like to choose: bow, cnn, lstm or gru?")
...@@ -43,16 +44,19 @@ def create_dataloader(dataset, ...@@ -43,16 +44,19 @@ def create_dataloader(dataset,
mode='train', mode='train',
batch_size=1, batch_size=1,
use_gpu=False, use_gpu=False,
pad_token_id=0): batchify_fn=None):
""" """
Creates dataloader. Creates dataloader.
Args: Args:
dataset(obj:`paddle.io.Dataset`): Dataset instance. dataset(obj:`paddle.io.Dataset`): Dataset instance.
trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc.
mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly. mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly.
batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch.
use_gpu(obj:`bool`, optional, defaults to obj:`False`): Whether to use gpu to run. use_gpu(obj:`bool`, optional, defaults to obj:`False`): Whether to use gpu to run.
pad_token_id(obj:`int`, optional, defaults to 0): The pad token index. batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging
the sample list. If `None`, it only stacks each field of the samples along axis
0 (same as :attr:`np.stack(..., axis=0)`).
Returns: Returns:
dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches.
...@@ -71,7 +75,7 @@ def create_dataloader(dataset, ...@@ -71,7 +75,7 @@ def create_dataloader(dataset,
dataset, dataset,
batch_sampler=sampler, batch_sampler=sampler,
return_list=True, return_list=True,
collate_fn=lambda batch: generate_batch(batch, pad_token_id=pad_token_id)) collate_fn=batchify_fn)
return dataloader return dataloader
...@@ -82,11 +86,11 @@ if __name__ == "__main__": ...@@ -82,11 +86,11 @@ if __name__ == "__main__":
if not os.path.exists(args.vocab_path): if not os.path.exists(args.vocab_path):
raise RuntimeError('The vocab_path can not be found in the path %s' % raise RuntimeError('The vocab_path can not be found in the path %s' %
args.vocab_path) args.vocab_path)
vocab = load_vocab(args.vocab_path) vocab = Vocab.load_vocabulary(
args.vocab_path, unk_token='[UNK]', pad_token='[PAD]')
# Loads dataset. # Loads dataset.
train_ds, dev_dataset, test_ds = LCQMC.get_datasets( train_ds, dev_ds, test_ds = LCQMC.get_datasets(['train', 'dev', 'test'])
['train', 'dev', 'test'])
# Constructs the newtork. # Constructs the newtork.
label_list = train_ds.get_labels() label_list = train_ds.get_labels()
...@@ -95,18 +99,41 @@ if __name__ == "__main__": ...@@ -95,18 +99,41 @@ if __name__ == "__main__":
vocab_size=len(vocab), vocab_size=len(vocab),
num_classes=len(label_list)) num_classes=len(label_list))
model = paddle.Model(model) model = paddle.Model(model)
# Dumps the vocab tokens in index order so the file can be reloaded later.
with open("./new_simnet_word_dict.txt", 'w', encoding='utf8') as new_vocab_file:
    for token in vocab.token_to_idx:
        new_vocab_file.write(token + "\n")
# Reads data and generates mini-batches. # Reads data and generates mini-batches.
trans_fn = partial(convert_example, vocab=vocab, is_test=False) batchify_fn = lambda samples, fn=Tuple(
Pad(axis=0, pad_val=vocab.token_to_idx.get('[PAD]', 0)), # query_ids
Pad(axis=0, pad_val=vocab.token_to_idx.get('[PAD]', 0)), # title_ids
Stack(dtype="int64"), # query_seq_lens
Stack(dtype="int64"), # title_seq_lens
Stack(dtype="int64") # label
): [data for data in fn(samples)]
tokenizer = ppnlp.data.JiebaTokenizer(vocab)
trans_fn = partial(convert_example, tokenizer=tokenizer, is_test=False)
train_loader = create_dataloader( train_loader = create_dataloader(
train_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode='train') train_ds,
trans_fn=trans_fn,
batch_size=args.batch_size,
mode='train',
use_gpu=args.use_gpu,
batchify_fn=batchify_fn)
dev_loader = create_dataloader( dev_loader = create_dataloader(
dev_dataset, dev_ds,
trans_fn=trans_fn, trans_fn=trans_fn,
batch_size=args.batch_size, batch_size=args.batch_size,
mode='validation') mode='validation',
use_gpu=args.use_gpu,
batchify_fn=batchify_fn)
test_loader = create_dataloader( test_loader = create_dataloader(
test_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode='test') test_ds,
trans_fn=trans_fn,
batch_size=args.batch_size,
mode='test',
use_gpu=args.use_gpu,
batchify_fn=batchify_fn)
optimizer = paddle.optimizer.Adam( optimizer = paddle.optimizer.Adam(
parameters=model.parameters(), learning_rate=args.lr) parameters=model.parameters(), learning_rate=args.lr)
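The diff is truncated here. With the high-level `paddle.Model` API used above, the rest of `train.py` presumably wires the optimizer, a loss and a metric into `prepare` and launches training with `fit`, roughly as follows (a sketch under that assumption, not the literal file content):

```python
model.prepare(optimizer,
              paddle.nn.CrossEntropyLoss(),
              paddle.metric.Accuracy())
model.fit(train_loader,
          dev_loader,
          epochs=args.epochs,
          save_dir=args.save_dir)
```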
......
...@@ -11,105 +11,17 @@ ...@@ -11,105 +11,17 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import jieba
import numpy as np import numpy as np
def load_vocab(vocab_file): def convert_example(example, tokenizer, is_test=False):
"""Loads a vocabulary file into a dictionary."""
vocab = {}
with open(vocab_file, "r", encoding="utf-8") as reader:
tokens = reader.readlines()
for index, token in enumerate(tokens):
token = token.rstrip("\n").split("\t")[0]
vocab[token] = index
return vocab
def convert_ids_to_tokens(wids, inversed_vocab):
""" Converts a token string (or a sequence of tokens) in a single integer id
(or a sequence of ids), using the vocabulary.
"""
tokens = []
for wid in wids:
wstr = inversed_vocab.get(wid, None)
if wstr:
tokens.append(wstr)
return tokens
def convert_tokens_to_ids(tokens, vocab):
""" Converts a token id (or a sequence of id) in a token string
(or a sequence of tokens), using the vocabulary.
"""
ids = []
unk_id = vocab.get('[UNK]', None)
for token in tokens:
wid = vocab.get(token, unk_id)
if wid:
ids.append(wid)
return ids
def pad_texts_to_max_seq_len(texts, max_seq_len, pad_token_id=0):
"""
Pads the texts to the max sequence length if a text is shorter than it.
Otherwise it truncates the text.
Args:
texts(obj:`list`): Texts, each of which contains a sequence of word ids.
max_seq_len(obj:`int`): Max sequence length.
pad_token_id(obj:`int`, optional, defaults to 0): The pad token index.
"""
for index, text in enumerate(texts):
seq_len = len(text)
if seq_len < max_seq_len:
padded_tokens = [pad_token_id for _ in range(max_seq_len - seq_len)]
new_text = text + padded_tokens
texts[index] = new_text
elif seq_len > max_seq_len:
new_text = text[:max_seq_len]
texts[index] = new_text
def generate_batch(batch, pad_token_id=0, return_label=True):
"""
Generates a batch whose text will be padded to the max sequence length in the batch.
Args:
batch(obj:`List[Example]`): One batch, which contains texts, labels and the true sequence lengths.
pad_token_id(obj:`int`, optional, defaults to 0): The pad token index.
Returns:
batch(:obj:`Tuple[list]`): The batch data which contains texts, seq_lens and labels.
"""
queries = [entry[0] for entry in batch]
titles = [entry[1] for entry in batch]
query_seq_lens = [entry[2] for entry in batch]
title_seq_lens = [entry[3] for entry in batch]
query_batch_max_seq_len = max(query_seq_lens)
pad_texts_to_max_seq_len(queries, query_batch_max_seq_len, pad_token_id)
title_batch_max_seq_len = max(title_seq_lens)
pad_texts_to_max_seq_len(titles, title_batch_max_seq_len, pad_token_id)
if return_label:
labels = [entry[-1] for entry in batch]
return queries, titles, query_seq_lens, title_seq_lens, labels
else:
return queries, titles, query_seq_lens, title_seq_lens
def convert_example(example, vocab, unk_token_id=1, is_test=False):
""" """
Builds model inputs from a sequence for sequence classification tasks. Builds model inputs from a sequence for sequence classification tasks.
It uses `jieba.cut` to tokenize the text. It uses `jieba.cut` to tokenize the text.
Args: Args:
example(obj:`list[str]`): List of input data, containing text and label if it has a label. example(obj:`list[str]`): List of input data, containing text and label if it has a label.
vocab(obj:`dict`): The vocabulary. tokenizer(obj: paddlenlp.data.JiebaTokenizer): It uses jieba to cut the Chinese string.
unk_token_id(obj:`int`, defaults to 1): The unknown token id.
is_test(obj:`bool`, optional, defaults to `False`): Whether the example contains a label or not. is_test(obj:`bool`, optional, defaults to `False`): Whether the example contains a label or not.
Returns: Returns:
...@@ -121,13 +33,10 @@ def convert_example(example, vocab, unk_token_id=1, is_test=False): ...@@ -121,13 +33,10 @@ def convert_example(example, vocab, unk_token_id=1, is_test=False):
""" """
query, title = example[0], example[1] query, title = example[0], example[1]
query_tokens = jieba.lcut(query) query_ids = np.array(tokenizer.encode(query), dtype="int64")
title_tokens = jieba.lcut(title) query_seq_len = np.array(len(query_ids), dtype="int64")
title_ids = np.array(tokenizer.encode(title), dtype="int64")
query_ids = convert_tokens_to_ids(query_tokens, vocab) title_seq_len = np.array(len(title_ids), dtype="int64")
query_seq_len = len(query_ids)
title_ids = convert_tokens_to_ids(title_tokens, vocab)
title_seq_len = len(title_ids)
if not is_test: if not is_test:
label = np.array(example[-1], dtype="int64") label = np.array(example[-1], dtype="int64")
...@@ -136,7 +45,7 @@ def convert_example(example, vocab, unk_token_id=1, is_test=False): ...@@ -136,7 +45,7 @@ def convert_example(example, vocab, unk_token_id=1, is_test=False):
return query_ids, title_ids, query_seq_len, title_seq_len return query_ids, title_ids, query_seq_len, title_seq_len
def preprocess_prediction_data(data, vocab): def preprocess_prediction_data(data, tokenizer):
""" """
It processes the prediction data into the format used in training. It processes the prediction data into the format used in training.
...@@ -144,6 +53,7 @@ def preprocess_prediction_data(data, vocab): ...@@ -144,6 +53,7 @@ def preprocess_prediction_data(data, vocab):
data (obj:`List[List[str, str]]`): data (obj:`List[List[str, str]]`):
The prediction data, each element of which is a text pair. The prediction data, each element of which is a text pair.
Each text will be tokenized by the jieba.lcut() function. Each text will be tokenized by the jieba.lcut() function.
tokenizer(obj: paddlenlp.data.JiebaTokenizer): It uses jieba to cut the Chinese string.
Returns: Returns:
examples (obj:`list`): The processed data, each element of which examples (obj:`list`): The processed data, each element of which
...@@ -157,9 +67,7 @@ def preprocess_prediction_data(data, vocab): ...@@ -157,9 +67,7 @@ def preprocess_prediction_data(data, vocab):
""" """
examples = [] examples = []
for query, title in data: for query, title in data:
query_tokens = jieba.lcut(query) query_ids = tokenizer.encode(query)
title_tokens = jieba.lcut(title) title_ids = tokenizer.encode(title)
query_ids = convert_tokens_to_ids(query_tokens, vocab)
title_ids = convert_tokens_to_ids(title_tokens, vocab)
examples.append([query_ids, title_ids, len(query_ids), len(title_ids)]) examples.append([query_ids, title_ids, len(query_ids), len(title_ids)])
return examples return examples
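Put together, the refactored utilities are used like this (a hedged end-to-end sketch; the vocab path follows the defaults above, and the text pair is taken from the prediction example earlier in this diff):

```python
from paddlenlp.data import JiebaTokenizer, Vocab

vocab = Vocab.load_vocabulary(
    './simnet_word_dict.txt', unk_token='[UNK]', pad_token='[PAD]')
tokenizer = JiebaTokenizer(vocab)

data = [['光眼睛大就好看吗', '眼睛好看吗?']]
examples = preprocess_prediction_data(data, tokenizer)
# Each example: [query_ids, title_ids, query_seq_len, title_seq_len]
```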
...@@ -12,7 +12,7 @@ ...@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
__version__ = '2.0.0b3' __version__ = '2.0.0rc1'
from . import data from . import data
from . import datasets from . import datasets
......
...@@ -112,13 +112,7 @@ class BoWModel(nn.Layer): ...@@ -112,13 +112,7 @@ class BoWModel(nn.Layer):
a word embedding. Then, we encode these representations with a `BoWEncoder`. a word embedding. Then, we encode these representations with a `BoWEncoder`.
Lastly, we take the output of the encoder to create a final representation, Lastly, we take the output of the encoder to create a final representation,
which is passed through some feed-forward layers to output the logits (`output_layer`). which is passed through some feed-forward layers to output the logits (`output_layer`).
Args:
vocab_size (obj:`int`): The vocabulary size.
emb_dim (obj:`int`, optional, defaults to 128): The embedding dimension.
padding_idx (obj:`int`, optional, defaults to 0): The pad token index.
hidden_size (obj:`int`, optional, defaults to 128): The first full-connected layer hidden size.
fc_hidden_size (obj:`int`, optional, defaults to 96): The second full-connected layer hidden size.
num_classes (obj:`int`): All the labels that the data has.
""" """
def __init__(self, def __init__(self,
...@@ -331,7 +325,7 @@ class SelfAttention(nn.Layer): ...@@ -331,7 +325,7 @@ class SelfAttention(nn.Layer):
Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification (Zhou et al., 2016). Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification (Zhou et al., 2016).
ref: https://www.aclweb.org/anthology/P16-2034/ ref: https://www.aclweb.org/anthology/P16-2034/
Args: Args:
hidden_size (obj:`int`): The number of expected features in the input x. hidden_size (int): The number of expected features in the input x.
""" """
def __init__(self, hidden_size): def __init__(self, hidden_size):
...@@ -343,9 +337,10 @@ class SelfAttention(nn.Layer): ...@@ -343,9 +337,10 @@ class SelfAttention(nn.Layer):
def forward(self, input, mask=None): def forward(self, input, mask=None):
""" """
Args: Args:
input (obj: `paddle.Tensor`) of shape (batch, seq_len, input_size): Tensor containing the features of the input sequence. input (paddle.Tensor) of shape (batch, seq_len, input_size): Tensor containing the features of the input sequence.
mask (obj: `paddle.Tensor`, optional, defaults to `None`) of shape (batch, seq_len) : mask (paddle.Tensor) of shape (batch, seq_len):
Tensor is a bool tensor, each element of which identifies whether the input word id is a pad token or not. Tensor is a bool tensor, each element of which identifies whether the input word id is a pad token or not.
Defaults to `None`.
""" """
forward_input, backward_input = paddle.chunk(input, chunks=2, axis=2) forward_input, backward_input = paddle.chunk(input, chunks=2, axis=2)
# elementwise-sum forward_x and backward_x # elementwise-sum forward_x and backward_x
...@@ -378,7 +373,7 @@ class SelfInteractiveAttention(nn.Layer): ...@@ -378,7 +373,7 @@ class SelfInteractiveAttention(nn.Layer):
A close implementation of attention network of NAACL 2016 paper, Hierarchical Attention Networks for Document Classification (Yang et al., 2016). A close implementation of attention network of NAACL 2016 paper, Hierarchical Attention Networks for Document Classification (Yang et al., 2016).
ref: https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf ref: https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf
Args: Args:
hidden_size (obj:`int`): The number of expected features in the input x. hidden_size (int): The number of expected features in the input x.
""" """
def __init__(self, hidden_size): def __init__(self, hidden_size):
...@@ -393,9 +388,10 @@ class SelfInteractiveAttention(nn.Layer): ...@@ -393,9 +388,10 @@ class SelfInteractiveAttention(nn.Layer):
def forward(self, input, mask=None): def forward(self, input, mask=None):
""" """
Args: Args:
input (obj: `paddle.Tensor`) of shape (batch, seq_len, input_size): Tensor containing the features of the input sequence. input (paddle.Tensor) of shape (batch, seq_len, input_size): Tensor containing the features of the input sequence.
mask (obj: `paddle.Tensor`, optional, defaults to `None`) of shape (batch, seq_len) : mask (paddle.Tensor) of shape (batch, seq_len):
Tensor is a bool tensor, each element of which identifies whether the input word id is a pad token or not. Tensor is a bool tensor, each element of which identifies whether the input word id is a pad token or not.
Defaults to `None`.
""" """
weight = self.input_weight.tile( weight = self.input_weight.tile(
repeat_times=(paddle.shape(input)[0], 1, 1)) repeat_times=(paddle.shape(input)[0], 1, 1))
...@@ -434,11 +430,7 @@ class CNNModel(nn.Layer): ...@@ -434,11 +430,7 @@ class CNNModel(nn.Layer):
outputs from the convolution layer and outputs the max. outputs from the convolution layer and outputs the max.
Lastly, we take the output of the encoder to create a final representation, Lastly, we take the output of the encoder to create a final representation,
which is passed through some feed-forward layers to output the logits (`output_layer`). which is passed through some feed-forward layers to output the logits (`output_layer`).
Args:
vocab_size (obj:`int`): The vocabulary size.
emb_dim (obj:`int`, optional, defaults to 128): The embedding dimension.
padding_idx (obj:`int`, optional, defaults to 0): The pad token index.
num_classes (obj:`int`): All the labels that the data has.
""" """
def __init__(self, def __init__(self,
...@@ -483,11 +475,7 @@ class TextCNNModel(nn.Layer): ...@@ -483,11 +475,7 @@ class TextCNNModel(nn.Layer):
outputs from the convolution layer and outputs the max. outputs from the convolution layer and outputs the max.
Lastly, we take the output of the encoder to create a final representation, Lastly, we take the output of the encoder to create a final representation,
which is passed through some feed-forward layers to output the logits (`output_layer`). which is passed through some feed-forward layers to output the logits (`output_layer`).
Args:
vocab_size (obj:`int`): The vocabulary size.
emb_dim (obj:`int`, optional, defaults to 128): The embedding dimension.
padding_idx (obj:`int`, optional, defaults to 0): The pad token index.
num_classes (obj:`int`): All the labels that the data has.
""" """
def __init__(self, def __init__(self,
......
...@@ -31,8 +31,7 @@ class BoWEncoder(nn.Layer): ...@@ -31,8 +31,7 @@ class BoWEncoder(nn.Layer):
and the output is of shape `(batch_size, emb_dim)`. and the output is of shape `(batch_size, emb_dim)`.
Args: Args:
# TODO: unify the docstring style with PaddlePaddle. emb_dim(int): It is the input dimension to the encoder.
emb_dim(obj:`int`, required): It is the input dimension to the encoder.
""" """
def __init__(self, emb_dim): def __init__(self, emb_dim):
...@@ -59,12 +58,12 @@ class BoWEncoder(nn.Layer): ...@@ -59,12 +58,12 @@ class BoWEncoder(nn.Layer):
It simply sums the embeddings of a sequence across the time dimension. It simply sums the embeddings of a sequence across the time dimension.
Args: Args:
inputs (obj: `paddle.Tensor`): Shape as `(batch_size, num_tokens, emb_dim)` inputs (paddle.Tensor): Shape as `(batch_size, num_tokens, emb_dim)`
mask (obj: `paddle.Tensor`, optional, defaults to `None`): Shape same as `inputs`. Each element identifies whether the token is a padding token or not. mask (obj: `paddle.Tensor`, optional, defaults to `None`): Shape same as `inputs`. Each element identifies whether the token is a padding token or not.
If True, not padding token. If False, padding token. If True, not padding token. If False, padding token.
Returns: Returns:
summed (obj: `paddle.Tensor`): Shape of `(batch_size, emb_dim)`. The result vector of BagOfEmbedding. summed (paddle.Tensor): Shape of `(batch_size, emb_dim)`. The result vector of BagOfEmbedding.
""" """
if mask is not None: if mask is not None:
...@@ -97,18 +96,18 @@ class CNNEncoder(nn.Layer): ...@@ -97,18 +96,18 @@ class CNNEncoder(nn.Layer):
ref: https://arxiv.org/abs/1510.03820 ref: https://arxiv.org/abs/1510.03820
Args: Args:
emb_dim(object:`int`, required): emb_dim(int):
This is the input dimension to the encoder. This is the input dimension to the encoder.
num_filter(object:`int`, required): num_filter(int):
This is the output dim for each convolutional layer, which is the number of "filters" This is the output dim for each convolutional layer, which is the number of "filters"
learned by that layer. learned by that layer.
ngram_filter_sizes(object: `Tuple[int]`, optional, default to `(2, 3, 4, 5)`): ngram_filter_sizes(Tuple[int]):
This specifies both the number of convolutional layers we will create and their sizes. The This specifies both the number of convolutional layers we will create and their sizes. The
default of `(2, 3, 4, 5)` will have four convolutional layers, corresponding to encoding default of `(2, 3, 4, 5)` will have four convolutional layers, corresponding to encoding
ngrams of size 2 to 5 with some number of filters. ngrams of size 2 to 5 with some number of filters.
conv_layer_activation(object: `str`, optional, default to `tanh`): conv_layer_activation(str):
Activation to use after the convolution layers. Activation to use after the convolution layers.
output_dim(object: `int`, optional, default to `None`): output_dim(int):
After doing convolutions and pooling, we'll project the collected features into a vector of After doing convolutions and pooling, we'll project the collected features into a vector of
this size. If this value is `None`, we will just return the result of the max pooling, this size. If this value is `None`, we will just return the result of the max pooling,
giving an output of shape `len(ngram_filter_sizes) * num_filter`. giving an output of shape `len(ngram_filter_sizes) * num_filter`.
...@@ -165,13 +164,13 @@ class CNNEncoder(nn.Layer): ...@@ -165,13 +164,13 @@ class CNNEncoder(nn.Layer):
The combination of multiple convolution layers and max pooling layers. The combination of multiple convolution layers and max pooling layers.
Args: Args:
inputs (obj: `paddle.Tensor`, required): Shape as `(batch_size, num_tokens, emb_dim)` inputs (paddle.Tensor): Shape as `(batch_size, num_tokens, emb_dim)`
mask (obj: `paddle.Tensor`, optional, defaults to `None`): Shape same as `inputs`. mask (obj: `paddle.Tensor`, optional, defaults to `None`): Shape same as `inputs`.
Each element identifies whether the token is a padding token or not. Each element identifies whether the token is a padding token or not.
If True, not padding token. If False, padding token. If True, not padding token. If False, padding token.
Returns: Returns:
result (obj: `paddle.Tensor`): If output_dim is None, the result shape result (paddle.Tensor): If output_dim is None, the result shape
is of `(batch_size, output_dim)`; if not, the result shape is of `(batch_size, output_dim)`; if not, the result shape
is of `(batch_size, len(ngram_filter_sizes) * num_filter)`. is of `(batch_size, len(ngram_filter_sizes) * num_filter)`.
...@@ -188,8 +187,8 @@ class CNNEncoder(nn.Layer): ...@@ -188,8 +187,8 @@ class CNNEncoder(nn.Layer):
self._activation(conv(inputs)).squeeze(3) for conv in self.convs self._activation(conv(inputs)).squeeze(3) for conv in self.convs
] ]
maxpool_out = [ maxpool_out = [
F.max_pool1d( F.adaptive_max_pool1d(
t, kernel_size=t.shape[2]).squeeze(2) for t in convs_out t, output_size=1).squeeze(2) for t in convs_out
] ]
result = paddle.concat(maxpool_out, axis=1) result = paddle.concat(maxpool_out, axis=1)
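The switch from `F.max_pool1d(t, kernel_size=t.shape[2])` to `F.adaptive_max_pool1d(t, output_size=1)` preserves behavior: both take a global max over the time axis, but the adaptive form does not need the kernel size read from the runtime tensor shape. A quick equivalence check (shapes chosen arbitrarily):

```python
import paddle
import paddle.nn.functional as F

t = paddle.rand([4, 64, 17])  # (batch_size, num_filter, seq_len)
a = F.adaptive_max_pool1d(t, output_size=1).squeeze(2)
b = F.max_pool1d(t, kernel_size=t.shape[2]).squeeze(2)
print(a.shape)                     # [4, 64]
print(paddle.all(a == b).numpy())  # True: both are a max over seq_len
```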
...@@ -221,8 +220,8 @@ class GRUEncoder(nn.Layer): ...@@ -221,8 +220,8 @@ class GRUEncoder(nn.Layer):
E.g., setting num_layers=2 would mean stacking two GRUs together to form a stacked GRU, E.g., setting num_layers=2 would mean stacking two GRUs together to form a stacked GRU,
with the second GRU taking in outputs of the first GRU and computing the final results. with the second GRU taking in outputs of the first GRU and computing the final results.
direction (obj:`str`, optional, defaults to obj:`forward`): The direction of the network. direction (obj:`str`, optional, defaults to obj:`forward`): The direction of the network.
It can be "forward" and "bidirect" (it means bidirection network). It can be `forward` and `bidirect` (it means bidirection network).
When "bidirect", the way to merge outputs of forward and backward is concatenating. If `biderect`, it is a birectional GRU, and returns the concat output from both directions.
dropout (obj:`float`, optional, defaults to 0.0): If non-zero, introduces a Dropout layer dropout (obj:`float`, optional, defaults to 0.0): If non-zero, introduces a Dropout layer
on the outputs of each GRU layer except the last layer, with dropout probability equal to dropout. on the outputs of each GRU layer except the last layer, with dropout probability equal to dropout.
pooling_type (obj: `str`, optional, defaults to obj:`None`): If `pooling_type` is None, pooling_type (obj: `str`, optional, defaults to obj:`None`): If `pooling_type` is None,
...@@ -280,11 +279,11 @@ class GRUEncoder(nn.Layer): ...@@ -280,11 +279,11 @@ class GRUEncoder(nn.Layer):
If not, output is of shape `(batch_size, hidden_size)`. If not, output is of shape `(batch_size, hidden_size)`.
Args: Args:
inputs (obj:`Paddle.Tensor`, required): Shape as `(batch_size, num_tokens, input_size)`. inputs (paddle.Tensor): Shape as `(batch_size, num_tokens, input_size)`.
sequence_length (obj:`Paddle.Tensor`, required): Shape as `(batch_size)`. sequence_length (paddle.Tensor): Shape as `(batch_size)`.
Returns: Returns:
last_hidden (obj:`Paddle.Tensor`, required): Shape as `(batch_size, hidden_size)`. last_hidden (paddle.Tensor): Shape as `(batch_size, hidden_size)`.
The hidden state at the last time step for every layer. The hidden state at the last time step for every layer.
""" """
...@@ -305,8 +304,8 @@ class GRUEncoder(nn.Layer): ...@@ -305,8 +304,8 @@ class GRUEncoder(nn.Layer):
else: else:
# We exploit the `encoded_text` (the hidden state at the every time step for last layer) # We exploit the `encoded_text` (the hidden state at the every time step for last layer)
# to create a single vector. We perform pooling on the encoded text. # to create a single vector. We perform pooling on the encoded text.
# If gru is not bidirection, output is shape of `(batch_size, hidden_size)`. # The output shape is `(batch_size, hidden_size*2)` if using a bidirectional GRU,
# If gru is bidirection, then output is shape of `(batch_size, hidden_size*2)`. # otherwise the output shape is `(batch_size, hidden_size)`.
if self._pooling_type == 'sum': if self._pooling_type == 'sum':
output = paddle.sum(encoded_text, axis=1) output = paddle.sum(encoded_text, axis=1)
elif self._pooling_type == 'max': elif self._pooling_type == 'max':
...@@ -338,17 +337,17 @@ class LSTMEncoder(nn.Layer): ...@@ -338,17 +337,17 @@ class LSTMEncoder(nn.Layer):
lstm and backward lstm layer to create a single vector (shape of `(batch_size, hidden_size*2)`). lstm and backward lstm layer to create a single vector (shape of `(batch_size, hidden_size*2)`).
Args: Args:
input_size (obj:`int`, required): The number of expected features in the input (the last dimension). input_size (int): The number of expected features in the input (the last dimension).
hidden_size (obj:`int`, required): The number of features in the hidden state. hidden_size (int): The number of features in the hidden state.
num_layers (obj:`int`, optional, defaults to 1): Number of recurrent layers. num_layers (int): Number of recurrent layers.
E.g., setting num_layers=2 would mean stacking two LSTMs together to form a stacked LSTM, E.g., setting num_layers=2 would mean stacking two LSTMs together to form a stacked LSTM,
with the second LSTM taking in outputs of the first LSTM and computing the final results. with the second LSTM taking in outputs of the first LSTM and computing the final results.
direction (obj:`str`, optional, defaults to obj:`forwrd`): The direction of the network. direction (str): The direction of the network.
It can be "forward" and "bidirect" (it means bidirection network). It can be `forward` or `bidirect` (it means bidirection network).
When "bidirection", the way to merge outputs of forward and backward is concatenating. If `biderect`, it is a birectional LSTM, and returns the concat output from both directions.
dropout (obj:`float`, optional, defaults to 0.0): If non-zero, introduces a Dropout layer dropout (float): If non-zero, introduces a Dropout layer
on the outputs of each LSTM layer except the last layer, with dropout probability equal to dropout. on the outputs of each LSTM layer except the last layer, with dropout probability equal to dropout.
pooling_type (obj: `str`, optional, defaults to obj:`None`): If `pooling_type` is None, pooling_type (str): If `pooling_type` is None,
then the LSTMEncoder will return the hidden state of the last time step at last layer as a single vector. then the LSTMEncoder will return the hidden state of the last time step at last layer as a single vector.
If pooling_type is not None, it must be one of `sum`, `max` and `mean`. Then it will be pooled on If pooling_type is not None, it must be one of `sum`, `max` and `mean`. Then it will be pooled on
the LSTM output (the hidden state of every time step at last layer) to create a single vector. the LSTM output (the hidden state of every time step at last layer) to create a single vector.
...@@ -404,11 +403,11 @@ class LSTMEncoder(nn.Layer): ...@@ -404,11 +403,11 @@ class LSTMEncoder(nn.Layer):
If not, output is of shape `(batch_size, hidden_size)`. If not, output is of shape `(batch_size, hidden_size)`.
Args: Args:
inputs (obj:`Paddle.Tensor`, required): Shape as `(batch_size, num_tokens, input_size)`. inputs (paddle.Tensor): Shape as `(batch_size, num_tokens, input_size)`.
sequence_length (obj:`Paddle.Tensor`, required): Shape as `(batch_size)`. sequence_length (paddle.Tensor): Shape as `(batch_size)`.
Returns: Returns:
last_hidden (obj:`Paddle.Tensor`, required): Shape as `(batch_size, hidden_size)`. last_hidden (paddle.Tensor): Shape as `(batch_size, hidden_size)`.
The hidden state at the last time step for every layer. The hidden state at the last time step for every layer.
""" """
...@@ -429,8 +428,8 @@ class LSTMEncoder(nn.Layer): ...@@ -429,8 +428,8 @@ class LSTMEncoder(nn.Layer):
else: else:
# We exploit the `encoded_text` (the hidden state at the every time step for last layer) # We exploit the `encoded_text` (the hidden state at the every time step for last layer)
# to create a single vector. We perform pooling on the encoded text. # to create a single vector. We perform pooling on the encoded text.
# If lstm is not bidirection, output is shape of `(batch_size, hidden_size)`. # The output shape is `(batch_size, hidden_size*2)` if using a bidirectional LSTM,
# If lstm is bidirection, then output is shape of `(batch_size, hidden_size*2)`. # otherwise the output shape is `(batch_size, hidden_size)`.
if self._pooling_type == 'sum': if self._pooling_type == 'sum':
output = paddle.sum(encoded_text, axis=1) output = paddle.sum(encoded_text, axis=1)
elif self._pooling_type == 'max': elif self._pooling_type == 'max':
...@@ -467,9 +466,9 @@ class RNNEncoder(nn.Layer): ...@@ -467,9 +466,9 @@ class RNNEncoder(nn.Layer):
num_layers (obj:`int`, optional, defaults to 1): Number of recurrent layers. num_layers (obj:`int`, optional, defaults to 1): Number of recurrent layers.
E.g., setting num_layers=2 would mean stacking two RNNs together to form a stacked RNN, E.g., setting num_layers=2 would mean stacking two RNNs together to form a stacked RNN,
with the second RNN taking in outputs of the first RNN and computing the final results. with the second RNN taking in outputs of the first RNN and computing the final results.
direction (obj:`str`, optional, defaults to obj:`forwrd`): The direction of the network. direction (obj:`str`, optional, defaults to obj:`forward`): The direction of the network.
It can be "forward" and "bidirect" (it means bidirection network). It can be "forward" and "bidirect" (it means bidirection network).
When "bidirection", the way to merge outputs of forward and backward is concatenating. If `biderect`, it is a birectional RNN, and returns the concat output from both directions.
dropout (obj:`float`, optional, defaults to 0.0): If non-zero, introduces a Dropout layer dropout (obj:`float`, optional, defaults to 0.0): If non-zero, introduces a Dropout layer
on the outputs of each RNN layer except the last layer, with dropout probability equal to dropout. on the outputs of each RNN layer except the last layer, with dropout probability equal to dropout.
pooling_type (obj: `str`, optional, defaults to obj:`None`): If `pooling_type` is None, pooling_type (obj: `str`, optional, defaults to obj:`None`): If `pooling_type` is None,
...@@ -528,11 +527,11 @@ class RNNEncoder(nn.Layer): ...@@ -528,11 +527,11 @@ class RNNEncoder(nn.Layer):
If not, output is of shape `(batch_size, hidden_size)`. If not, output is of shape `(batch_size, hidden_size)`.
Args: Args:
inputs (obj:`Paddle.Tensor`, required): Shape as `(batch_size, num_tokens, input_size)`. inputs (paddle.Tensor): Shape as `(batch_size, num_tokens, input_size)`.
sequence_length (obj:`Paddle.Tensor`, required): Shape as `(batch_size)`. sequence_length (paddle.Tensor): Shape as `(batch_size)`.
Returns: Returns:
last_hidden (obj:`Paddle.Tensor`, required): Shape as `(batch_size, hidden_size)`. last_hidden (paddle.Tensor): Shape as `(batch_size, hidden_size)`.
The hidden state at the last time step for every layer. The hidden state at the last time step for every layer.
""" """
...@@ -553,8 +552,8 @@ class RNNEncoder(nn.Layer): ...@@ -553,8 +552,8 @@ class RNNEncoder(nn.Layer):
else: else:
# We exploit the `encoded_text` (the hidden state at the every time step for last layer) # We exploit the `encoded_text` (the hidden state at the every time step for last layer)
# to create a single vector. We perform pooling on the encoded text. # to create a single vector. We perform pooling on the encoded text.
# If rnn is not bidirection, output is shape of `(batch_size, hidden_size)`. # The output shape is `(batch_size, hidden_size*2)` if using a bidirectional RNN,
# If rnn is bidirection, then output is shape of `(batch_size, hidden_size*2)`. # otherwise the output shape is `(batch_size, hidden_size)`.
if self._pooling_type == 'sum': if self._pooling_type == 'sum':
output = paddle.sum(encoded_text, axis=1) output = paddle.sum(encoded_text, axis=1)
elif self._pooling_type == 'max': elif self._pooling_type == 'max':
...@@ -676,10 +675,10 @@ class TCNEncoder(nn.Layer): ...@@ -676,10 +675,10 @@ class TCNEncoder(nn.Layer):
such as LSTMs in many tasks. See https://arxiv.org/pdf/1803.01271.pdf for more details. such as LSTMs in many tasks. See https://arxiv.org/pdf/1803.01271.pdf for more details.
Args: Args:
input_size (obj:`int`, required): The number of expected features in the input (the last dimension). input_size (int): The number of expected features in the input (the last dimension).
num_channels (obj:`list` or obj:`tuple`, required): The number of channels in different layer. num_channels (list): The number of channels in each layer.
kernel_size (obj:`int`, optional): The kernel size. Defaults to 2. kernel_size (int): The kernel size. Defaults to 2.
dropout (obj:`float`, optional): The dropout probability. Defaults to 0.2. dropout (float): The dropout probability. Defaults to 0.2.
""" """
def __init__(self, input_size, num_channels, kernel_size=2, dropout=0.2): def __init__(self, input_size, num_channels, kernel_size=2, dropout=0.2):
...@@ -733,10 +732,10 @@ class TCNEncoder(nn.Layer): ...@@ -733,10 +732,10 @@ class TCNEncoder(nn.Layer):
receptive field = $2 * \sum_{i=0}^{len(num\_channels)-1}2^i(kernel\_size-1)$. receptive field = $2 * \sum_{i=0}^{len(num\_channels)-1}2^i(kernel\_size-1)$.
Args: Args:
inputs (obj:`Paddle.Tensor`, required): The input tensor with shape `[batch_size, num_tokens, input_size]`. inputs (paddle.Tensor): The input tensor with shape `[batch_size, num_tokens, input_size]`.
Returns: Returns:
output (obj:`Paddle.Tensor`): The output tensor with shape `[batch_size, num_channels[-1]]`. output (paddle.Tensor): The output tensor with shape `[batch_size, num_channels[-1]]`.
""" """
inputs_t = inputs.transpose([0, 2, 1]) inputs_t = inputs.transpose([0, 2, 1])
output = self.network(inputs_t).transpose([2, 0, 1])[-1] output = self.network(inputs_t).transpose([2, 0, 1])[-1]
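For orientation, a minimal usage sketch of `TCNEncoder` as defined above (the import path is assumed from where these seq2vec encoders live in paddlenlp; the dimensions are arbitrary):

```python
import paddle
from paddlenlp.seq2vec import TCNEncoder  # assumed module path

encoder = TCNEncoder(input_size=128, num_channels=[64, 64, 64])
inputs = paddle.rand([8, 50, 128])  # (batch_size, num_tokens, input_size)
output = encoder(inputs)
print(output.shape)  # [8, 64] -> (batch_size, num_channels[-1])
```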
......
...@@ -442,6 +442,39 @@ class BertTokenizer(PretrainedTokenizer): ...@@ -442,6 +442,39 @@ class BertTokenizer(PretrainedTokenizer):
return len(_cls + token_ids_0 + _sep) * [0] + len(token_ids_1 + return len(_cls + token_ids_0 + _sep) * [0] + len(token_ids_1 +
_sep) * [1] _sep) * [1]
def get_special_tokens_mask(self,
token_ids_0,
token_ids_1=None,
already_has_special_tokens=False):
"""
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer ``encode`` methods.
Args:
token_ids_0 (List[int]): List of ids of the first sequence.
token_ids_1 (List[int], optional): List of ids of the second sequence.
already_has_special_tokens (bool, optional): Whether or not the token list is already
formatted with special tokens for the model. Defaults to False.
Returns:
results (List[int]): The list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""
if already_has_special_tokens:
if token_ids_1 is not None:
raise ValueError(
"You should not supply a second sequence if the provided sequence of "
"ids is already formatted with special tokens for the model."
)
return list(
map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0,
token_ids_0))
if token_ids_1 is not None:
return [1] + ([0] * len(token_ids_0)) + [1] + (
[0] * len(token_ids_1)) + [1]
return [1] + ([0] * len(token_ids_0)) + [1]
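A small usage sketch of the new method (the ids are illustrative, not real vocab entries):

```python
from paddlenlp.transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
# Single sequence: the encoded layout is [CLS] ids_0 [SEP]
print(tokenizer.get_special_tokens_mask([10, 11, 12]))
# -> [1, 0, 0, 0, 1]
# Sequence pair: the encoded layout is [CLS] ids_0 [SEP] ids_1 [SEP]
print(tokenizer.get_special_tokens_mask([10, 11], [20, 21, 22]))
# -> [1, 0, 0, 1, 0, 0, 0, 1]
```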
def encode(self, def encode(self,
text, text,
text_pair=None, text_pair=None,
......
...@@ -211,6 +211,39 @@ class ElectraTokenizer(PretrainedTokenizer): ...@@ -211,6 +211,39 @@ class ElectraTokenizer(PretrainedTokenizer):
return len(_cls + token_ids_0 + _sep) * [0] + len(token_ids_1 + return len(_cls + token_ids_0 + _sep) * [0] + len(token_ids_1 +
_sep) * [1] _sep) * [1]
def get_special_tokens_mask(self,
token_ids_0,
token_ids_1=None,
already_has_special_tokens=False):
"""
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer ``encode`` methods.
Args:
token_ids_0 (List[int]): List of ids of the first sequence.
token_ids_1 (List[int], optional): List of ids of the second sequence.
already_has_special_tokens (bool, optional): Whether or not the token list is already
formatted with special tokens for the model. Defaults to False.
Returns:
results (List[int]): The list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""
if already_has_special_tokens:
if token_ids_1 is not None:
raise ValueError(
"You should not supply a second sequence if the provided sequence of "
"ids is already formatted with special tokens for the model."
)
return list(
map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0,
token_ids_0))
if token_ids_1 is not None:
return [1] + ([0] * len(token_ids_0)) + [1] + (
[0] * len(token_ids_1)) + [1]
return [1] + ([0] * len(token_ids_0)) + [1]
def encode(self, def encode(self,
text, text,
text_pair=None, text_pair=None,
......
...@@ -650,6 +650,39 @@ class ErnieTinyTokenizer(PretrainedTokenizer): ...@@ -650,6 +650,39 @@ class ErnieTinyTokenizer(PretrainedTokenizer):
return len(_cls + token_ids_0 + _sep) * [0] + len(token_ids_1 + return len(_cls + token_ids_0 + _sep) * [0] + len(token_ids_1 +
_sep) * [1] _sep) * [1]
def get_special_tokens_mask(self,
token_ids_0,
token_ids_1=None,
already_has_special_tokens=False):
"""
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer ``encode`` methods.
Args:
token_ids_0 (List[int]): List of ids of the first sequence.
token_ids_1 (List[int], optional): List of ids of the second sequence.
already_has_special_tokens (bool, optional): Whether or not the token list is already
formatted with special tokens for the model. Defaults to False.
Returns:
results (List[int]): The list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""
if already_has_special_tokens:
if token_ids_1 is not None:
raise ValueError(
"You should not supply a second sequence if the provided sequence of "
"ids is already formatted with special tokens for the model."
)
return list(
map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0,
token_ids_0))
if token_ids_1 is not None:
return [1] + ([0] * len(token_ids_0)) + [1] + (
[0] * len(token_ids_1)) + [1]
return [1] + ([0] * len(token_ids_0)) + [1]
def encode(self, def encode(self,
text, text,
text_pair=None, text_pair=None,
......
...@@ -216,6 +216,39 @@ class RobertaTokenizer(PretrainedTokenizer): ...@@ -216,6 +216,39 @@ class RobertaTokenizer(PretrainedTokenizer):
return len(_cls + token_ids_0 + _sep) * [0] + len(token_ids_1 + return len(_cls + token_ids_0 + _sep) * [0] + len(token_ids_1 +
_sep) * [1] _sep) * [1]
def get_special_tokens_mask(self,
token_ids_0,
token_ids_1=None,
already_has_special_tokens=False):
"""
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer ``encode`` methods.
Args:
token_ids_0 (List[int]): List of ids of the first sequence.
token_ids_1 (List[int], optional): List of ids of the second sequence.
already_has_special_tokens (bool, optional): Whether or not the token list is already
formatted with special tokens for the model. Defaults to False.
Returns:
results (List[int]): The list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""
if already_has_special_tokens:
if token_ids_1 is not None:
raise ValueError(
"You should not supply a second sequence if the provided sequence of "
"ids is already formatted with special tokens for the model."
)
return list(
map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0,
token_ids_0))
if token_ids_1 is not None:
return [1] + ([0] * len(token_ids_0)) + [1] + (
[0] * len(token_ids_1)) + [1]
return [1] + ([0] * len(token_ids_0)) + [1]
def encode(self, def encode(self,
text, text,
text_pair=None, text_pair=None,
......
...@@ -437,3 +437,23 @@ class PretrainedTokenizer(object): ...@@ -437,3 +437,23 @@ class PretrainedTokenizer(object):
"Truncation_strategy should be selected in ['longest_first', 'only_first', 'only_second', 'do_not_truncate']" "Truncation_strategy should be selected in ['longest_first', 'only_first', 'only_second', 'do_not_truncate']"
) )
return (ids, pair_ids, overflowing_tokens) return (ids, pair_ids, overflowing_tokens)
def get_special_tokens_mask(self,
token_ids_0,
token_ids_1=None,
already_has_special_tokens=False):
"""
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer ``encode`` methods.
Args:
token_ids_0 (List[int]): List of ids of the first sequence.
token_ids_1 (List[int], optional): List of ids of the second sequence.
already_has_special_tokens (bool, optional): Whether or not the token list is already
formatted with special tokens for the model. Defaults to False.
Returns:
results (List[int]): The list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""
return [0] * ((len(token_ids_1)
if token_ids_1 else 0) + len(token_ids_0))
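Note that this base-class fallback returns all zeros, i.e. it treats no position as special and ignores `already_has_special_tokens`; for the toy pair above it would return `[0, 0, 0, 0, 0]`, whereas the `BertTokenizer` override returns `[1, 0, 0, 1, 0, 0, 0, 1]`. Tokenizers that insert `[CLS]`/`[SEP]` are expected to override it, as the subclasses above do.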