Unverified commit 91fcce52, authored by Chang Xu, committed by GitHub

Update NLP & HuggingFace Demo (#1212)

* Change Tokenizer in HuggingFace Demo
Parent 3a026b6a
...@@ -107,20 +107,14 @@ tar -zxvf afqmc.tar
```shell
export CUDA_VISIBLE_DEVICES=0
python run.py --config_path='./configs/pp-minilm/auto/afqmc.yaml' --save_dir='./save_afqmc_pruned/'
```

To evaluate model accuracy only, including the accuracy of a compressed model, change the model directory ```model_dir``` in the config file to the directory where the compressed model was saved, e.g. ```./output/cola/```, and add ```--eval True``` when launching the ```run.py``` script:
```shell
export CUDA_VISIBLE_DEVICES=0
python run.py --config_path=./configs/cola.yaml --eval True
```
## 4. Compression Configuration
Auto-compression requires a config file passed in via the ```config_path``` argument; the configs folder contains configuration files for the different tasks. The example below uses the afqmc dataset. Training parameters must be configured manually. The settings for distillation, pruning, and post-training quantization can either be derived automatically by the auto-compression strategy or configured manually. The auto-compression experiments for PaddleNLP models use the pruning, distillation, and post-training quantization strategies by default.
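The structure that ```run.py``` expects from such a config, and the ```Global``` check it performs, can be sketched in plain Python. The dict below mirrors the afqmc.yaml keys shown in this diff; ```load_config``` itself comes from paddleslim and is not re-implemented here:

```python
# Sketch: the dict structure run.py expects after paddleslim's load_config
# has parsed a config such as afqmc.yaml (keys taken from the configs below).
all_config = {
    "Global": {
        "model_dir": "./afqmc",
        "model_filename": "inference.pdmodel",
        "params_filename": "inference.pdiparams",
        "task_name": "afqmc",
        "dataset": "clue",
        "batch_size": 16,
        "max_seq_length": 128,
    },
    "TrainConfig": {"epochs": 6, "eval_iter": 1070},
}

# run.py asserts that the Global section exists before reading from it:
assert "Global" in all_config, "Key Global not found in config file."
global_config = all_config["Global"]
print(global_config["task_name"])  # afqmc
```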
......
Global:
  model_dir: ./AFQMC
  model_filename: inference.pdmodel
  params_filename: inference.pdiparams
  task_name: afqmc
  dataset: clue
  batch_size: 16
  max_seq_length: 128
TrainConfig:
  epochs: 6
  eval_iter: 1070
......
Global:
  model_dir: ./CLUEWSC
  model_filename: inference.pdmodel
  params_filename: inference.pdiparams
  task_name: cluewsc
  dataset: clue
  batch_size: 16
  max_seq_length: 128
TrainConfig:
  epochs: 100
  eval_iter: 70
......
Global:
  model_dir: ./CMNLI
  model_filename: inference.pdmodel
  params_filename: inference.pdiparams
  task_name: cmnli
  dataset: clue
  batch_size: 16
  max_seq_length: 128
TrainConfig:
  epochs: 6
  eval_iter: 2000
......
Global:
  model_dir: ./CSL
  model_filename: inference.pdmodel
  params_filename: inference.pdiparams
  task_name: csl
  dataset: clue
  batch_size: 16
  max_seq_length: 128
TrainConfig:
  epochs: 16
  eval_iter: 1000
......
Global:
  model_dir: ./IFLYTEK
  model_filename: inference.pdmodel
  params_filename: inference.pdiparams
  task_name: iflytek
  dataset: clue
  batch_size: 16
  max_seq_length: 128
TrainConfig:
  epochs: 12
  eval_iter: 750
......
Global:
  model_dir: ./OCNLI
  model_filename: inference.pdmodel
  params_filename: inference.pdiparams
  task_name: ocnli
  dataset: clue
  batch_size: 16
  max_seq_length: 128
TrainConfig:
  epochs: 20
  eval_iter: 1050
......
Global:
  model_dir: ./TNEWS
  model_filename: inference.pdmodel
  params_filename: inference.pdiparams
  task_name: tnews
  dataset: clue
  batch_size: 16
  max_seq_length: 128
TrainConfig:
  epochs: 6
  eval_iter: 1110
......
Global:
  model_dir: ./afqmc
  model_filename: inference.pdmodel
  params_filename: inference.pdiparams
  task_name: afqmc
  dataset: clue
  batch_size: 16
  max_seq_length: 128
TrainConfig:
  epochs: 6
  eval_iter: 1070
......
Global:
  model_dir: ./cluewsc
  model_filename: inference.pdmodel
  params_filename: inference.pdiparams
  task_name: cluewsc
  dataset: clue
  batch_size: 16
  max_seq_length: 128
TrainConfig:
  epochs: 100
  eval_iter: 70
......
Global:
  model_dir: ./cmnli
  model_filename: inference.pdmodel
  params_filename: inference.pdiparams
  task_name: cmnli
  dataset: clue
  batch_size: 16
  max_seq_length: 128
TrainConfig:
  epochs: 6
  eval_iter: 2000
......
Global:
  model_dir: ./csl
  model_filename: inference.pdmodel
  params_filename: inference.pdiparams
  task_name: csl
  dataset: clue
  batch_size: 16
  max_seq_length: 128
TrainConfig:
  epochs: 16
  eval_iter: 1000
......
Global:
  model_dir: ./iflytek
  model_filename: inference.pdmodel
  params_filename: inference.pdiparams
  task_name: iflytek
  dataset: clue
  batch_size: 16
  max_seq_length: 128
TrainConfig:
  epochs: 12
  eval_iter: 750
......
Global:
  model_dir: ./ocnli
  model_filename: inference.pdmodel
  params_filename: inference.pdiparams
  task_name: ocnli
  dataset: clue
  batch_size: 16
  max_seq_length: 128
TrainConfig:
  epochs: 20
  eval_iter: 1050
......
Global:
  model_dir: ./tnews
  model_filename: inference.pdmodel
  params_filename: inference.pdiparams
  task_name: tnews
  dataset: clue
  batch_size: 16
  max_seq_length: 128
TrainConfig:
  epochs: 6
  eval_iter: 1110
......
import os
import sys
import argparse
import functools
from functools import partial

import numpy as np
import paddle
import paddle.nn as nn
from paddle.io import Dataset, BatchSampler, DataLoader
from paddle.metric import Metric, Accuracy, Precision, Recall
from paddlenlp.transformers import AutoModelForTokenClassification, AutoTokenizer
from paddlenlp.datasets import load_dataset
from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.data.sampler import SamplerHelper
from paddlenlp.metrics import Mcc, PearsonAndSpearman
from paddleslim.auto_compression.config_helpers import load_config
from paddleslim.auto_compression.compressor import AutoCompression


def argsparser():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        '--config_path',
        type=str,
        default=None,
        help="path of compression strategy config.",
        required=True)
    parser.add_argument(
        '--save_dir',
        type=str,
        default='output',
        help="directory to save compressed model.")
    parser.add_argument(
        '--eval',
        type=bool,
        default=False,
        help="whether validate the model only.")
    return parser
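One stdlib subtlety in the ```--eval``` option defined above (standard argparse behavior, not specific to this repo): ```type=bool``` applies ```bool()``` to the raw argument string, so any non-empty value, including the string ```'False'```, parses as True. A minimal sketch:

```python
import argparse

# Mirrors the --eval option from argsparser() above.  type=bool calls
# bool() on the raw string, and bool() of any non-empty string is True.
parser = argparse.ArgumentParser()
parser.add_argument('--eval', type=bool, default=False)

print(parser.parse_args(['--eval', 'True']).eval)   # True
print(parser.parse_args(['--eval', 'False']).eval)  # also True: bool('False')
print(parser.parse_args([]).eval)                   # False (the default)
```

This is why the docs say to pass ```--eval True``` and omit the flag entirely to train.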
METRIC_CLASSES = {
    "cola": Mcc,
...@@ -61,11 +61,11 @@ def convert_example(example,
                    label_list,
                    max_seq_length=512,
                    is_test=False):
    assert global_config['dataset'] in [
        'glue', 'clue'
    ], "This demo only supports for dataset glue or clue"
    """Convert a glue example into necessary features."""
    if global_config['dataset'] == 'glue':
        if not is_test:
            # `label_list == None` is for regression task
            label_dtype = "int64" if label_list else "float32"
...@@ -80,7 +80,7 @@ def convert_example(example,
        else:
            return example['input_ids'], example['token_type_ids']
    else:  # if global_config['dataset'] == 'clue':
        if not is_test:
            # `label_list == None` is for regression task
            label_dtype = "int64" if label_list else "float32"
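The dtype choice above hinges on the truthiness of ```label_list```; a self-contained restatement of just that rule:

```python
def pick_label_dtype(label_list):
    # As in convert_example above: a missing/empty label_list marks a
    # regression task, which needs float labels.
    return "int64" if label_list else "float32"

print(pick_label_dtype(["0", "1"]))  # int64  (classification)
print(pick_label_dtype(None))        # float32 (regression)
```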
...@@ -151,18 +151,19 @@ def create_data_holder(task_name):
def reader():
    # Create the tokenizer and dataset
    tokenizer = AutoTokenizer.from_pretrained(global_config['model_dir'])
    train_ds, dev_ds = load_dataset(
        global_config['dataset'],
        global_config['task_name'],
        splits=('train', 'dev'))

    trans_func = partial(
        convert_example,
        tokenizer=tokenizer,
        label_list=train_ds.label_list,
        max_seq_length=global_config['max_seq_length'],
        is_test=True)
    train_ds = train_ds.map(trans_func, lazy=True)
...@@ -173,9 +174,10 @@ def reader():
    ): fn(samples)
    train_batch_sampler = paddle.io.BatchSampler(
        train_ds, batch_size=global_config['batch_size'], shuffle=True)
    [input_ids, token_type_ids, labels] = create_data_holder(global_config[
        'task_name'])
    feed_list_name = []
    train_data_loader = DataLoader(
        dataset=train_ds,
...@@ -189,7 +191,7 @@ def reader():
        convert_example,
        tokenizer=tokenizer,
        label_list=train_ds.label_list,
        max_seq_length=global_config['max_seq_length'])
    dev_batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type
...@@ -197,7 +199,7 @@ def reader():
    ): fn(samples)
    dev_ds = dev_ds.map(dev_trans_func, lazy=True)
    dev_batch_sampler = paddle.io.BatchSampler(
        dev_ds, batch_size=global_config['batch_size'], shuffle=False)
    dev_data_loader = DataLoader(
        dataset=dev_ds,
        batch_sampler=dev_batch_sampler,
...@@ -233,11 +235,11 @@ def eval():
    places = paddle.device._convert_to_place(devices)
    exe = paddle.static.Executor(places)
    val_program, feed_target_names, fetch_targets = paddle.static.load_inference_model(
        global_config['model_dir'],
        exe,
        model_filename=global_config['model_filename'],
        params_filename=global_config['params_filename'])
    print('Loaded model from: {}'.format(global_config['model_dir']))
    metric.reset()
    print('Evaluating...')
    for data in eval_dataloader():
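The evaluation loop above follows the reset/update/accumulate protocol of ```paddle.metric```. A toy stand-in (not the paddle class, just the shape of the protocol):

```python
class ToyAccuracy:
    """Minimal stand-in for the reset/update/accumulate protocol that the
    paddle.metric classes follow in eval() above."""

    def reset(self):
        self.correct = 0
        self.total = 0

    def update(self, preds, labels):
        self.correct += sum(p == l for p, l in zip(preds, labels))
        self.total += len(labels)

    def accumulate(self):
        return self.correct / self.total


metric = ToyAccuracy()
metric.reset()
for preds, labels in [([1, 0], [1, 1]), ([1], [1])]:
    metric.update(preds, labels)
print(metric.accumulate())  # 2 correct out of 3 predictions
```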
...@@ -268,17 +270,23 @@ def apply_decay_param_fun(name):
    return False


def main():
    all_config = load_config(args.config_path)

    global global_config
    assert "Global" in all_config, "Key Global not found in config file."
    global_config = all_config["Global"]

    if 'TrainConfig' in all_config:
        all_config['TrainConfig']['optimizer_builder'][
            'apply_decay_param_fun'] = apply_decay_param_fun

    global train_dataloader, eval_dataloader
    train_dataloader, eval_dataloader = reader()

    global metric
    metric_class = METRIC_CLASSES[global_config['task_name']]
    metric = metric_class()

    if args.eval:
...@@ -287,9 +295,9 @@ if __name__ == '__main__':
        sys.exit(0)

    ac = AutoCompression(
        model_dir=global_config['model_dir'],
        model_filename=global_config['model_filename'],
        params_filename=global_config['params_filename'],
        save_dir=args.save_dir,
        config=all_config,
        train_dataloader=train_dataloader,
...@@ -298,3 +306,10 @@ if __name__ == '__main__':
        eval_dataloader=eval_dataloader)
    ac.compress()


if __name__ == '__main__':
    paddle.enable_static()
    parser = argsparser()
    args = parser.parse_args()
    main()
export CUDA_VISIBLE_DEVICES=0
export FLAGS_cudnn_deterministic=True
python run.py --config_path='./configs/pp-minilm/auto/afqmc.yaml' --save_dir='./save_afqmc_pruned/'
...@@ -46,6 +46,7 @@
- PaddlePaddle >= 2.3 (can be installed from the [Paddle website](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html))
- PaddleSlim develop branch, or PaddleSlim >= 2.3.0
- X2Paddle develop branch
- transformers >= 4.18.0
- PaddleNLP >= 2.3
- tensorflow == 1.14 (required to compress TensorFlow models)
- onnx >= 1.6.0 (required to compress ONNX models)
...@@ -73,12 +74,18 @@ git checkout develop
python setup.py install
```

Install transformers:
```shell
pip install transformers
```
Note: transformers is installed in order to use its Tokenizer.

Install paddlenlp:
```shell
pip install paddlenlp
```
Note: PaddleNLP is installed in order to download its datasets.
#### 3.2 Prepare the Dataset
...@@ -165,11 +172,11 @@ def main(x0, x1, x2):
    sepc_list = list()
    sepc_list.append(
        paddle.static.InputSpec(
            shape=[-1, 128], name="x2paddle_input_ids", dtype="int64"),
        paddle.static.InputSpec(
            shape=[-1, 128], name="x2paddle_attention_mask", dtype="int64"),
        paddle.static.InputSpec(
            shape=[-1, 128], name="x2paddle_token_type_ids", dtype="int64"))
    static_model = paddle.jit.to_static(model, input_spec=sepc_list)
    paddle.jit.save(static_model, "./x2paddle_cola")
...@@ -184,10 +191,10 @@ export CUDA_VISIBLE_DEVICES=0
python run.py --config_path=./configs/cola.yaml --save_dir='./output/cola/'
```

To evaluate model accuracy only, including the accuracy of the compressed model, change the model directory ```model_dir``` in the config file to the directory where the compressed model was saved, ```./output/cola/```, and add ```--eval True``` when launching the ```run.py``` script:
```shell
export CUDA_VISIBLE_DEVICES=0
python run.py --config_path=./configs/cola.yaml --eval True
```
## 4. Inference Deployment
...@@ -211,4 +218,6 @@ python -u ./infer.py \
- ```fp16```: whether to enable ```FP16```

If using the TensorRT prediction engine, a Paddle build with ```WITH_TRT=ON``` must be installed; download: [Python inference library](https://paddleinference.paddlepaddle.org.cn/master/user_guides/download_lib.html#python)
## 5. FAQ
...@@ -22,7 +22,7 @@ import functools
from functools import partial

from paddle.io import Dataset, BatchSampler, DataLoader
from paddle.metric import Metric, Accuracy
from transformers import AutoTokenizer
from paddlenlp.datasets import load_dataset
from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman
...@@ -136,7 +136,8 @@ def create_data_holder(task_name, input_names):
def reader():
    # Create the tokenizer and dataset
    tokenizer = AutoTokenizer.from_pretrained(
        global_config['model_dir'], use_fast=False)
    train_ds = load_dataset(
        global_config['dataset'], global_config['task_name'], splits="train")
...@@ -344,7 +345,7 @@ def main():
        model_filename=global_config['model_filename'],
        params_filename=global_config['params_filename'],
        save_dir=args.save_dir,
        config=all_config,
        train_dataloader=train_dataloader,
        eval_callback=eval_function if
        (len(list(all_config.keys())) == 2 and 'TrainConfig' in all_config) or
......