Commit e9c7c30e, authored by SYSU_BOND, committed by bbking

Update PaddleNLP LAC model for new codestyle (#3463)

* reconstruct run_sequence_labeling.py into train.py, predict.py and eval.py & add yaml configuration

* reconstruct the ERNIE based LAC model

* Update train_ernie.py

fix recurring multi-GPU NaN issue

* update the ernie base model

* configure update

* update configure

* add inference model

* standardize code style

* delete unused run_sequence_labeling.py

* rename evaluate.py to compare.py

* add postfix '.pdckpt' to model

* update README.md

* add LAC class(for predict conveniently)

* add LAC class(for predict conveniently)

* update README.md

* fixed bug on run_ernie

* update default setting

* fix inference bug on Windows

* fix inference bug on Windows

* update new model and dataset

* delete the postfix .pdckpt of model checkpoint directory

* update new model's performance

* fixed the bug for empty input

* remove use of the tqdm tool

* fix the bug of train_data
Parent 0cc14636
......@@ -2,13 +2,13 @@
## 1. Introduction
Lexical Analysis of Chinese (LAC) is a joint lexical analysis model that completes Chinese word segmentation, part-of-speech tagging, and named entity recognition as one holistic task. We evaluate all three subtasks jointly on our in-house dataset; the numbers are given in the table below. In addition, we finetune Baidu's open-sourced [ERNIE](https://github.com/PaddlePaddle/LARK/tree/develop/ERNIE) model and compare the baseline, BERT finetuned, and ERNIE finetuned results, which show a clear improvement. Baidu's lexical analysis service can be tried online at [AI Open Platform - Lexical Analysis](http://ai.baidu.com/tech/nlp/lexical).
Lexical Analysis of Chinese (LAC) is a joint lexical analysis model that completes Chinese word segmentation, part-of-speech tagging, and named entity recognition within a single model. We evaluate all three subtasks jointly on our in-house dataset; the numbers are given in the table below. In addition, we finetune Baidu's open-sourced [ERNIE](https://github.com/PaddlePaddle/LARK/tree/develop/ERNIE) model and compare the baseline, BERT finetuned, and ERNIE finetuned results, which show a clear improvement. Baidu's lexical analysis service can be tried online at [AI Open Platform - Lexical Analysis](http://ai.baidu.com/tech/nlp/lexical).
|Model|Precision|Recall|F1-score|
|:-:|:-:|:-:|:-:|
|Lexical Analysis|88.0%|88.7%|88.4%|
|Lexical Analysis|89.2%|89.4%|89.3%|
|BERT finetuned|90.2%|90.4%|90.3%|
|ERNIE finetuned|92.0%|92.0%|92.0%|
|ERNIE finetuned|91.7%|91.7%|91.7%|
## 2. Quick Start
......@@ -16,7 +16,7 @@ Lexical Analysis of Chinese (LAC) is a joint lexical analysis model
#### 1. Install PaddlePaddle
This project requires PaddlePaddle 1.3.2 or later; for installation see the official [quick install guide](http://www.paddlepaddle.org/paddle#quick-start)
This project requires PaddlePaddle 1.4.0 or later and PaddleHub 1.0.0 or later; for PaddlePaddle installation see the official [quick install guide](http://www.paddlepaddle.org/paddle#quick-start), and for PaddleHub see [PaddleHub](https://github.com/PaddlePaddle/PaddleHub)
> Warning: the GPU and CPU builds of PaddlePaddle are released as paddlepaddle-gpu and paddlepaddle respectively; be careful to install the right one.
......@@ -27,20 +27,32 @@ Lexical Analysis of Chinese (LAC) is a joint lexical analysis model
cd models/PaddleNLP/lexical_analysis
```
### Data Preparation
#### 1. Quick Download
The **datasets** and **pretrained models** used in this project can be fetched at once with the script below; if you only need part of them, follow the itemized download instructions in the sections that follow
```bash
sh download.sh
```
#### 2. Training Dataset
Download the dataset archive; unpacking it creates the `./data/` directory
```bash
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/lexical_analysis-dataset-1.0.0.tar.gz
tar xvf lexical_analysis-dataset-1.0.0.tar.gz
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/lexical_analysis-dataset-2.0.0.tar.gz
tar xvf lexical_analysis-dataset-2.0.0.tar.gz
```
### Model Download
#### 3. Pretrained Models
We have open-sourced the lexical analysis model trained on our in-house dataset for direct use. Two download methods are provided:
Method 1: via the PaddleHub command-line tool; for PaddleHub installation see [PaddleHub](https://github.com/PaddlePaddle/PaddleHub)
```bash
# download baseline model
hub download lexical_analysis
tar xvf lexical_analysis-1.0.0.tar.gz
tar xvf lexical_analysis-2.0.0.tar.gz
# download ERNIE finetuned model
hub download lexical_analysis_finetuned
......@@ -50,17 +62,18 @@ tar xvf lexical_analysis_finetuned-1.0.0.tar.gz
Method 2: direct download
```bash
# download baseline model
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/lexical_analysis-1.0.0.tar.gz
tar xvf lexical_analysis-1.0.0.tar.gz
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/lexical_analysis-2.0.0.tar.gz
tar xvf lexical_analysis-2.0.0.tar.gz
# download ERNIE finetuned model
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/lexical_analysis_finetuned-1.0.0.tar.gz
tar xvf lexical_analysis_finetuned-1.0.0.tar.gz
```
Note: to download ERNIE's released models, see [ERNIE](https://github.com/PaddlePaddle/LARK/tree/develop/ERNIE); place them under `./pretrained/` after downloading.
Note: for ERNIE finetuning you need to download ERNIE's released model yourself from [https://baidu-nlp.bj.bcebos.com/ERNIE_stable-1.0.1.tar.gz](https://baidu-nlp.bj.bcebos.com/ERNIE_stable-1.0.1.tar.gz) and extract it into `./pretrained/`.
### Model Evaluation
We trained a lexical analysis model on our in-house dataset; you can evaluate it directly on the test set `./data/test.tsv`:
```bash
# baseline model
......@@ -71,16 +84,33 @@ sh run_ernie.sh eval
```
### Model Training
Using the sample dataset, run the command below to train on the training set `./data/train.tsv`
Using the sample dataset, the commands below train on the training set `./data/train.tsv`; the examples cover single-GPU and multi-GPU machines as well as multi-threaded CPU runs
> Warning: for ERNIE finetuning you need to download ERNIE's released model yourself from [https://baidu-nlp.bj.bcebos.com/ERNIE_stable-1.0.1.tar.gz](https://baidu-nlp.bj.bcebos.com/ERNIE_stable-1.0.1.tar.gz) and extract it into `./pretrained/`.
```bash
# baseline model
sh run.sh train
# baseline model, using single GPU
sh run.sh train_single_gpu
# baseline model, using multi GPU
sh run.sh train_multi_gpu
# baseline model, using multi CPU
sh run.sh train_multi_cpu
# ERNIE finetuned model
sh run_ernie.sh train
# ERNIE finetuned model, using single GPU
sh run_ernie.sh train_single_gpu
# ERNIE finetuned model, using multi CPU
sh run_ernie.sh train_multi_cpu
```
Note: the ERNIE-based sequence labeling model does not support multi-GPU training yet
### Model Prediction
Load an existing model and run prediction on new data
```bash
# baseline model
......@@ -90,6 +120,20 @@ sh run.sh infer
sh run_ernie.sh infer
```
### Model Saving
Convert a trained model into the format used for deployment and inference
```bash
# baseline model
export PYTHONIOENCODING=UTF-8 # the model output is Unicode; without this setting Python 2 tends to raise encoding errors
python inference_model.py \
--init_checkpoint ./model_baseline \
--inference_save_dir ./inference_model
```
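The exported directory can then be loaded back for prediction. The snippet below is a minimal sketch assuming the fluid 1.x API and the default filenames written by `inference_model.py` above; the word ids are placeholders, and real ids come from `reader.Dataset.word_to_ids`.

```python
import numpy as np
import paddle.fluid as fluid

place = fluid.CPUPlace()
exe = fluid.Executor(place)

# load the exported model (filenames match inference_model.py)
[program, feed_names, fetch_targets] = fluid.io.load_inference_model(
    './inference_model', exe,
    model_filename='model.pdmodel', params_filename='params.pdparams')

# 'words' expects an int64 LoDTensor of word ids (placeholder ids here)
word_ids = [np.array([10, 25, 3], dtype=np.int64)]
tensor_words = fluid.create_lod_tensor(word_ids, [[3]], place)
crf_decode = exe.run(program, feed={feed_names[0]: tensor_words},
                     fetch_list=fetch_targets, return_numpy=False)
```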
## 3. Advanced Usage
### Task Definition and Modeling
......@@ -99,7 +143,7 @@ sh run_ernie.sh infer
3. The character embedding sequence is fed into a bidirectional GRU to learn feature representations of the input; we stack two bi-GRU layers to increase model capacity;
4. A CRF takes the features learned by the GRU as input and the tag sequence as its supervision signal to perform sequence labeling (a condensed sketch follows).
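As a condensed sketch of steps 1-4 (the full implementation lives in the network definition code later in this commit), the stacked bi-GRU encoder producing CRF emission scores looks roughly like this; the function name and defaults are illustrative:

```python
import paddle.fluid as fluid

def bigru_crf_sketch(word, vocab_size, num_labels, emb_dim=128, hidden=128, depth=2):
    # steps 1-2: look up character embeddings for the input sequence
    feat = fluid.layers.embedding(input=word, size=[vocab_size, emb_dim], is_sparse=True)
    # step 3: two stacked bidirectional GRU layers
    for _ in range(depth):
        fwd = fluid.layers.dynamic_gru(fluid.layers.fc(feat, hidden * 3), hidden)
        bwd = fluid.layers.dynamic_gru(fluid.layers.fc(feat, hidden * 3), hidden, is_reverse=True)
        feat = fluid.layers.concat([fwd, bwd], axis=1)
    # step 4: per-token emission scores, consumed by linear_chain_crf / crf_decoding
    return fluid.layers.fc(feat, num_labels)
```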
The POS and named entity tag sets are listed in the table below: 24 POS tags (lowercase letters) and 4 named entity tags (uppercase letters). Note that the four categories person, location, organization, and time each have two sets of tags (PER / LOC / ORG / TIME and nr / ns / nt / t): words labeled with the second set are those the model judges to be low-confidence persons, locations, organizations, or time expressions. With these two tag sets, developers can make their own precision/recall trade-off for the four categories.
| Tag | Meaning | Tag | Meaning | Tag | Meaning | Tag | Meaning |
| ---- | -------- | ---- | -------- | ---- | -------- | ---- | -------- |
......@@ -141,14 +185,19 @@ sh run_ernie.sh infer
```text
.
├── README.md # this document
├── conf/ # dictionaries
├── conf/ # dictionaries and default program configuration
├── compare.py # script comparing LAC with other open-source segmenters
├── creator.py # script that builds the network and data readers
├── data/ # directory for datasets
├── downloads.sh # script for downloading data and models
├── eval.py # evaluation script for lexical analysis
├── inference_model.py # script that saves an inference_model for deployment
├── gru-crf-model.png # model figure used in this README
├── predict.py # prediction script
├── reader.py # file-reading helper functions
├── run_ernie_sequence_labeling.py # code for finetuning ERNIE
├── run_ernie.sh # launcher for the script above
├── run_sequence_labeling.py # lexical analysis task code
├── train.py # training script for lexical analysis
├── run.sh # launcher for the scripts above
└── utils.py # common utility functions
```
......
......@@ -11,7 +11,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#coding=utf-8
# -*- coding: UTF-8 -*-
"""
evaluate wordseg for LAC and other open-source wordseg tools
"""
......@@ -275,6 +275,4 @@ def evaluate_all():
if __name__ == "__main__":
import ipdb
#ipdb.set_trace()
evaluate_all()
model:
word_emb_dim:
val: 128
meaning: "The dimension in which a word is embedded."
grnn_hidden_dim:
val: 128
meaning: "The number of hidden nodes in the GRNN layer."
bigru_num:
val: 2
meaning: "The number of bi_gru layers in the network."
init_checkpoint:
val: ""
meaning: "Path to init model"
inference_save_dir:
val: ""
meaning: "Path to save inference model"
train:
random_seed:
val: 0
meaning: "Random seed for training"
  print_steps:
    val: 1
    meaning: "Print training metrics once every N batches of training"
  save_steps:
    val: 10
    meaning: "Save a checkpoint once every N batches of training"
  validation_steps:
    val: 10
    meaning: "Run validation once every N batches of training"
batch_size:
val: 300
meaning: "The number of sequences contained in a mini-batch"
epoch:
val: 10
meaning: "Corpus iteration num"
use_cuda:
val: False
meaning: "If set, use GPU for training."
traindata_shuffle_buffer:
val: 20000
meaning: "The buffer size used in shuffle the training data."
base_learning_rate:
val: 0.001
meaning: "The basic learning rate that affects the entire network."
emb_learning_rate:
val: 2
meaning: "The real learning rate of the embedding layer will be (emb_learning_rate * base_learning_rate)."
crf_learning_rate:
val: 0.2
meaning: "The real learning rate of the embedding layer will be (crf_learning_rate * base_learning_rate)."
enable_ce:
val: false
meaning: 'If set, run the task with continuous evaluation logs.'
cpu_num:
val: 10
meaning: "The number of cpu used to train model, this argument wouldn't be valid if use_cuda=true"
data:
word_dict_path:
val: "./conf/word.dic"
meaning: "The path of the word dictionary."
label_dict_path:
val: "./conf/tag.dic"
meaning: "The path of the label dictionary."
word_rep_dict_path:
val: "./conf/q2b.dic"
meaning: "The path of the word replacement Dictionary."
train_data:
val: "./data/train.tsv"
meaning: "The folder where the training data is located."
test_data:
val: "./data/test.tsv"
meaning: "The folder where the test data is located."
infer_data:
val: "./data/infer.tsv"
meaning: "The folder where the infer data is located."
model_save_dir:
val: "./models"
meaning: "The model will be saved in this path."
model:
ernie_config_path:
val: "../LARK/ERNIE/config/ernie_config.json"
meaning: "Path to the json file for ernie model config."
init_checkpoint:
val: ""
meaning: "Path to init model"
mode:
val: "train"
meaning: "Setting to train or eval or infer"
init_pretraining_params:
val: "pretrained/params/"
meaning: "Init pre-training params which preforms fine-tuning from. If the arg 'init_checkpoint' has been set, this argument wouldn't be valid."
train:
random_seed:
val: 0
meaning: "Random seed for training"
batch_size:
val: 10
meaning: "The number of sequences contained in a mini-batch"
epoch:
val: 10
meaning: "Corpus iteration num"
use_cuda:
val: True
meaning: "If set, use GPU for training."
base_learning_rate:
val: 0.0002
meaning: "The basic learning rate that affects the entire network."
init_bound:
val: 0.1
meaning: "init bound for initialization."
crf_learning_rate:
val: 0.2
meaning: "The real learning rate of the embedding layer will be (crf_learning_rate * base_learning_rate)."
cpu_num:
val: 10
meaning: "The number of cpu used to train model, it works when use_cuda=False"
  print_steps:
    val: 1
    meaning: "Print training metrics once every N batches of training"
  save_steps:
    val: 10
    meaning: "Save a checkpoint once every N batches of training"
  validation_steps:
    val: 5
    meaning: "Run validation once every N batches of training"
data:
vocab_path:
val: "../LARK/ERNIE/config/vocab.txt"
meaning: "The path of the vocabulary."
label_map_config:
val: "./conf/label_map.json"
meaning: "The path of the label dictionary."
num_labels:
val: 57
meaning: "label number"
max_seq_len:
val: 128
meaning: "Number of words of the longest seqence."
do_lower_case:
val: True
meaning: "Whether to lower case the input text. Should be True for uncased models and False for cased models."
train_data:
val: "./data/train.tsv"
meaning: "The folder where the training data is located."
test_data:
val: "./data/test.tsv"
meaning: "The folder where the test data is located."
infer_data:
val: "./data/test.tsv"
meaning: "The folder where the infer data is located."
model_save_dir:
val: "./ernie_models"
meaning: "The model will be saved in this path."
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# -*- coding: UTF-8 -*-
"""
The function lex_net(args) define the lexical analysis network structure
"""
import sys
import os
import math
import paddle
import paddle.fluid as fluid
from paddle.fluid.initializer import NormalInitializer
from reader import Dataset
sys.path.append("..")
from models.sequence_labeling import nets
from models.representation.ernie import ernie_encoder
from preprocess.ernie import task_reader
def create_model(args, vocab_size, num_labels, mode='train'):
    """create lac model"""

    # model's input data
    words = fluid.layers.data(name='words', shape=[-1, 1], dtype='int64', lod_level=1)
    targets = fluid.layers.data(name='targets', shape=[-1, 1], dtype='int64', lod_level=1)

    # for inference process
    if mode == 'infer':
        crf_decode = nets.lex_net(words, args, vocab_size, num_labels, for_infer=True, target=None)
        return {"feed_list": [words], "words": words, "crf_decode": crf_decode}

    # for test or train process
    avg_cost, crf_decode = nets.lex_net(words, args, vocab_size, num_labels, for_infer=False, target=targets)
(precision, recall, f1_score, num_infer_chunks, num_label_chunks,
num_correct_chunks) = fluid.layers.chunk_eval(
input=crf_decode,
label=targets,
chunk_scheme="IOB",
num_chunk_types=int(math.ceil((num_labels - 1) / 2.0)))
chunk_evaluator = fluid.metrics.ChunkEvaluator()
chunk_evaluator.reset()
    ret = {
        "feed_list": [words, targets],
        "words": words,
        "targets": targets,
        "avg_cost": avg_cost,
        "crf_decode": crf_decode,
        "precision": precision,
        "recall": recall,
        "f1_score": f1_score,
        "chunk_evaluator": chunk_evaluator,
        "num_infer_chunks": num_infer_chunks,
        "num_label_chunks": num_label_chunks,
        "num_correct_chunks": num_correct_chunks
    }
    return ret
def create_pyreader(args, file_name, feed_list, place, model='lac', reader=None, return_reader=False, mode='train'):
    """create a PyReader that feeds data for the 'lac' or 'ernie' model"""
    # init reader
    pyreader = fluid.io.PyReader(
        feed_list=feed_list,
        capacity=300,
        use_double_buffer=True,
        iterable=True
    )

    if model == 'lac':
        if reader is None:
            reader = Dataset(args)
# create lac pyreader
if mode == 'train':
pyreader.decorate_sample_list_generator(
paddle.batch(
paddle.reader.shuffle(
reader.file_reader(file_name),
buf_size=args.traindata_shuffle_buffer
),
batch_size=args.batch_size
),
places=place
)
else:
pyreader.decorate_sample_list_generator(
paddle.batch(
reader.file_reader(file_name, mode=mode),
batch_size=args.batch_size
),
places=place
)
elif model == 'ernie':
# create ernie pyreader
        if reader is None:
reader = task_reader.SequenceLabelReader(
vocab_path=args.vocab_path,
label_map_config=args.label_map_config,
max_seq_len=args.max_seq_len,
do_lower_case=args.do_lower_case,
in_tokens=False,
random_seed=args.random_seed)
if mode == 'train':
pyreader.decorate_batch_generator(
reader.data_generator(
file_name, args.batch_size, args.epoch, shuffle=True, phase="train"
),
places=place
)
else:
pyreader.decorate_batch_generator(
reader.data_generator(
file_name, args.batch_size, epoch=1, shuffle=False, phase=mode
),
places=place
)
if return_reader:
return pyreader, reader
else:
return pyreader
def create_ernie_model(args, ernie_config):
"""
Create Model for LAC based on ERNIE encoder
"""
# ERNIE's input data
    src_ids = fluid.layers.data(name='src_ids', shape=[args.max_seq_len, 1], dtype='int64', lod_level=0)
    sent_ids = fluid.layers.data(name='sent_ids', shape=[args.max_seq_len, 1], dtype='int64', lod_level=0)
    pos_ids = fluid.layers.data(name='pos_ids', shape=[args.max_seq_len, 1], dtype='int64', lod_level=0)
    input_mask = fluid.layers.data(name='input_mask', shape=[args.max_seq_len, 1], dtype='int64', lod_level=0)
    padded_labels = fluid.layers.data(name='padded_labels', shape=[args.max_seq_len, 1], dtype='int64', lod_level=0)
    seq_lens = fluid.layers.data(name='seq_lens', shape=[1], dtype='int64', lod_level=0)
ernie_inputs = {
"src_ids": src_ids,
"sent_ids": sent_ids,
"pos_ids": pos_ids,
"input_mask": input_mask,
"seq_lens": seq_lens
}
embeddings = ernie_encoder(ernie_inputs, ernie_config=ernie_config)
words = fluid.layers.sequence_unpad(src_ids, seq_lens)
labels = fluid.layers.sequence_unpad(padded_labels, seq_lens)
token_embeddings = embeddings["token_embeddings"]
emission = fluid.layers.fc(
size=args.num_labels,
input=token_embeddings,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-args.init_bound, high=args.init_bound),
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4)))
crf_cost = fluid.layers.linear_chain_crf(
input=emission,
label=labels,
param_attr=fluid.ParamAttr(
name='crfw',
learning_rate=args.crf_learning_rate))
avg_cost = fluid.layers.mean(x=crf_cost)
crf_decode = fluid.layers.crf_decoding(
input=emission, param_attr=fluid.ParamAttr(name='crfw'))
(precision, recall, f1_score, num_infer_chunks, num_label_chunks,
num_correct_chunks) = fluid.layers.chunk_eval(
input=crf_decode,
label=labels,
chunk_scheme="IOB",
num_chunk_types=int(math.ceil((args.num_labels - 1) / 2.0)))
chunk_evaluator = fluid.metrics.ChunkEvaluator()
chunk_evaluator.reset()
    ret = {
        "feed_list": [src_ids, sent_ids, pos_ids, input_mask, padded_labels, seq_lens],
        "words": words,
        "labels": labels,
        "avg_cost": avg_cost,
        "crf_decode": crf_decode,
        "precision": precision,
        "recall": recall,
        "f1_score": f1_score,
        "chunk_evaluator": chunk_evaluator,
        "num_infer_chunks": num_infer_chunks,
        "num_label_chunks": num_label_chunks,
        "num_correct_chunks": num_correct_chunks
    }
return ret
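A condensed sketch of how `create_model` and `create_pyreader` are wired together; this mirrors the pattern used by `eval.py` and `predict.py` below, with `args` loaded from `conf/args.yaml`:

```python
import argparse
import paddle.fluid as fluid
import creator
import reader
import utils

parser = argparse.ArgumentParser()
utils.load_yaml(parser, 'conf/args.yaml')
args = parser.parse_args([])

dataset = reader.Dataset(args)
test_program = fluid.Program()
with fluid.program_guard(test_program, fluid.default_startup_program()):
    with fluid.unique_name.guard():
        ret = creator.create_model(args, dataset.vocab_size, dataset.num_labels, mode='test')
test_program = test_program.clone(for_test=True)

place = fluid.CPUPlace()
pyreader = creator.create_pyreader(args, file_name=args.test_data,
                                   feed_list=ret['feed_list'], place=place,
                                   model='lac', reader=dataset, mode='test')

exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
for data in pyreader():  # each 'data' is ready to feed into exe.run
    exe.run(test_program, feed=data, fetch_list=[ret['crf_decode']])
```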
......@@ -5,9 +5,9 @@ if [ -d ./model_baseline/ ]
then
echo "./model_baseline/ directory already existed, ignore download"
else
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/lexical_analysis-1.0.0.tar.gz
tar xvf lexical_analysis-1.0.0.tar.gz
/bin/rm lexical_analysis-1.0.0.tar.gz
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/lexical_analysis-2.0.0.tar.gz
tar xvf lexical_analysis-2.0.0.tar.gz
/bin/rm lexical_analysis-2.0.0.tar.gz
fi
# download dataset file to ./data/
......@@ -15,9 +15,9 @@ if [ -d ./data/ ]
then
echo "./data/ directory already existed, ignore download"
else
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/lexical_analysis-dataset-1.0.0.tar.gz
tar xvf lexical_analysis-dataset-1.0.0.tar.gz
/bin/rm lexical_analysis-dataset-1.0.0.tar.gz
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/lexical_analysis-dataset-2.0.0.tar.gz
tar xvf lexical_analysis-dataset-2.0.0.tar.gz
/bin/rm lexical_analysis-dataset-2.0.0.tar.gz
fi
# download ERNIE pretrained model to ./pretrained/
......
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# -*- coding: UTF-8 -*-
import argparse
import os
import time
import sys
import paddle.fluid as fluid
import paddle
import utils
import reader
import creator
sys.path.append('../models/')
from model_check import check_cuda
parser = argparse.ArgumentParser(__doc__)
# 1. model parameters
model_g = utils.ArgumentGroup(parser, "model", "model configuration")
model_g.add_arg("word_emb_dim", int, 128, "The dimension in which a word is embedded.")
model_g.add_arg("grnn_hidden_dim", int, 128, "The number of hidden nodes in the GRNN layer.")
model_g.add_arg("bigru_num", int, 2, "The number of bi_gru layers in the network.")
model_g.add_arg("use_cuda", bool, False, "If set, use GPU for training.")
# 2. data parameters
data_g = utils.ArgumentGroup(parser, "data", "data paths")
data_g.add_arg("word_dict_path", str, "./conf/word.dic", "The path of the word dictionary.")
data_g.add_arg("label_dict_path", str, "./conf/tag.dic", "The path of the label dictionary.")
data_g.add_arg("word_rep_dict_path", str, "./conf/q2b.dic", "The path of the word replacement Dictionary.")
data_g.add_arg("test_data", str, "./data/test.tsv", "The folder where the training data is located.")
data_g.add_arg("init_checkpoint", str, "./model_baseline", "Path to init model")
data_g.add_arg("batch_size", int, 200, "The number of sequences contained in a mini-batch, "
"or the maximum number of tokens (include paddings) contained in a mini-batch.")
def do_eval(args):
dataset = reader.Dataset(args)
test_program = fluid.Program()
with fluid.program_guard(test_program, fluid.default_startup_program()):
with fluid.unique_name.guard():
test_ret = creator.create_model(
args, dataset.vocab_size, dataset.num_labels, mode='test')
test_program = test_program.clone(for_test=True)
# init executor
if args.use_cuda:
place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
else:
place = fluid.CPUPlace()
pyreader = creator.create_pyreader(args, file_name=args.test_data,
feed_list=test_ret['feed_list'],
place=place,
model='lac',
reader=dataset,
mode='test')
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
# load model
utils.init_checkpoint(exe, args.init_checkpoint, test_program)
test_process(exe=exe,
program=test_program,
reader=pyreader,
test_ret=test_ret
)
def test_process(exe, program, reader, test_ret):
    """
    the function to execute the evaluation process
    :param exe: the fluid Executor
    :param program: the test program
    :param reader: data reader
    :param test_ret: dict of variables built by creator.create_model
    :return: None; prints precision, recall and F1 on the test set
    """
test_ret["chunk_evaluator"].reset()
start_time = time.time()
for data in reader():
nums_infer, nums_label, nums_correct = exe.run(program,
fetch_list=[
test_ret["num_infer_chunks"],
test_ret["num_label_chunks"],
test_ret["num_correct_chunks"],
],
feed=data,
)
test_ret["chunk_evaluator"].update(nums_infer, nums_label, nums_correct)
precision, recall, f1 = test_ret["chunk_evaluator"].eval()
end_time = time.time()
print("[test] P: %.5f, R: %.5f, F1: %.5f, elapsed time: %.3f s"
% (precision, recall, f1, end_time - start_time))
if __name__ == '__main__':
args = parser.parse_args()
check_cuda(args.use_cuda)
do_eval(args)
# -*- coding: UTF-8 -*-
import argparse
import sys
import os
import numpy as np
import paddle.fluid as fluid
import creator
import reader
import utils
sys.path.append('../models/')
from model_check import check_cuda
def save_inference_model(args):
# model definition
if args.use_cuda:
place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
else:
place = fluid.CPUPlace()
dataset = reader.Dataset(args)
infer_program = fluid.Program()
with fluid.program_guard(infer_program, fluid.default_startup_program()):
with fluid.unique_name.guard():
infer_ret = creator.create_model(
args, dataset.vocab_size, dataset.num_labels, mode='infer')
infer_program = infer_program.clone(for_test=True)
# load pretrain check point
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
utils.init_checkpoint(exe, args.init_checkpoint, infer_program)
fluid.io.save_inference_model(args.inference_save_dir,
['words'],
infer_ret['crf_decode'],
exe,
main_program=infer_program,
model_filename='model.pdmodel',
params_filename='params.pdparams',
)
def test_inference_model(model_dir, text_list, dataset):
    """
    :param model_dir: directory of the saved inference model
    :param text_list: a list of input texts, decoded as unicode
    :param dataset: the reader.Dataset used to map words to ids
    :return: None; prints (word, tag) pairs for each input text
    """
# init executor
if args.use_cuda:
place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
else:
place = fluid.CPUPlace()
exe = fluid.Executor(place)
# transfer text data to input tensor
lod = []
for text in text_list:
lod.append(np.array(dataset.word_to_ids(text.strip())).astype(np.int64))
base_shape = [[len(c) for c in lod]]
tensor_words = fluid.create_lod_tensor(lod, base_shape, place)
    # for empty input, return the same empty tensor as the result
    if sum(base_shape[0]) == 0:
crf_decode = [tensor_words]
else:
# load inference model
inference_scope = fluid.core.Scope()
with fluid.scope_guard(inference_scope):
[inferencer, feed_target_names,
fetch_targets] = fluid.io.load_inference_model(model_dir, exe,
model_filename='model.pdmodel',
params_filename='params.pdparams',
)
assert feed_target_names[0] == "words"
print("Load inference model from %s"%(model_dir))
# get lac result
crf_decode = exe.run(inferencer,
feed={feed_target_names[0]:tensor_words},
fetch_list=fetch_targets,
return_numpy=False,
use_program_cache=True,
)
# parse the crf_decode result
    result = utils.parse_result(tensor_words, crf_decode[0], dataset)
    for i, (sent, tags) in enumerate(result):
        result_list = ['(%s, %s)' % (ch, tag) for ch, tag in zip(sent, tags)]
        print(''.join(result_list))
if __name__=="__main__":
parser = argparse.ArgumentParser(__doc__)
utils.load_yaml(parser,'conf/args.yaml')
args = parser.parse_args()
check_cuda(args.use_cuda)
print("save inference model")
save_inference_model(args)
print("inference model save in %s"%args.inference_save_dir)
print("test inference model")
dataset = reader.Dataset(args)
test_data = [u'百度是一家高科技公司', u'中山大学是岭南第一学府']
test_inference_model(args.inference_save_dir, test_data, dataset)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# -*- coding: UTF-8 -*-
import argparse
import os
import time
import sys
import paddle.fluid as fluid
import paddle
import utils
import reader
import creator
sys.path.append('../models/')
from model_check import check_cuda
parser = argparse.ArgumentParser(__doc__)
# 1. model parameters
model_g = utils.ArgumentGroup(parser, "model", "model configuration")
model_g.add_arg("word_emb_dim", int, 128, "The dimension in which a word is embedded.")
model_g.add_arg("grnn_hidden_dim", int, 256, "The number of hidden nodes in the GRNN layer.")
model_g.add_arg("bigru_num", int, 2, "The number of bi_gru layers in the network.")
model_g.add_arg("use_cuda", bool, False, "If set, use GPU for training.")
# 2. data parameters
data_g = utils.ArgumentGroup(parser, "data", "data paths")
data_g.add_arg("word_dict_path", str, "./conf/word.dic", "The path of the word dictionary.")
data_g.add_arg("label_dict_path", str, "./conf/tag.dic", "The path of the label dictionary.")
data_g.add_arg("word_rep_dict_path", str, "./conf/q2b.dic", "The path of the word replacement Dictionary.")
data_g.add_arg("infer_data", str, "./data/infer.tsv", "The folder where the training data is located.")
data_g.add_arg("init_checkpoint", str, "./model_baseline", "Path to init model")
data_g.add_arg("batch_size", int, 200, "The number of sequences contained in a mini-batch, "
"or the maximum number of tokens (include paddings) contained in a mini-batch.")
def do_infer(args):
dataset = reader.Dataset(args)
infer_program = fluid.Program()
with fluid.program_guard(infer_program, fluid.default_startup_program()):
with fluid.unique_name.guard():
infer_ret = creator.create_model(
args, dataset.vocab_size, dataset.num_labels, mode='infer')
infer_program = infer_program.clone(for_test=True)
if args.use_cuda:
place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
else:
place = fluid.CPUPlace()
pyreader = creator.create_pyreader(args, file_name=args.infer_data,
feed_list=infer_ret['feed_list'],
place=place,
model='lac',
reader=dataset,
mode='infer')
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
# load model
utils.init_checkpoint(exe, args.init_checkpoint, infer_program)
result = infer_process(
exe=exe,
program=infer_program,
reader=pyreader,
fetch_vars=[infer_ret['words'], infer_ret['crf_decode']],
dataset=dataset
)
for sent, tags in result:
result_list = ['(%s, %s)' % (ch, tag) for ch, tag in zip(sent, tags)]
print(''.join(result_list))
def infer_process(exe, program, reader, fetch_vars, dataset):
    """
    the function to execute the infer process
    :param exe: the fluid Executor
    :param program: the infer_program
    :param reader: data reader
    :param fetch_vars: the variables to fetch (words and crf_decode)
    :param dataset: the reader.Dataset used to map ids back to words
    :return: the list of prediction results
    """
    def input_check(data):
        # return the (empty) words tensor for an empty batch, so that empty
        # input yields an empty result instead of a runtime error
        if data[0]['words'].lod()[0][-1] == 0:
            return data[0]['words']
        return None
results = []
for data in reader():
crf_decode = input_check(data)
if crf_decode:
results += utils.parse_result(crf_decode, crf_decode, dataset)
continue
words, crf_decode = exe.run(program,
fetch_list=fetch_vars,
feed=data,
return_numpy=False,
use_program_cache=True,
)
results += utils.parse_result(words, crf_decode, dataset)
return results
if __name__=="__main__":
args = parser.parse_args()
check_cuda(args.use_cuda)
do_infer(args)
......@@ -73,18 +73,18 @@ class Dataset(object):
def get_num_examples(self, filename):
"""num of line of file"""
return sum(1 for line in io.open(filename, "r", encoding='utf-8'))
return sum(1 for line in open(filename, "r"))
def word_to_ids(self, words):
"""convert word to word index"""
word_ids = []
for word in words:
if word in self.word_replace_dict:
word = self.word_replace_dict[word]
word = self.word_replace_dict.get(word, word)
if word not in self.word2id_dict:
word = "OOV"
word_id = self.word2id_dict[word]
word_ids.append(word_id)
return word_ids
def label_to_ids(self, labels):
......@@ -105,20 +105,19 @@ class Dataset(object):
def wrapper():
fread = io.open(filename, "r", encoding="utf-8")
headline = next(fread)
headline = headline.strip().split("\t")
if mode == "infer":
assert len(headline) == 1 and headline[0] == "text_a"
for line in fread:
words = line.strip("\n").split("\002")
                words = line.strip()
word_ids = self.word_to_ids(words)
yield word_ids[0:max_seq_len], [0 for _ in word_ids][
0:max_seq_len]
yield (word_ids[0:max_seq_len],)
else:
assert len(headline) == 2 and headline[
0] == "text_a" and headline[1] == "label"
headline = next(fread)
headline = headline.strip().split('\t')
assert len(headline) == 2 and headline[0] == "text_a" and headline[1] == "label"
for line in fread:
words, labels = line.strip("\n").split("\t")
                    if len(words) < 1:
continue
word_ids = self.word_to_ids(words.split("\002"))
label_ids = self.label_to_ids(labels.split("\002"))
assert len(word_ids) == len(label_ids)
......
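For clarity, a toy illustration of the TSV line format that `file_reader` consumes: within each tab-separated column, words and labels are joined by the `\002` separator (the sample words and tags below are made up):

```python
# -*- coding: UTF-8 -*-
# hypothetical train.tsv line: two tab-separated columns, '\002'-joined within
line = u"百度\002是\002公司\tORG-B\002v-B\002n-B\n"
words, labels = line.strip("\n").split("\t")
print(words.split("\002"))   # ['百度', '是', '公司']
print(labels.split("\002"))  # ['ORG-B', 'v-B', 'n-B']
```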
#!/bin/bash
export FLAGS_fraction_of_gpu_memory_to_use=0.5
export FLAGS_fraction_of_gpu_memory_to_use=0.02
export FLAGS_eager_delete_tensor_gb=0.0
export FLAGS_fast_eager_deletion_mode=1
export CUDA_VISIBLE_DEVICES=2 # which GPU to use
#alias python='./anaconda2/bin/python'
export CUDA_VISIBLE_DEVICES=0,1,2,3 # which GPU to use
function run_train() {
echo "training"
python run_sequence_labeling.py \
--do_train True \
--do_test True \
--do_infer False \
python train.py \
--train_data ./data/train.tsv \
--test_data ./data/test.tsv \
--model_save_dir ./models \
--valid_model_per_batches 1000 \
--save_model_per_batches 10000 \
--batch_size 100 \
--validation_steps 2 \
--save_steps 10 \
--print_steps 1 \
--batch_size 300 \
--epoch 10 \
--use_cuda true \
--traindata_shuffle_buffer 200000 \
--word_emb_dim 768 \
--grnn_hidden_dim 768 \
--traindata_shuffle_buffer 20000 \
--word_emb_dim 128 \
--grnn_hidden_dim 128 \
--bigru_num 2 \
--base_learning_rate 1e-3 \
--emb_learning_rate 5 \
--emb_learning_rate 2 \
--crf_learning_rate 0.2 \
--word_dict_path ./conf/word.dic \
--label_dict_path ./conf/tag.dic \
--word_rep_dict_path ./conf/q2b.dic
--word_rep_dict_path ./conf/q2b.dic \
--enable_ce false \
--use_cuda false \
--cpu_num 1
}
function run_train_single_gpu() {
echo "single gpu training" # which GPU to use
export CUDA_VISIBLE_DEVICES=0
python train.py \
--use_cuda true
}
function run_train_multi_gpu() {
echo "multi gpu training"
export CUDA_VISIBLE_DEVICES=0,1,2,3 # which GPU to use
python train.py \
--use_cuda true
}
function run_train_multi_cpu() {
echo "multi cpu training"
python train.py \
--use_cuda false \
--cpu_num 10 #cpu_num works only when use_cuda=false
}
function run_eval() {
echo "evaluating"
echo "this may cost about 5 minutes if run on you CPU machine"
python run_sequence_labeling.py \
--do_train False \
--do_test True \
--do_infer False \
--batch_size 80 \
--word_emb_dim 768 \
--grnn_hidden_dim 768 \
python eval.py \
--batch_size 200 \
--word_emb_dim 128 \
--grnn_hidden_dim 128 \
--bigru_num 2 \
--use_cuda True \
--use_cuda False \
--init_checkpoint ./model_baseline \
--test_data ./data/test.tsv \
--word_dict_path ./conf/word.dic \
......@@ -54,42 +70,66 @@ function run_eval() {
function run_infer() {
echo "infering"
python run_sequence_labeling.py \
--do_train False \
--do_test False \
--do_infer True \
--batch_size 80 \
--word_emb_dim 768 \
--grnn_hidden_dim 768 \
python predict.py \
--batch_size 200 \
--word_emb_dim 128 \
--grnn_hidden_dim 128 \
--bigru_num 2 \
--use_cuda True \
--init_checkpoint ./model_baseline/ \
--infer_data ./data/test.tsv \
--use_cuda False \
--init_checkpoint ./model_baseline \
--infer_data ./data/infer.tsv \
--word_dict_path ./conf/word.dic \
--label_dict_path ./conf/tag.dic \
--word_rep_dict_path ./conf/q2b.dic
}
function run_inference() {
echo "inference model"
python inference_model.py \
--word_emb_dim 128 \
--grnn_hidden_dim 128 \
--bigru_num 2 \
--use_cuda False \
--init_checkpoint ./model_baseline \
--word_dict_path ./conf/word.dic \
--label_dict_path ./conf/tag.dic \
--word_rep_dict_path ./conf/q2b.dic \
--inference_save_dir ./infer_model
}
function main() {
local cmd=${1:-help}
case "${cmd}" in
train)
run_train "$@";
;;
train_single_gpu)
run_train_single_gpu "$@";
;;
train_multi_gpu)
run_train_multi_gpu "$@";
;;
train_multi_cpu)
run_train_multi_cpu "$@";
;;
eval)
run_eval "$@";
;;
infer)
run_infer "$@";
;;
inference)
run_inference "$@";
;;
help)
echo "Usage: ${BASH_SOURCE} {train|test|infer}";
echo "Usage: ${BASH_SOURCE} {train|train_single_gpu|train_multi_gpu|train_multi_cpu|eval|infer}";
return 0;
;;
*)
echo "unsupport command [${cmd}]";
echo "Usage: ${BASH_SOURCE} {train|eval|infer}";
echo "Usage: ${BASH_SOURCE} {train|train_single_gpu|train_multi_gpu|train_multi_cpu|eval|infer}";
return 1;
;;
esac
......
set -eux
#set -eux
export FLAGS_fraction_of_gpu_memory_to_use=0.02
export FLAGS_eager_delete_tensor_gb=0.0
export FLAGS_fast_eager_deletion_mode=1
export FLAGS_sync_nccl_allreduce=1
export FLAGS_selected_gpus=0 # which GPU to use
export CUDA_VISIBLE_DEVICES=0
# export FLAGS_sync_nccl_allreduce=1
# export NCCL_DEBUG=INFO
# export NCCL_IB_GID_INDEX=3
# export GLOG_v=1
# export GLOG_logtostderr=1
export CUDA_VISIBLE_DEVICES=0 # which GPU to use
ERNIE_PRETRAINED_MODEL_PATH=./pretrained/
ERNIE_FINETUNED_MODEL_PATH=./model_finetuned/
ERNIE_FINETUNED_MODEL_PATH=./model_finetuned
DATA_PATH=./data/
# train
function run_train() {
echo "training"
python run_ernie_sequence_labeling.py \
--mode train \
--ernie_config_path "${ERNIE_PRETRAINED_MODEL_PATH}/ernie_config.json" \
--checkpoints "./checkpoints" \
--model_save_dir "./ernie_models" \
--init_pretraining_params "${ERNIE_PRETRAINED_MODEL_PATH}/params/" \
--epoch 10 \
--save_steps 1000 \
--validation_steps 1000 \
--lr 2e-4 \
--save_steps 5 \
--validation_steps 5 \
--base_learning_rate 2e-4 \
--crf_learning_rate 0.2 \
--init_bound 0.1 \
--skip_steps 1 \
--print_steps 1 \
--vocab_path "${ERNIE_PRETRAINED_MODEL_PATH}/vocab.txt" \
--batch_size 64 \
--batch_size 3 \
--random_seed 0 \
--num_labels 57 \
--max_seq_len 128 \
--train_set "${DATA_PATH}/train.tsv" \
--test_set "${DATA_PATH}/test.tsv" \
--train_data "${DATA_PATH}/train.tsv" \
--test_data "${DATA_PATH}/test.tsv" \
--label_map_config "./conf/label_map.json" \
--do_lower_case true \
--use_cuda false \
--do_train true \
--do_test true \
--do_infer false
--cpu_num 1
}
function run_train_single_gpu() {
echo "single gpu training" # which GPU to use
export CUDA_VISIBLE_DEVICES=0
python run_ernie_sequence_labeling.py \
--mode train \
--ernie_config_path "${ERNIE_PRETRAINED_MODEL_PATH}/ernie_config.json" \
--init_pretraining_params "${ERNIE_PRETRAINED_MODEL_PATH}/params/" \
--vocab_path "${ERNIE_PRETRAINED_MODEL_PATH}/vocab.txt" \
--use_cuda true
}
function run_train_multi_cpu() {
echo "multi cpu training"
python run_ernie_sequence_labeling.py \
--mode train \
--ernie_config_path "${ERNIE_PRETRAINED_MODEL_PATH}/ernie_config.json" \
--init_pretraining_params "${ERNIE_PRETRAINED_MODEL_PATH}/params/" \
--vocab_path "${ERNIE_PRETRAINED_MODEL_PATH}/vocab.txt" \
--use_cuda false \
--batch_size 64 \
--cpu_num 8 #cpu_num works only when use_cuda=false
}
function run_eval() {
echo "evaluating"
python run_ernie_sequence_labeling.py \
--mode eval \
--ernie_config_path "${ERNIE_PRETRAINED_MODEL_PATH}/ernie_config.json" \
--init_pretraining_params "${ERNIE_PRETRAINED_MODEL_PATH}/params/" \
--init_checkpoint "${ERNIE_FINETUNED_MODEL_PATH}" \
--init_bound 0.1 \
--vocab_path "${ERNIE_PRETRAINED_MODEL_PATH}/vocab.txt" \
......@@ -52,21 +77,18 @@ function run_eval() {
--random_seed 0 \
--num_labels 57 \
--max_seq_len 128 \
--test_set "${DATA_PATH}/test.tsv" \
--test_data "${DATA_PATH}/test.tsv" \
--label_map_config "./conf/label_map.json" \
--do_lower_case true \
--use_cuda true \
--do_train false \
--do_test true \
--do_infer false
}
--use_cuda false
}
function run_infer() {
echo "infering"
python run_ernie_sequence_labeling.py \
--mode infer \
--ernie_config_path "${ERNIE_PRETRAINED_MODEL_PATH}/ernie_config.json" \
--init_pretraining_params "${ERNIE_PRETRAINED_MODEL_PATH}/params/" \
--init_checkpoint "${ERNIE_FINETUNED_MODEL_PATH}" \
--init_bound 0.1 \
--vocab_path "${ERNIE_PRETRAINED_MODEL_PATH}/vocab.txt" \
......@@ -74,13 +96,11 @@ function run_infer() {
--random_seed 0 \
--num_labels 57 \
--max_seq_len 128 \
--infer_set "${DATA_PATH}/test.tsv" \
--test_data "${DATA_PATH}/test.tsv" \
--label_map_config "./conf/label_map.json" \
--do_lower_case true \
--use_cuda true \
--do_train false \
--do_test false \
--do_infer true
--use_cuda false
}
......@@ -90,6 +110,12 @@ function main() {
train)
run_train "$@";
;;
train_single_gpu)
run_train_single_gpu "$@";
;;
train_multi_cpu)
run_train_multi_cpu "$@";
;;
eval)
run_eval "$@";
;;
......@@ -97,12 +123,12 @@ function main() {
run_infer "$@";
;;
help)
echo "Usage: ${BASH_SOURCE} {train|test|infer}";
echo "Usage: ${BASH_SOURCE} {train|train_single_gpu|train_multi_cpu|eval|infer}";
return 0;
;;
*)
echo "unsupport command [${cmd}]";
echo "Usage: ${BASH_SOURCE} {train|eval|infer}";
echo "Usage: ${BASH_SOURCE} {train|train_single_gpu|train_multi_cpu|eval|infer}";
return 1;
;;
esac
......
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Baidu's open-source Lexical Analysis tool for Chinese, including:
1. Word Segmentation,
2. Part-of-Speech Tagging
3. Named Entity Recognition
"""
from __future__ import print_function
import os
import sys
import math
import time
import random
import argparse
import multiprocessing
import numpy as np
import paddle
import paddle.fluid as fluid
import reader
import utils
sys.path.append("../")
from models.sequence_labeling import nets
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
# 1. model parameters
model_g = utils.ArgumentGroup(parser, "model", "model configuration")
model_g.add_arg("word_emb_dim", int, 128, "The dimension in which a word is embedded.")
model_g.add_arg("grnn_hidden_dim", int, 256, "The number of hidden nodes in the GRNN layer.")
model_g.add_arg("bigru_num", int, 2, "The number of bi_gru layers in the network.")
# 2. data parameters
data_g = utils.ArgumentGroup(parser, "data", "data paths")
data_g.add_arg("word_dict_path", str, "./conf/word.dic", "The path of the word dictionary.")
data_g.add_arg("label_dict_path", str, "./conf/tag.dic", "The path of the label dictionary.")
data_g.add_arg("word_rep_dict_path", str, "./conf/q2b.dic", "The path of the word replacement Dictionary.")
data_g.add_arg("train_data", str, "./data/train.tsv", "The folder where the training data is located.")
data_g.add_arg("test_data", str, "./data/test.tsv", "The folder where the training data is located.")
data_g.add_arg("infer_data", str, "./data/test.tsv", "The folder where the training data is located.")
data_g.add_arg("model_save_dir", str, "./models", "The model will be saved in this path.")
data_g.add_arg("init_checkpoint", str, "", "Path to init model")
# 3. train parameters
train_g = utils.ArgumentGroup(parser, "training", "training options")
train_g.add_arg("do_train", bool, True, "whether to perform training")
train_g.add_arg("do_test", bool, True, "whether to perform testing")
train_g.add_arg("do_infer", bool, False, "whether to perform inference")
train_g.add_arg("random_seed", int, 0, "random seed for training")
train_g.add_arg("save_model_per_batches", int, 10000, "Save the model once per xxxx batch of training")
train_g.add_arg("valid_model_per_batches", int, 1000, "Do the validation once per xxxx batch of training")
train_g.add_arg("batch_size", int, 80, "The number of sequences contained in a mini-batch, "
"or the maximum number of tokens (include paddings) contained in a mini-batch.")
train_g.add_arg("epoch", int, 10, "corpus iteration num")
train_g.add_arg("use_cuda", bool, False, "If set, use GPU for training.")
train_g.add_arg("traindata_shuffle_buffer", int, 200, "The buffer size used in shuffle the training data.")
train_g.add_arg("base_learning_rate", float, 1e-3, "The basic learning rate that affects the entire network.")
train_g.add_arg("emb_learning_rate", float, 5,
"The real learning rate of the embedding layer will be (emb_learning_rate * base_learning_rate).")
train_g.add_arg("crf_learning_rate", float, 0.2,
"The real learning rate of the embedding layer will be (crf_learning_rate * base_learning_rate).")
parser.add_argument('--enable_ce', action='store_true', help='If set, run the task with continuous evaluation logs.')
args = parser.parse_args()
# yapf: enable.
sys.path.append('../models/')
from model_check import check_cuda
check_cuda(args.use_cuda)
print(args)
def create_model(args, pyreader_name, vocab_size, num_labels):
"""create lac model"""
pyreader = fluid.layers.py_reader(
capacity=50,
shapes=([-1, 1], [-1, 1]),
dtypes=('int64', 'int64'),
lod_levels=(1, 1),
name=pyreader_name,
use_double_buffer=False)
words, targets = fluid.layers.read_file(pyreader)
avg_cost, crf_decode = nets.lex_net(words, targets, args, vocab_size, num_labels)
(precision, recall, f1_score, num_infer_chunks, num_label_chunks,
num_correct_chunks) = fluid.layers.chunk_eval(
input=crf_decode,
label=targets,
chunk_scheme="IOB",
num_chunk_types=int(math.ceil((num_labels - 1) / 2.0)))
chunk_evaluator = fluid.metrics.ChunkEvaluator()
chunk_evaluator.reset()
    ret = {
        "pyreader": pyreader,
        "words": words,
        "targets": targets,
        "avg_cost": avg_cost,
        "crf_decode": crf_decode,
        "chunk_evaluator": chunk_evaluator,
        "num_infer_chunks": num_infer_chunks,
        "num_label_chunks": num_label_chunks,
        "num_correct_chunks": num_correct_chunks
    }
return ret
def evaluate(exe, test_program, test_ret):
"""evaluate for test data"""
test_ret["pyreader"].start()
test_ret["chunk_evaluator"].reset()
loss = []
precision = []
recall = []
f1 = []
start_time = time.time()
while True:
try:
avg_loss, nums_infer, nums_label, nums_correct = exe.run(
test_program,
fetch_list=[
test_ret["avg_cost"],
test_ret["num_infer_chunks"],
test_ret["num_label_chunks"],
test_ret["num_correct_chunks"],
],
)
loss.append(avg_loss)
test_ret["chunk_evaluator"].update(nums_infer, nums_label, nums_correct)
p, r, f = test_ret["chunk_evaluator"].eval()
precision.append(p)
recall.append(r)
f1.append(f)
except fluid.core.EOFException:
test_ret["pyreader"].reset()
break
end_time = time.time()
print("[test] avg loss: %.5f, P: %.5f, R: %.5f, F1: %.5f, elapsed time: %.3f s"
% (np.mean(loss), np.mean(precision),
np.mean(recall), np.mean(f1), end_time - start_time))
def main(args):
startup_program = fluid.Program()
if args.random_seed is not None:
startup_program.random_seed = args.random_seed
# prepare dataset
dataset = reader.Dataset(args)
if args.do_train:
train_program = fluid.Program()
if args.random_seed is not None:
train_program.random_seed = args.random_seed
with fluid.program_guard(train_program, startup_program):
with fluid.unique_name.guard():
train_ret = create_model(
args, "train_reader", dataset.vocab_size, dataset.num_labels)
train_ret["pyreader"].decorate_paddle_reader(
paddle.batch(
paddle.reader.shuffle(
dataset.file_reader(args.train_data),
buf_size=args.traindata_shuffle_buffer
),
batch_size=args.batch_size
)
)
optimizer = fluid.optimizer.Adam(learning_rate=args.base_learning_rate)
optimizer.minimize(train_ret["avg_cost"])
if args.do_test:
test_program = fluid.Program()
with fluid.program_guard(test_program, startup_program):
with fluid.unique_name.guard():
test_ret = create_model(
args, "test_reader", dataset.vocab_size, dataset.num_labels)
test_ret["pyreader"].decorate_paddle_reader(
paddle.batch(
dataset.file_reader(args.test_data),
batch_size=args.batch_size
)
)
test_program = test_program.clone(for_test=True) # to share parameters with train model
if args.do_infer:
infer_program = fluid.Program()
with fluid.program_guard(infer_program, startup_program):
with fluid.unique_name.guard():
infer_ret = create_model(
args, "infer_reader", dataset.vocab_size, dataset.num_labels)
infer_ret["pyreader"].decorate_paddle_reader(
paddle.batch(
dataset.file_reader(args.infer_data),
batch_size=args.batch_size
)
)
infer_program = infer_program.clone(for_test=True)
# init executor
if args.use_cuda:
place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
dev_count = fluid.core.get_cuda_device_count()
else:
place = fluid.CPUPlace()
dev_count = multiprocessing.cpu_count()
exe = fluid.Executor(place)
exe.run(startup_program)
# load checkpoints
if args.do_train:
if args.init_checkpoint:
utils.init_checkpoint(exe, args.init_checkpoint, train_program)
elif args.do_test:
if not args.init_checkpoint:
raise ValueError("args 'init_checkpoint' should be set if only doing validation or testing!")
utils.init_checkpoint(exe, args.init_checkpoint, test_program)
if args.do_infer:
utils.init_checkpoint(exe, args.init_checkpoint, infer_program)
# do start to train
if args.do_train:
num_train_examples = dataset.get_num_examples(args.train_data)
max_train_steps = args.epoch * num_train_examples // args.batch_size
print("Num train examples: %d" % num_train_examples)
print("Max train steps: %d" % max_train_steps)
ce_info = []
batch_id = 0
for epoch_id in range(args.epoch):
train_ret["pyreader"].start()
ce_time = 0
try:
while True:
start_time = time.time()
avg_cost, nums_infer, nums_label, nums_correct = exe.run(
train_program,
fetch_list=[
train_ret["avg_cost"],
train_ret["num_infer_chunks"],
train_ret["num_label_chunks"],
train_ret["num_correct_chunks"],
],
)
end_time = time.time()
train_ret["chunk_evaluator"].reset()
train_ret["chunk_evaluator"].update(nums_infer, nums_label, nums_correct)
precision, recall, f1_score = train_ret["chunk_evaluator"].eval()
batch_id += 1
print("[train] batch_id = %d, loss = %.5f, P: %.5f, R: %.5f, F1: %.5f, elapsed time %.5f " % (
batch_id, avg_cost, precision, recall, f1_score, end_time - start_time))
ce_time += end_time - start_time
ce_info.append([ce_time, avg_cost, precision, recall, f1_score])
# save checkpoints
if (batch_id % args.save_model_per_batches == 0):
save_path = os.path.join(args.model_save_dir, "step_" + str(batch_id))
fluid.io.save_persistables(exe, save_path, train_program)
# evaluate
if (batch_id % args.valid_model_per_batches == 0) and args.do_test:
evaluate(exe, test_program, test_ret)
except fluid.core.EOFException:
save_path = os.path.join(args.model_save_dir, "step_" + str(batch_id))
fluid.io.save_persistables(exe, save_path, train_program)
train_ret["pyreader"].reset()
# break?
if args.do_train and args.enable_ce:
card_num = get_cards()
ce_cost = 0
ce_f1 = 0
ce_p = 0
ce_r = 0
ce_time = 0
try:
ce_time = ce_info[-2][0]
ce_cost = ce_info[-2][1]
ce_p = ce_info[-2][2]
ce_r = ce_info[-2][3]
ce_f1 = ce_info[-2][4]
except:
print("ce info error")
print("kpis\teach_step_duration_card%s\t%s" %
(card_num, ce_time))
print("kpis\ttrain_cost_card%s\t%f" %
(card_num, ce_cost))
print("kpis\ttrain_precision_card%s\t%f" %
(card_num, ce_p))
print("kpis\ttrain_recall_card%s\t%f" %
(card_num, ce_r))
print("kpis\ttrain_f1_card%s\t%f" %
(card_num, ce_f1))
# only test
if args.do_test:
evaluate(exe, test_program, test_ret)
if args.do_infer:
infer_ret["pyreader"].start()
while True:
try:
(words, crf_decode, ) = exe.run(infer_program,
fetch_list=[
infer_ret["words"],
infer_ret["crf_decode"],
],
return_numpy=False)
results = utils.parse_result(words, crf_decode, dataset)
for result in results:
print(result)
except fluid.core.EOFException:
infer_ret["pyreader"].reset()
break
def get_cards():
num = 0
cards = os.environ.get('CUDA_VISIBLE_DEVICES', '')
if cards != '':
num = len(cards.split(","))
return num
if __name__ == "__main__":
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# -*- coding: UTF-8 -*-
import os
import sys
import math
import time
import random
import argparse
import multiprocessing
import numpy as np
import paddle
import paddle.fluid as fluid
import reader
import utils
import creator
from eval import test_process
sys.path.append('../models/')
from model_check import check_cuda
# the function to train model
def do_train(args):
train_program = fluid.default_main_program()
startup_program = fluid.default_startup_program()
dataset = reader.Dataset(args)
with fluid.program_guard(train_program, startup_program):
train_program.random_seed = args.random_seed
startup_program.random_seed = args.random_seed
with fluid.unique_name.guard():
train_ret = creator.create_model(
args, dataset.vocab_size, dataset.num_labels, mode='train')
test_program = train_program.clone(for_test=True)
optimizer = fluid.optimizer.Adam(learning_rate=args.base_learning_rate)
optimizer.minimize(train_ret["avg_cost"])
# init executor
if args.use_cuda:
place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
dev_count = fluid.core.get_cuda_device_count()
else:
dev_count = min(multiprocessing.cpu_count(), args.cpu_num)
        if dev_count < args.cpu_num:
            print("WARNING: The total CPU num of this machine is %d, which is less than the cpu_num you set. "
                  "Changing cpu_num from %d to %d" % (dev_count, args.cpu_num, dev_count))
os.environ['CPU_NUM'] = str(dev_count)
place = fluid.CPUPlace()
train_reader = creator.create_pyreader(args, file_name=args.train_data,
feed_list=train_ret['feed_list'],
place=place,
model='lac',
reader=dataset)
test_reader = creator.create_pyreader(args, file_name=args.test_data,
feed_list=train_ret['feed_list'],
place=place,
model='lac',
reader=dataset,
mode='test')
exe = fluid.Executor(place)
exe.run(startup_program)
if args.init_checkpoint:
utils.init_checkpoint(exe, args.init_checkpoint, train_program)
    if dev_count > 1:
        device = "GPU" if args.use_cuda else "CPU"
        print("%d %s are used to train the model" % (dev_count, device))
# multi cpu/gpu config
exec_strategy = fluid.ExecutionStrategy()
# exec_strategy.num_threads = dev_count * 6
build_strategy = fluid.compiler.BuildStrategy()
# build_strategy.enable_inplace = True
compiled_prog = fluid.compiler.CompiledProgram(train_program).with_data_parallel(
loss_name=train_ret['avg_cost'].name,
build_strategy=build_strategy,
exec_strategy=exec_strategy
)
else:
compiled_prog = fluid.compiler.CompiledProgram(train_program)
# start training
num_train_examples = dataset.get_num_examples(args.train_data)
max_train_steps = args.epoch * num_train_examples // args.batch_size
print("Num train examples: %d" % num_train_examples)
print("Max train steps: %d" % max_train_steps)
ce_info = []
step = 0
for epoch_id in range(args.epoch):
ce_time = 0
for data in train_reader():
            # fetch metrics only on print steps, to minimize fetch ops and keep training fast
if step % args.print_steps == 0:
fetch_list = [
train_ret["avg_cost"],
train_ret["precision"],
train_ret["recall"],
train_ret["f1_score"]
]
else:
fetch_list = []
start_time = time.time()
outputs = exe.run(
compiled_prog,
fetch_list=fetch_list,
feed=data[0],
)
end_time = time.time()
if step % args.print_steps == 0:
avg_cost, precision, recall, f1_score = [np.mean(x) for x in outputs]
print("[train] step = %d, loss = %.5f, P: %.5f, R: %.5f, F1: %.5f, elapsed time %.5f" % (
step, avg_cost, precision, recall, f1_score, end_time - start_time))
if step % args.validation_steps == 0:
test_process(exe, test_program, test_reader, train_ret)
ce_time += end_time - start_time
ce_info.append([ce_time, avg_cost, precision, recall, f1_score])
# save checkpoints
if step % args.save_steps == 0 and step != 0:
save_path = os.path.join(args.model_save_dir, "step_" + str(step))
fluid.io.save_persistables(exe, save_path, train_program)
step += 1
if args.enable_ce:
card_num = get_cards()
ce_cost = 0
ce_f1 = 0
ce_p = 0
ce_r = 0
ce_time = 0
try:
ce_time = ce_info[-2][0]
ce_cost = ce_info[-2][1]
ce_p = ce_info[-2][2]
ce_r = ce_info[-2][3]
ce_f1 = ce_info[-2][4]
        except IndexError:
print("ce info error")
print("kpis\teach_step_duration_card%s\t%s" %
(card_num, ce_time))
print("kpis\ttrain_cost_card%s\t%f" %
(card_num, ce_cost))
print("kpis\ttrain_precision_card%s\t%f" %
(card_num, ce_p))
print("kpis\ttrain_recall_card%s\t%f" %
(card_num, ce_r))
print("kpis\ttrain_f1_card%s\t%f" %
(card_num, ce_f1))
def get_cards():
num = 0
cards = os.environ.get('CUDA_VISIBLE_DEVICES', '')
if cards != '':
num = len(cards.split(","))
return num
if __name__ == "__main__":
    # arguments can be managed with argparse, yaml, or json as needed
    # for NLP tasks we recommend the configure module under PALM, which unifies argparse, yaml, and json config formats
parser = argparse.ArgumentParser(__doc__)
utils.load_yaml(parser,'conf/args.yaml')
args = parser.parse_args()
check_cuda(args.use_cuda)
print(args)
do_train(args)
......@@ -19,6 +19,7 @@ import os
import sys
import numpy as np
import paddle.fluid as fluid
import yaml
def str2bool(v):
......@@ -47,6 +48,21 @@ class ArgumentGroup(object):
help=help + ' Default: %(default)s.',
**kwargs)
def load_yaml(parser, file_name, **kwargs):
with open(file_name) as f:
        args = yaml.safe_load(f)
for title in args:
group = parser.add_argument_group(title=title, description='')
for name in args[title]:
_type = type(args[title][name]['val'])
            _type = str2bool if _type == bool else _type
group.add_argument(
"--"+name,
default=args[title][name]['val'],
type=_type,
help=args[title][name]['meaning'] + ' Default: %(default)s.',
**kwargs)
def print_arguments(args):
"""none"""
......@@ -82,7 +98,7 @@ def to_lodtensor(data, place):
lod.append(cur_len)
flattened_data = np.concatenate(data, axis=0).astype("int64")
flattened_data = flattened_data.reshape([len(flattened_data), 1])
res = fluid.LoDTensor()
res = fluid.Tensor()
res.set(flattened_data, place)
res.set_lod([lod])
return res
......@@ -94,35 +110,38 @@ def parse_result(words, crf_decode, dataset):
words = np.array(words)
crf_decode = np.array(crf_decode)
batch_size = len(offset_list) - 1
batch_out_str = []
for sent_index in range(batch_size):
sent_out_str = ""
sent_len = offset_list[sent_index + 1] - offset_list[sent_index]
last_word = ""
last_tag = ""
for tag_index in range(sent_len): # iterate every word in sent
index = tag_index + offset_list[sent_index]
cur_word_id = str(words[index][0])
cur_tag_id = str(crf_decode[index][0])
cur_word = dataset.id2word_dict[cur_word_id]
cur_tag = dataset.id2label_dict[cur_tag_id]
if last_word == "":
last_word = cur_word
last_tag = cur_tag[:-2]
elif cur_tag.endswith("-B") or cur_tag == "O":
sent_out_str += last_word + u"/" + last_tag + u" "
last_word = cur_word
last_tag = cur_tag[:-2]
elif cur_tag.endswith("-I"):
last_word += cur_word
else:
raise ValueError("invalid tag: %s" % (cur_tag))
if cur_word != "":
sent_out_str += last_word + u"/" + last_tag + u" "
sent_out_str = to_str(sent_out_str.strip())
batch_out_str.append(sent_out_str)
return batch_out_str
    batch_out = []
    for sent_index in range(batch_size):
        begin, end = offset_list[sent_index], offset_list[sent_index + 1]
        sent = [dataset.id2word_dict[str(id[0])] for id in words[begin:end]]
        tags = [dataset.id2label_dict[str(id[0])] for id in crf_decode[begin:end]]

        sent_out = []
        tags_out = []
        partial_word = ""
        for ind, tag in enumerate(tags):
            # for the first word
            if partial_word == "":
                partial_word = sent[ind]
                tags_out.append(tag.split('-')[0])
                continue

            # for the beginning of a word
            if tag.endswith("-B") or (tag == "O" and tags[ind - 1] != "O"):
                sent_out.append(partial_word)
                tags_out.append(tag.split('-')[0])
                partial_word = sent[ind]
                continue

            partial_word += sent[ind]

        # append the last word, except when len(tags) == 0
        if len(sent_out) < len(tags_out):
            sent_out.append(partial_word)
        batch_out.append([sent_out, tags_out])
    return batch_out
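To make the tag-merging loop above concrete, here is a standalone toy re-implementation of the same logic with a hand-made input (the sentence and tags are illustrative):

```python
def merge_tags(sent, tags):
    """standalone re-implementation of parse_result's merging loop"""
    sent_out, tags_out, partial_word = [], [], ""
    for ind, tag in enumerate(tags):
        if partial_word == "":                 # first character
            partial_word = sent[ind]
            tags_out.append(tag.split('-')[0])
            continue
        if tag.endswith("-B") or (tag == "O" and tags[ind - 1] != "O"):
            sent_out.append(partial_word)      # close the previous word
            tags_out.append(tag.split('-')[0])
            partial_word = sent[ind]
            continue
        partial_word += sent[ind]              # '-I' tag: extend current word
    if len(sent_out) < len(tags_out):          # flush the last word
        sent_out.append(partial_word)
    return sent_out, tags_out

print(merge_tags([u"百", u"度", u"是", u"公", u"司"],
                 ["ORG-B", "ORG-I", "O", "n-B", "n-I"]))
# ([u'百度', u'是', u'公司'], ['ORG', 'O', 'n'])
```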
def init_checkpoint(exe, init_checkpoint_path, main_program):
"""
......@@ -146,7 +165,6 @@ def init_checkpoint(exe, init_checkpoint_path, main_program):
predicate=existed_persitables)
print("Load model from {}".format(init_checkpoint_path))
def init_pretraining_params(exe,
pretraining_params_path,
main_program,
......
......@@ -21,15 +21,20 @@ import math
import paddle.fluid as fluid
from paddle.fluid.initializer import NormalInitializer
def lex_net(word, target, args, vocab_size, num_labels):
def lex_net(word, args, vocab_size, num_labels, for_infer=True, target=None):
    """
    define the lexical analysis network structure
    word: stores the input of the model
    for_infer: a boolean value, indicating if the model to be created is for training or predicting.

    return:
        for infer: return the prediction
        otherwise: return the average cost and the prediction
    """
    word_emb_dim = args.word_emb_dim
    grnn_hidden_dim = args.grnn_hidden_dim
    emb_lr = args.emb_learning_rate
    crf_lr = args.crf_learning_rate
    emb_lr = args.emb_learning_rate if 'emb_learning_rate' in dir(args) else 1.0
    crf_lr = args.crf_learning_rate if 'crf_learning_rate' in dir(args) else 1.0
    bigru_num = args.bigru_num
    init_bound = 0.1
    IS_SPARSE = True
......@@ -76,7 +81,7 @@ def lex_net(word, target, args, vocab_size, num_labels):
bi_merge = fluid.layers.concat(input=[gru, gru_r], axis=1)
return bi_merge
def _net_conf(word, target):
def _net_conf(word, target=None):
"""
Configure the network
"""
......@@ -105,16 +110,31 @@ def lex_net(word, target, args, vocab_size, num_labels):
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4)))
crf_cost = fluid.layers.linear_chain_crf(
input=emission,
label=target,
param_attr=fluid.ParamAttr(
name='crfw', learning_rate=crf_lr))
crf_decode = fluid.layers.crf_decoding(
input=emission, param_attr=fluid.ParamAttr(name='crfw'))
avg_cost = fluid.layers.mean(x=crf_cost)
return avg_cost, crf_decode
if target is not None:
crf_cost = fluid.layers.linear_chain_crf(
input=emission,
label=target,
param_attr=fluid.ParamAttr(
name='crfw',
learning_rate=crf_lr))
avg_cost = fluid.layers.mean(x=crf_cost)
crf_decode = fluid.layers.crf_decoding(
input=emission, param_attr=fluid.ParamAttr(name='crfw'))
return avg_cost,crf_decode
        else:
            size = emission.shape[1]
            # create the CRF transition parameter 'crfw' so that crf_decoding
            # can run at inference time without the training-only CRF cost op
            fluid.layers.create_parameter(shape=[size + 2, size],
                                          dtype=emission.dtype,
                                          name='crfw')
            crf_decode = fluid.layers.crf_decoding(
                input=emission, param_attr=fluid.ParamAttr(name='crfw'))
            return crf_decode
    avg_cost, crf_decode = _net_conf(word, target)
    return avg_cost, crf_decode
    if for_infer:
        return _net_conf(word)
    else:
        # assert target is not None, "target is necessary for training"
        return _net_conf(word, target)