未验证 提交 434c5c20 编写于 作者: P pkpk 提交者: GitHub

Merge pull request #39 from 0YuanZhang0/sequence_tagging

add_predict_and_eval
......@@ -25,7 +25,8 @@ from hapi.text.text import TransformerDecoderLayer as TransformerDecoderLayer
from hapi.text.text import TransformerEncoder as TransformerEncoder
from hapi.text.text import TransformerDecoder as TransformerDecoder
from hapi.text.text import TransformerBeamSearchDecoder as TransformerBeamSearchDecoder
from hapi.text.text import DynamicGRU as DynamicGRU
from hapi.text.text import GRUCell as GRUCell
from hapi.text.text import GRUEncoderCell as GRUEncoderCell
from hapi.text.text import BiGRU as BiGRU
from hapi.text.text import Linear_chain_crf as Linear_chain_crf
from hapi.text.text import Crf_decoding as Crf_decoding
......
此差异已折叠。
# 序列标注任务
## 1. 简介
Sequence Tagging,是一个序列标注模型,模型可用于实现,分词、词性标注、专名识别等序列标注任务。我们在自建的数据集上对分词、词性标注、专名识别进行整体的评估效果(即联合标签模型),具体数值见下表;
|模型|Precision|Recall|F1-score|
|:-:|:-:|:-:|:-:|
|Lexical Analysis|88.26%|89.20%|88.73%|
## 2. 快速开始
### 安装说明
#### 1.PaddlePaddle 安装
本项目依赖 PaddlePaddle 1.7 及以上版本和PaddleHub 1.0.0及以上版本 ,PaddlePaddle安装请参考官网 [快速安装](http://www.paddlepaddle.org/paddle#quick-start),PaddleHub安装参考 [PaddleHub](https://github.com/PaddlePaddle/PaddleHub)
> Warning: GPU 和 CPU 版本的 PaddlePaddle 分别是 paddlepaddle-gpu 和 paddlepaddle,请安装时注意区别。
#### 2. 克隆代码
克隆工具集代码库到本地
```bash
git clone https://github.com/PaddlePaddle/hapi.git
cd hapi/sequence_tagging
```
#### 3. 环境依赖
PaddlePaddle的版本要求是:Python 2 版本是 2.7.15+、Python 3 版本是 3.5.1+/3.6/3.7。sequence tagging的代码可支持Python2/3,无具体版本限制
### 数据准备
#### 1. 快速下载
本项目涉及的**数据集****训练模型**的数据可通过执行以下脚本进行快速下载,若仅需使用部分数据或者模型,可根据需要参照2和3进行下载
```bash
python downloads.py all
```
或在支持运行shell脚本的环境下执行:
```bash
sh downloads.sh
```
#### 2. 训练数据集
下载数据集文件,解压后会生成 `./data/` 文件夹
```bash
python downloads.py dataset
```
#### 3. 已训练模型
我们开源了在自建数据集上训练的词法分析模型,可供用户直接使用,可通过下述链接进行下载:
```bash
# download baseline model
python downloads.py lac
```
### 模型训练
基于示例的数据集,可通过下面的命令,在训练集 `./data/train.tsv` 上进行训练;
GPU上单卡训练
```
# setting visible devices for training
export CUDA_VISIBLE_DEVICES=0
python -u train.py \
--train_file ./data/train.tsv \
--test_file ./data/test.tsv \
--word_dict_path ./conf/word.dic \
--label_dict_path ./conf/tag.dic \
--word_rep_dict_path ./conf/q2b.dic \
--device gpu \
--grnn_hidden_dim 128 \
--word_emb_dim 128 \
--bigru_num 2 \
--base_learning_rate 1e-3 \
--batch_size 300 \
--epoch 10 \
--save_dir ./model \
--num_devices 1 \
-d
# -d: 是否使用动态图模式进行训练,如果使用静态图训练,命令行请删除-d参数
```
GPU上多卡训练
```
# setting visible devices for training
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 train.py \
--train_file ./data/train.tsv \
--test_file ./data/test.tsv \
--word_dict_path ./conf/word.dic \
--label_dict_path ./conf/tag.dic \
--word_rep_dict_path ./conf/q2b.dic \
--device gpu \
--grnn_hidden_dim 128 \
--word_emb_dim 128 \
--bigru_num 2 \
--base_learning_rate 1e-3 \
--batch_size 300 \
--epoch 10 \
--save_dir ./model \
-d
# -d: 是否使用动态图模式进行训练,如果使用静态图训练,命令行请删除-d参数
```
CPU上训练
```
python -u train.py \
--train_file ./data/train.tsv \
--test_file ./data/test.tsv \
--word_dict_path ./conf/word.dic \
--label_dict_path ./conf/tag.dic \
--word_rep_dict_path ./conf/q2b.dic \
--device cpu \
--grnn_hidden_dim 128 \
--word_emb_dim 128 \
--bigru_num 2 \
--base_learning_rate 1e-3 \
--batch_size 300 \
--epoch 10 \
--save_dir ./model \
-d
```
### 模型预测
加载已有的模型,对未知的数据进行预测
```bash
python predict.py \
--predict_file ./data/infer.tsv \
--word_dict_path ./conf/word.dic \
--label_dict_path ./conf/tag.dic \
--word_rep_dict_path ./conf/q2b.dic \
--init_from_checkpoint model_baseline/params \
--output_file predict.result \
--mode predict \
--device cpu \
-d
# -d: 是否使用动态图模式进行训练,如果使用静态图训练,命令行请删除-d参数
```
### 模型评估
我们基于自建的数据集训练了一个词法分析的模型,可以直接用这个模型对测试集 `./data/test.tsv` 进行验证,
```bash
# baseline model
python eval.py \
--test_file ./data/test.tsv \
--word_dict_path ./conf/word.dic \
--label_dict_path ./conf/tag.dic \
--word_rep_dict_path ./conf/q2b.dic \
--init_from_checkpoint ./model_baseline/params \
--device cpu \
-d
# -d: 是否使用动态图模式进行训练,如果使用静态图训练,命令行请删除-d参数
```
## 3. 进阶使用
### 任务定义与建模
序列标注任务的输入是一个字符串(我们后面使用『句子』来指代它),而输出是句子中的词边界和类别。序列标注是词法分析的经典建模方式。我们使用基于 GRU 的网络结构学习特征,将学习到的特征接入 CRF 解码层完成序列标注。CRF 解码层本质上是将传统 CRF 中的线性模型换成了非线性神经网络,基于句子级别的似然概率,因而能够更好的解决标记偏置问题。模型要点如下。
1. 输入采用 one-hot 方式表示,每个字以一个 id 表示
2. one-hot 序列通过字表,转换为实向量表示的字向量序列;
3. 字向量序列作为双向 GRU 的输入,学习输入序列的特征表示,得到新的特性表示序列,我们堆叠了两层双向GRU以增加学习能力;
4. CRF 以 GRU 学习到的特征为输入,以标记序列为监督信号,实现序列标注。
可供用户下载的自有数据是对分词、词性标注、专名识别同时标注的联合数据集,进行词性和专名类别标签集合如下表,其中词性标签 24 个(小写字母),专名类别标签 4 个(大写字母)。这里需要说明的是,人名、地名、机构名和时间四个类别,在上表中存在两套标签(PER / LOC / ORG / TIME 和 nr / ns / nt / t),被标注为第二套标签的词,是模型判断为低置信度的人名、地名、机构名和时间词。开发者可以基于这两套标签,在四个类别的准确、召回之间做出自己的权衡。
| 标签 | 含义 | 标签 | 含义 | 标签 | 含义 | 标签 | 含义 |
| ---- | -------- | ---- | -------- | ---- | -------- | ---- | -------- |
| n | 普通名词 | f | 方位名词 | s | 处所名词 | t | 时间 |
| nr | 人名 | ns | 地名 | nt | 机构名 | nw | 作品名 |
| nz | 其他专名 | v | 普通动词 | vd | 动副词 | vn | 名动词 |
| a | 形容词 | ad | 副形词 | an | 名形词 | d | 副词 |
| m | 数量词 | q | 量词 | r | 代词 | p | 介词 |
| c | 连词 | u | 助词 | xc | 其他虚词 | w | 标点符号 |
| PER | 人名 | LOC | 地名 | ORG | 机构名 | TIME | 时间 |
### 模型原理介绍
上面介绍的模型原理如下图所示:<br />
<p align="center">
<img src="./images/gru-crf-model.png" width = "340" height = "300" /> <br />
Overall Architecture of GRU-CRF-MODEL
</p>
### 数据格式
训练使用的数据可以由用户根据实际的应用场景,自己组织数据。除了第一行是 `text_a\tlabel` 固定的开头,后面的每行数据都是由两列组成,以制表符分隔,第一列是 utf-8 编码的中文文本,以 `\002` 分割,第二列是对应每个字的标注,以 `\002` 分隔。我们采用 IOB2 标注体系,即以 X-B 作为类型为 X 的词的开始,以 X-I 作为类型为 X 的词的持续,以 O 表示不关注的字(实际上,在词性、专名联合标注中,不存在 O )。示例如下:
```text
除\002了\002他\002续\002任\002十\002二\002届\002政\002协\002委\002员\002,\002马\002化\002腾\002,\002雷\002军\002,\002李\002彦\002宏\002也\002被\002推\002选\002为\002新\002一\002届\002全\002国\002人\002大\002代\002表\002或\002全\002国\002政\002协\002委\002员 p-B\002p-I\002r-B\002v-B\002v-I\002m-B\002m-I\002m-I\002ORG-B\002ORG-I\002n-B\002n-I\002w-B\002PER-B\002PER-I\002PER-I\002w-B\002PER-B\002PER-I\002w-B\002PER-B\002PER-I\002PER-I\002d-B\002p-B\002v-B\002v-I\002v-B\002a-B\002m-B\002m-I\002ORG-B\002ORG-I\002ORG-I\002ORG-I\002n-B\002n-I\002c-B\002n-B\002n-I\002ORG-B\002ORG-I\002n-B\002n-I
```
+ 我们随同代码一并发布了完全版的模型和相关的依赖数据。但是,由于模型的训练数据过于庞大,我们没有发布训练数据,仅在`data`目录下放置少数样本用以示例输入数据格式。
+ 模型依赖数据包括:
1. 输入文本的词典,在`conf`目录下,对应`word.dic`
2. 对输入文本中特殊字符进行转换的字典,在`conf`目录下,对应`q2b.dic`
3. 标记标签的词典,在`conf`目录下,对应`tag.dic`
+ 在训练和预测阶段,我们都需要进行原始数据的预处理,具体处理工作包括:
1. 从原始数据文件中抽取出句子和标签,构造句子序列和标签序列
2. 将句子序列中的特殊字符进行转换
3. 依据词典获取词对应的整数索引
### 代码结构说明
```text
├── README.md # 本文档
├── data/ # 存放数据集的目录
├── conf/ # 词典及程序默认配置的目录
├── images/ # 文档图片存放位置
├── utils/ # 常用工具函数
├── train.py # 训练脚本
├── predict.py # 预测脚本
├── eval.py # 词法分析评估的脚本
├── downloads.py # 用于下载数据和模型的脚本
├── downloads.sh # 用于下载数据和模型的脚本
└──reader.py # 文件读取相关函数
```
## 4. 其他
### 在论文中引用 sequence tagging
如果您的学术工作成果中使用了 sequence tagging,请您增加下述引用。我们非常欣慰sequence tagging模型能够对您的学术工作带来帮助。
```text
@article{jiao2018LAC,
title={Chinese Lexical Analysis with Deep Bi-GRU-CRF Network},
author={Jiao, Zhenyu and Sun, Shuqi and Sun, Ke},
journal={arXiv preprint arXiv:1807.01882},
year={2018},
url={https://arxiv.org/abs/1807.01882}
}
```
### 如何贡献代码
如果你可以修复某个 issue 或者增加一个新功能,欢迎给我们提交PR。如果对应的PR被接受了,我们将根据贡献的质量和难度 进行打分(0-5分,越高越好)。如果你累计获得了 10 分,可以联系我们获得面试机会或为你写推荐信。
 
、 ,
。 .
— -
~ ~
‖ |
… .
‘ '
’ '
“ "
” "
〔 (
〕 )
〈 <
〉 >
「 '
」 '
『 "
』 "
〖 [
〗 ]
【 [
】 ]
∶ :
$ $
! !
" "
# #
% %
& &
' '
( (
) )
* *
+ +
, ,
- -
. .
/ /
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
: :
; ;
< <
= =
> >
? ?
@ @
A a
B b
C c
D d
E e
F f
G g
H h
I i
J j
K k
L l
M m
N n
O o
P p
Q q
R r
S s
T t
U u
V v
W w
X x
Y y
Z z
[ [
\ \
] ]
^ ^
_ _
` `
a a
b b
c c
d d
e e
f f
g g
h h
i i
j j
k k
l l
m m
n n
o o
p p
q q
r r
s s
t t
u u
v v
w w
x x
y y
z z
{ {
| |
} }
 ̄ ~
〝 "
〞 "
﹐ ,
﹑ ,
﹒ .
﹔ ;
﹕ :
﹖ ?
﹗ !
﹙ (
﹚ )
﹛ {
﹜ {
﹝ [
﹞ ]
﹟ #
﹠ &
﹡ *
﹢ +
﹣ -
﹤ <
﹥ >
﹦ =
﹨ \
﹩ $
﹪ %
﹫ @
,
A a
B b
C c
D d
E e
F f
G g
H h
I i
J j
K k
L l
M m
N n
O o
P p
Q q
R r
S s
T t
U u
V v
W w
X x
Y y
Z z
0 a-B
1 a-I
2 ad-B
3 ad-I
4 an-B
5 an-I
6 c-B
7 c-I
8 d-B
9 d-I
10 f-B
11 f-I
12 m-B
13 m-I
14 n-B
15 n-I
16 nr-B
17 nr-I
18 ns-B
19 ns-I
20 nt-B
21 nt-I
22 nw-B
23 nw-I
24 nz-B
25 nz-I
26 p-B
27 p-I
28 q-B
29 q-I
30 r-B
31 r-I
32 s-B
33 s-I
34 t-B
35 t-I
36 u-B
37 u-I
38 v-B
39 v-I
40 vd-B
41 vd-I
42 vn-B
43 vn-I
44 w-B
45 w-I
46 xc-B
47 xc-I
48 PER-B
49 PER-I
50 LOC-B
51 LOC-I
52 ORG-B
53 ORG-I
54 TIME-B
55 TIME-I
56 O
此差异已折叠。
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Download script, download dataset and pretrain models.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import io
import os
import sys
import time
import hashlib
import tarfile
import requests
FILE_INFO = {
'BASE_URL': 'https://baidu-nlp.bj.bcebos.com/',
'DATA': {
'name': 'lexical_analysis-dataset-2.0.0.tar.gz',
'md5': '71e4a9a36d0f0177929a1bccedca7dba'
},
'LAC_MODEL': {
'name': 'lexical_analysis-2.0.0.tar.gz',
'md5': "fc1daef00de9564083c7dc7b600504ca"
},
}
def usage():
desc = ("\nDownload datasets and pretrained models for LAC.\n"
"Usage:\n"
" 1. python download.py all\n"
" 2. python download.py dataset\n"
" 3. python download.py lac\n")
print(desc)
def md5file(fname):
hash_md5 = hashlib.md5()
with io.open(fname, "rb") as fin:
for chunk in iter(lambda: fin.read(4096), b""):
hash_md5.update(chunk)
return hash_md5.hexdigest()
def extract(fname, dir_path):
"""
Extract tar.gz file
"""
try:
tar = tarfile.open(fname, "r:gz")
file_names = tar.getnames()
for file_name in file_names:
tar.extract(file_name, dir_path)
print(file_name)
tar.close()
except Exception as e:
raise e
def _download(url, filename, md5sum):
"""
Download file and check md5
"""
retry = 0
retry_limit = 3
chunk_size = 4096
while not (os.path.exists(filename) and md5file(filename) == md5sum):
if retry < retry_limit:
retry += 1
else:
raise RuntimeError(
"Cannot download dataset ({0}) with retry {1} times.".format(
url, retry_limit))
try:
start = time.time()
size = 0
res = requests.get(url, stream=True)
filesize = int(res.headers['content-length'])
if res.status_code == 200:
print("[Filesize]: %0.2f MB" % (filesize / 1024 / 1024))
# save by chunk
with io.open(filename, "wb") as fout:
for chunk in res.iter_content(chunk_size=chunk_size):
if chunk:
fout.write(chunk)
size += len(chunk)
pr = '>' * int(size * 50 / filesize)
print(
'\r[Process ]: %s%.2f%%' %
(pr, float(size / filesize * 100)),
end='')
end = time.time()
print("\n[CostTime]: %.2f s" % (end - start))
except Exception as e:
print(e)
def download(name, dir_path):
url = FILE_INFO['BASE_URL'] + FILE_INFO[name]['name']
file_path = os.path.join(dir_path, FILE_INFO[name]['name'])
if not os.path.exists(dir_path):
os.makedirs(dir_path)
# download data
print("Downloading : %s" % name)
_download(url, file_path, FILE_INFO[name]['md5'])
# extract data
print("Extracting : %s" % file_path)
extract(file_path, dir_path)
os.remove(file_path)
if __name__ == '__main__':
if len(sys.argv) != 2:
usage()
sys.exit(1)
pwd = os.path.join(os.path.dirname(__file__), './')
ernie_dir = os.path.join(os.path.dirname(__file__), './pretrained')
if sys.argv[1] == 'all':
download('DATA', pwd)
download('LAC_MODEL', pwd)
if sys.argv[1] == "dataset":
download('DATA', pwd)
elif sys.argv[1] == "lac":
download('LAC_MODEL', pwd)
else:
usage()
#!/bin/bash
# download baseline model file to ./model_baseline/
if [ -d ./model_baseline/ ]
then
echo "./model_baseline/ directory already existed, ignore download"
else
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/lexical_analysis-2.0.0.tar.gz
tar xvf lexical_analysis-2.0.0.tar.gz
/bin/rm lexical_analysis-2.0.0.tar.gz
fi
# download dataset file to ./data/
if [ -d ./data/ ]
then
echo "./data/ directory already existed, ignore download"
else
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/lexical_analysis-dataset-2.0.0.tar.gz
tar xvf lexical_analysis-dataset-2.0.0.tar.gz
/bin/rm lexical_analysis-dataset-2.0.0.tar.gz
fi
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
SequenceTagging network structure
"""
from __future__ import division
from __future__ import print_function
import io
import os
import sys
import math
import argparse
import numpy as np
from train import SeqTagging
from utils.check import check_gpu, check_version
from utils.metrics import chunk_count
from reader import LacDataset, create_lexnet_data_generator, create_dataloader
work_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(work_dir)
from hapi.model import set_device, Input
import paddle.fluid as fluid
from paddle.fluid.optimizer import AdamOptimizer
from paddle.fluid.layers.utils import flatten
def main(args):
place = set_device(args.device)
fluid.enable_dygraph(place) if args.dynamic else None
inputs = [Input([None, None], 'int64', name='words'),
Input([None], 'int64', name='length')]
feed_list = None if args.dynamic else [x.forward() for x in inputs]
dataset = LacDataset(args)
eval_path = args.test_file
chunk_evaluator = fluid.metrics.ChunkEvaluator()
chunk_evaluator.reset()
eval_generator = create_lexnet_data_generator(
args, reader=dataset, file_name=eval_path, place=place, mode="test")
eval_dataset = create_dataloader(
eval_generator, place, feed_list=feed_list)
vocab_size = dataset.vocab_size
num_labels = dataset.num_labels
model = SeqTagging(args, vocab_size, num_labels)
optim = AdamOptimizer(
learning_rate=args.base_learning_rate,
parameter_list=model.parameters())
model.mode = "test"
model.prepare(inputs=inputs)
model.load(args.init_from_checkpoint, skip_mismatch=True)
for data in eval_dataset():
if len(data) == 1:
batch_data = data[0]
targets = np.array(batch_data[2])
else:
batch_data = data
targets = batch_data[2].numpy()
inputs_data = [batch_data[0], batch_data[1]]
crf_decode, length = model.test(inputs=inputs_data)
num_infer_chunks, num_label_chunks, num_correct_chunks = chunk_count(crf_decode, targets, length, dataset.id2label_dict)
chunk_evaluator.update(num_infer_chunks, num_label_chunks, num_correct_chunks)
precision, recall, f1 = chunk_evaluator.eval()
print("[test] P: %.5f, R: %.5f, F1: %.5f" % (precision, recall, f1))
if __name__ == '__main__':
parser = argparse.ArgumentParser("sequence tagging training")
parser.add_argument(
"-wd",
"--word_dict_path",
default=None,
type=str,
help='word dict path')
parser.add_argument(
"-ld",
"--label_dict_path",
default=None,
type=str,
help='label dict path')
parser.add_argument(
"-wrd",
"--word_rep_dict_path",
default=None,
type=str,
help='The path of the word replacement Dictionary.')
parser.add_argument(
"-dev",
"--device",
type=str,
default='gpu',
help="device to use, gpu or cpu")
parser.add_argument(
"-d", "--dynamic", action='store_true', help="enable dygraph mode")
parser.add_argument(
"-e", "--epoch", default=10, type=int, help="number of epoch")
parser.add_argument(
'-lr',
'--base_learning_rate',
default=1e-3,
type=float,
metavar='LR',
help='initial learning rate')
parser.add_argument(
"--word_emb_dim",
default=128,
type=int,
help='word embedding dimension')
parser.add_argument(
"--grnn_hidden_dim", default=128, type=int, help="hidden dimension")
parser.add_argument(
"--bigru_num", default=2, type=int, help='the number of bi-rnn')
parser.add_argument("-elr", "--emb_learning_rate", default=1.0, type=float)
parser.add_argument("-clr", "--crf_learning_rate", default=1.0, type=float)
parser.add_argument(
"-b", "--batch_size", default=300, type=int, help="batch size")
parser.add_argument(
"--max_seq_len", default=126, type=int, help="max sequence length")
parser.add_argument(
"-n", "--num_devices", default=1, type=int, help="number of devices")
parser.add_argument(
"-o",
"--save_dir",
default="./model",
type=str,
help="save model path")
parser.add_argument(
"--init_from_checkpoint",
default=None,
type=str,
help="load init model parameters")
parser.add_argument(
"--init_from_pretrain_model",
default=None,
type=str,
help="load pretrain model parameters")
parser.add_argument(
"-sf", "--save_freq", default=1, type=int, help="save frequency")
parser.add_argument(
"-ef", "--eval_freq", default=1, type=int, help="eval frequency")
parser.add_argument(
"--output_file", default="predict.result", type=str, help="predict output file")
parser.add_argument(
"--predict_file", default="./data/infer.tsv", type=str, help="predict output file")
parser.add_argument(
"--test_file", default="./data/test.tsv", type=str, help="predict and eval output file")
parser.add_argument(
"--train_file", default="./data/train.tsv", type=str, help="train file")
parser.add_argument(
"--mode", default="predict", type=str, help="train|test|predict")
args = parser.parse_args()
print(args)
use_gpu = True if args.device == "gpu" else False
check_gpu(use_gpu)
check_version()
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
SequenceTagging network structure
"""
from __future__ import division
from __future__ import print_function
import io
import os
import sys
import math
import argparse
import numpy as np
from train import SeqTagging
from utils.check import check_gpu, check_version
from reader import LacDataset, create_lexnet_data_generator, create_dataloader
work_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(work_dir)
from hapi.model import set_device, Input
import paddle.fluid as fluid
from paddle.fluid.optimizer import AdamOptimizer
from paddle.fluid.layers.utils import flatten
def main(args):
place = set_device(args.device)
fluid.enable_dygraph(place) if args.dynamic else None
inputs = [Input([None, None], 'int64', name='words'),
Input([None], 'int64', name='length')]
feed_list = None if args.dynamic else [x.forward() for x in inputs]
dataset = LacDataset(args)
predict_path = args.predict_file
predict_generator = create_lexnet_data_generator(
args, reader=dataset, file_name=predict_path, place=place, mode="predict")
predict_dataset = create_dataloader(
predict_generator, place, feed_list=feed_list)
vocab_size = dataset.vocab_size
num_labels = dataset.num_labels
model = SeqTagging(args, vocab_size, num_labels)
optim = AdamOptimizer(
learning_rate=args.base_learning_rate,
parameter_list=model.parameters())
model.mode = "test"
model.prepare(inputs=inputs)
model.load(args.init_from_checkpoint)
f = open(args.output_file, "wb")
for data in predict_dataset():
if len(data) == 1:
input_data = data[0]
else:
input_data = data
results, length = model.test(inputs=flatten(input_data))
for i in range(len(results)):
word_len = length[i]
word_ids = results[i][: word_len]
tags = [dataset.id2label_dict[str(id)] for id in word_ids]
f.write("\002".join(tags) + "\n")
if __name__ == '__main__':
parser = argparse.ArgumentParser("sequence tagging training")
parser.add_argument(
"-wd",
"--word_dict_path",
default=None,
type=str,
help='word dict path')
parser.add_argument(
"-ld",
"--label_dict_path",
default=None,
type=str,
help='label dict path')
parser.add_argument(
"-wrd",
"--word_rep_dict_path",
default=None,
type=str,
help='The path of the word replacement Dictionary.')
parser.add_argument(
"-dev",
"--device",
type=str,
default='gpu',
help="device to use, gpu or cpu")
parser.add_argument(
"-d", "--dynamic", action='store_true', help="enable dygraph mode")
parser.add_argument(
"-e", "--epoch", default=10, type=int, help="number of epoch")
parser.add_argument(
'-lr',
'--base_learning_rate',
default=1e-3,
type=float,
metavar='LR',
help='initial learning rate')
parser.add_argument(
"--word_emb_dim",
default=128,
type=int,
help='word embedding dimension')
parser.add_argument(
"--grnn_hidden_dim", default=128, type=int, help="hidden dimension")
parser.add_argument(
"--bigru_num", default=2, type=int, help='the number of bi-rnn')
parser.add_argument("-elr", "--emb_learning_rate", default=1.0, type=float)
parser.add_argument("-clr", "--crf_learning_rate", default=1.0, type=float)
parser.add_argument(
"-b", "--batch_size", default=300, type=int, help="batch size")
parser.add_argument(
"--max_seq_len", default=126, type=int, help="max sequence length")
parser.add_argument(
"-n", "--num_devices", default=1, type=int, help="number of devices")
parser.add_argument(
"-o",
"--save_dir",
default="./model",
type=str,
help="save model path")
parser.add_argument(
"--init_from_checkpoint",
default=None,
type=str,
help="load init model parameters")
parser.add_argument(
"--init_from_pretrain_model",
default=None,
type=str,
help="load pretrain model parameters")
parser.add_argument(
"-sf", "--save_freq", default=1, type=int, help="save frequency")
parser.add_argument(
"-ef", "--eval_freq", default=1, type=int, help="eval frequency")
parser.add_argument(
"--output_file", default="predict.result", type=str, help="predict output file")
parser.add_argument(
"--predict_file", default="./data/infer.tsv", type=str, help="predict output file")
parser.add_argument(
"--mode", default="train", type=str, help="train|test|predict")
args = parser.parse_args()
print(args)
use_gpu = True if args.device == "gpu" else False
check_gpu(use_gpu)
check_version()
main(args)
......@@ -21,7 +21,7 @@ from __future__ import print_function
import io
import numpy as np
import paddle.fluid as fluid
import paddle
class LacDataset(object):
......@@ -76,11 +76,11 @@ class LacDataset(object):
@property
def vocab_size(self):
return len(self.word2id_dict.values())
return max(self.word2id_dict.values()) + 1
@property
def num_labels(self):
return len(self.label2id_dict.values())
return max(self.label2id_dict.values()) + 1
def get_num_examples(self, filename):
"""num of line of file"""
......@@ -120,48 +120,121 @@ class LacDataset(object):
def wrapper():
fread = io.open(filename, "r", encoding="utf-8")
headline = next(fread)
headline = headline.strip().split('\t')
assert len(headline) == 2 and headline[0] == "text_a" and headline[
1] == "label"
buf = []
for line in fread:
words, labels = line.strip("\n").split("\t")
if len(words) < 1:
continue
word_ids = self.word_to_ids(words.split("\002"))
label_ids = self.label_to_ids(labels.split("\002"))
assert len(word_ids) == len(label_ids)
word_ids = word_ids[0:max_seq_len]
words_len = np.int64(len(word_ids))
word_ids += [0 for _ in range(max_seq_len - words_len)]
label_ids = label_ids[0:max_seq_len]
label_ids += [0 for _ in range(max_seq_len - words_len)]
assert len(word_ids) == len(label_ids)
yield word_ids, label_ids, words_len
if mode == "train":
headline = next(fread)
headline = headline.strip().split('\t')
assert len(headline) == 2 and headline[0] == "text_a" and headline[
1] == "label"
buf = []
for line in fread:
words, labels = line.strip("\n").split("\t")
if len(words) < 1:
continue
word_ids = self.word_to_ids(words.split("\002"))
label_ids = self.label_to_ids(labels.split("\002"))
assert len(word_ids) == len(label_ids)
words_len = np.int64(len(word_ids))
word_ids = word_ids[0:max_seq_len]
words_len = np.int64(len(word_ids))
word_ids += [0 for _ in range(max_seq_len - words_len)]
label_ids = label_ids[0:max_seq_len]
label_ids += [0 for _ in range(max_seq_len - words_len)]
assert len(word_ids) == len(label_ids)
yield word_ids, label_ids, words_len
elif mode == "test":
headline = next(fread)
headline = headline.strip().split('\t')
assert len(headline) == 2 and headline[0] == "text_a" and headline[
1] == "label"
buf = []
for line in fread:
words, labels = line.strip("\n").split("\t")
if len(words) < 1:
continue
word_ids = self.word_to_ids(words.split("\002"))
label_ids = self.label_to_ids(labels.split("\002"))
assert len(word_ids) == len(label_ids)
words_len = np.int64(len(word_ids))
yield word_ids, label_ids, words_len
else:
for line in fread:
words = line.strip("\n").split('\t')[0]
if words == u"text_a":
continue
if "\002" not in words:
word_ids = self.word_to_ids(words)
else:
word_ids = self.word_to_ids(words.split("\002"))
words_len = np.int64(len(word_ids))
yield word_ids, words_len
fread.close()
return wrapper
def create_lexnet_data_generator(args, reader, file_name, place, mode="train"):
def wrapper():
batch_words, batch_labels, seq_lens = [], [], []
for epoch in xrange(args.epoch):
def create_lexnet_data_generator(args, reader, file_name, place, mode="train"):
def padding_data(max_len, batch_data):
padding_batch_data = []
for data in batch_data:
data += [0 for _ in range(max_len - len(data))]
padding_batch_data.append(data)
return padding_batch_data
def wrapper():
if mode == "train":
batch_words, batch_labels, seq_lens = [], [], []
for epoch in xrange(args.epoch):
for instance in reader.file_reader(
file_name, mode, max_seq_len=args.max_seq_len)():
words, labels, words_len = instance
if len(seq_lens) < args.batch_size:
batch_words.append(words)
batch_labels.append(labels)
seq_lens.append(words_len)
if len(seq_lens) == args.batch_size:
yield batch_words, seq_lens, batch_labels, batch_labels
batch_words, batch_labels, seq_lens = [], [], []
if len(seq_lens) > 0:
yield batch_words, seq_lens, batch_labels, batch_labels
elif mode == "test":
batch_words, batch_labels, seq_lens, max_len = [], [], [], 0
for instance in reader.file_reader(
file_name, mode, max_seq_len=args.max_seq_len)():
file_name, mode, max_seq_len=args.max_seq_len)():
words, labels, words_len = instance
max_len = words_len if words_len > max_len else max_len
if len(seq_lens) < args.batch_size:
batch_words.append(words)
batch_labels.append(labels)
seq_lens.append(words_len)
if len(seq_lens) == args.batch_size:
yield batch_words, batch_labels, seq_lens, batch_labels
batch_words, batch_labels, seq_lens = [], [], []
batch_labels.append(labels)
if len(seq_lens) == args.batch_size:
padding_batch_words = padding_data(max_len, batch_words)
padding_batch_labels = padding_data(max_len, batch_labels)
yield padding_batch_words, seq_lens, padding_batch_labels, padding_batch_labels
batch_words, batch_labels, seq_lens, max_len = [], [], [], 0
if len(seq_lens) > 0:
padding_batch_words = padding_data(max_len, batch_words)
padding_batch_labels = padding_data(max_len, batch_labels)
yield padding_batch_words, seq_lens, padding_batch_labels, padding_batch_labels
if len(seq_lens) > 0:
yield batch_words, batch_labels, seq_lens, batch_labels
batch_words, batch_labels, seq_lens = [], [], []
else:
batch_words, seq_lens, max_len = [], [], 0
for instance in reader.file_reader(
file_name, mode, max_seq_len=args.max_seq_len)():
words, words_len = instance
if len(seq_lens) < args.batch_size:
batch_words.append(words)
seq_lens.append(words_len)
max_len = words_len if words_len > max_len else max_len
if len(seq_lens) == args.batch_size:
padding_batch_words = padding_data(max_len, batch_words)
yield padding_batch_words, seq_lens
batch_words, seq_lens, max_len = [], [], 0
if len(seq_lens) > 0:
padding_batch_words = padding_data(max_len, batch_words)
yield padding_batch_words, seq_lens
return wrapper
......
......@@ -24,11 +24,15 @@ import sys
import math
import argparse
import numpy as np
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from metrics import Metric
from model import Model, Input, Loss, set_device
from text import SequenceTagging
work_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(os.path.join(work_dir))
from hapi.metrics import Metric
from hapi.model import Model, Input, Loss, set_device
from hapi.text.text import SequenceTagging
from utils.check import check_gpu, check_version
from reader import LacDataset, create_lexnet_data_generator, create_dataloader
import paddle.fluid as fluid
......@@ -47,6 +51,7 @@ class SeqTagging(Model):
for infer: return the prediction
otherwise: return the prediction
"""
self.mode_type = args.mode
self.word_emb_dim = args.word_emb_dim
self.vocab_size = vocab_size
self.num_labels = num_labels
......@@ -72,12 +77,18 @@ class SeqTagging(Model):
init_bound=self.init_bound,
length=self.length)
def forward(self, word, target, lengths):
def forward(self, *inputs):
"""
Configure the network
"""
crf_decode, avg_cost, lengths = self.sequence_tagging(word, target, lengths)
return crf_decode, avg_cost, lengths
word = inputs[0]
lengths = inputs[1]
if self.mode_type == "train" or self.mode_type == "test":
target = inputs[2]
outputs = self.sequence_tagging(word, lengths, target)
else:
outputs = self.sequence_tagging(word, lengths)
return outputs
class Chunk_eval(fluid.dygraph.Layer):
......@@ -103,10 +114,9 @@ class Chunk_eval(fluid.dygraph.Layer):
dtype="int64")
num_correct_chunks = self._helper.create_variable_for_type_inference(
dtype="int64")
this_input = {"Inference": input, "Label": label[0]}
if seq_length:
this_input["SeqLength"] = seq_length[0]
this_input = {"Inference": input, "Label": label}
if seq_length is not None:
this_input["SeqLength"] = seq_length
self._helper.append_op(
type='chunk_eval',
inputs=this_input,
......@@ -144,9 +154,10 @@ class ChunkEval(Metric):
int(math.ceil((num_labels - 1) / 2.0)), "IOB")
self.reset()
def add_metric_op(self, pred, label, *args, **kwargs):
crf_decode = pred[0]
lengths = pred[2]
def add_metric_op(self, *args):
crf_decode = args[0]
lengths = args[2]
label = args[3]
(num_infer_chunks, num_label_chunks,
num_correct_chunks) = self.chunk_eval(
input=crf_decode, label=label, seq_length=lengths)
......@@ -194,18 +205,16 @@ def main(args):
place = set_device(args.device)
fluid.enable_dygraph(place) if args.dynamic else None
inputs = [
Input(
[None, args.max_seq_len], 'int64', name='words'), Input(
[None, args.max_seq_len], 'int64', name='target'), Input(
[None], 'int64', name='length')
]
labels = [Input([None, args.max_seq_len], 'int64', name='labels')]
inputs = [Input([None, None], 'int64', name='words'),
Input([None], 'int64', name='length'),
Input([None, None], 'int64', name='target')]
labels = [Input([None, None], 'int64', name='labels')]
feed_list = None if args.dynamic else [x.forward() for x in inputs + labels]
dataset = LacDataset(args)
train_path = os.path.join(args.data, "train.tsv")
test_path = os.path.join(args.data, "test.tsv")
train_path = args.train_file
test_path = args.test_file
train_generator = create_lexnet_data_generator(
args, reader=dataset, file_name=train_path, place=place, mode="train")
......@@ -233,8 +242,11 @@ def main(args):
labels=labels,
device=args.device)
if args.resume is not None:
model.load(args.resume)
if args.init_from_checkpoint:
model.load(args.init_from_checkpoint)
if args.init_from_pretrain_model:
model.load(args.init_from_pretrain_model, reset_optimizer=True)
model.fit(train_dataset,
test_dataset,
......@@ -246,9 +258,7 @@ def main(args):
if __name__ == '__main__':
parser = argparse.ArgumentParser("LAC training")
parser.add_argument(
"-dir", "--data", default=None, type=str, help='path to LAC dataset')
parser = argparse.ArgumentParser("sequence tagging training")
parser.add_argument(
"-wd",
"--word_dict_path",
......@@ -301,23 +311,41 @@ if __name__ == '__main__':
"--max_seq_len", default=126, type=int, help="max sequence length")
parser.add_argument(
"-n", "--num_devices", default=1, type=int, help="number of devices")
parser.add_argument(
"-r",
"--resume",
default=None,
type=str,
help="checkpoint path to resume")
parser.add_argument(
"-o",
"--save_dir",
default="./model",
type=str,
help="save model path")
parser.add_argument(
"--init_from_checkpoint",
default=None,
type=str,
help="load init model parameters")
parser.add_argument(
"--init_from_pretrain_model",
default=None,
type=str,
help="load pretrain model parameters")
parser.add_argument(
"-sf", "--save_freq", default=1, type=int, help="save frequency")
parser.add_argument(
"-ef", "--eval_freq", default=1, type=int, help="eval frequency")
parser.add_argument(
"--output_file", default="predict.result", type=str, help="predict output file")
parser.add_argument(
"--predict_file", default="./data/infer.tsv", type=str, help="predict output file")
parser.add_argument(
"--test_file", default="./data/test.tsv", type=str, help="predict and eval output file")
parser.add_argument(
"--train_file", default="./data/train.tsv", type=str, help="train file")
parser.add_argument(
"--mode", default="train", type=str, help="train|test|predict")
args = parser.parse_args()
print(args)
use_gpu = True if args.device == "gpu" else False
check_gpu(use_gpu)
check_version()
main(args)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
import paddle.fluid as fluid
__all__ = ['check_gpu', 'check_version']
def check_gpu(use_gpu):
"""
Log error and exit when set use_gpu=true in paddlepaddle
cpu version.
"""
err = "Config use_gpu cannot be set as true while you are " \
"using paddlepaddle cpu version ! \nPlease try: \n" \
"\t1. Install paddlepaddle-gpu to run model on GPU \n" \
"\t2. Set use_gpu as false in config file to run " \
"model on CPU"
try:
if use_gpu and not fluid.is_compiled_with_cuda():
print(err)
sys.exit(1)
except Exception as e:
pass
def check_version():
"""
Log error and exit when the installed version of paddlepaddle is
not satisfied.
"""
err = "PaddlePaddle version 1.6 or higher is required, " \
"or a suitable develop version is satisfied as well. \n" \
"Please make sure the version is good with your code." \
try:
fluid.require_version('1.7.0')
except Exception as e:
print(err)
sys.exit(1)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
import paddle.fluid as fluid
__all__ = ['chunk_count', "build_chunk"]
def build_chunk(data_list, id2label_dict):
"""
Assembly entity
"""
tag_list = [id2label_dict.get(str(id)) for id in data_list]
ner_dict = {}
ner_str = ""
ner_start = 0
for i in range(len(tag_list)):
tag = tag_list[i]
if tag == u"O":
if i != 0:
key = "%d_%d" % (ner_start, i - 1)
ner_dict[key] = ner_str
ner_start = i
ner_str = tag
elif tag.endswith(u"B"):
if i != 0:
key = "%d_%d" % (ner_start, i - 1)
ner_dict[key] = ner_str
ner_start = i
ner_str = tag.split('-')[0]
elif tag.endswith(u"I"):
if tag.split('-')[0] != ner_str:
if i != 0:
key = "%d_%d" % (ner_start, i - 1)
ner_dict[key] = ner_str
ner_start = i
ner_str = tag.split('-')[0]
return ner_dict
def chunk_count(infer_numpy, label_numpy, seq_len, id2label_dict):
"""
calculate num_correct_chunks num_error_chunks total_num for metrics
"""
num_infer_chunks, num_label_chunks, num_correct_chunks = 0, 0, 0
assert infer_numpy.shape[0] == label_numpy.shape[0]
for i in range(infer_numpy.shape[0]):
infer_list = infer_numpy[i][: seq_len[i]]
label_list = label_numpy[i][: seq_len[i]]
infer_dict = build_chunk(infer_list, id2label_dict)
num_infer_chunks += len(infer_dict)
label_dict = build_chunk(label_list, id2label_dict)
num_label_chunks += len(label_dict)
for key in infer_dict:
if key in label_dict and label_dict[key] == infer_dict[key]:
num_correct_chunks += 1
return num_infer_chunks, num_label_chunks, num_correct_chunks
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册