PaddlePaddle / models
Commit 8f61de12
Authored Aug 09, 2018 by Yibing Liu
Committed by GitHub on Aug 09, 2018
Merge pull request #1121 from kuke/sequence_tagging_for_ner_ce
Enable ce for sequence_tagging_for_ner
Parents: 5efb3d3d, c443f9b8
Showing 10 changed files with 504 additions and 24 deletions (+504 -24)

fluid/sequence_tagging_for_ner/.run_ce.sh  +5 -0
fluid/sequence_tagging_for_ner/README.md  +1 -5
fluid/sequence_tagging_for_ner/_ce.py  +48 -0
fluid/sequence_tagging_for_ner/data/download.sh  +17 -0
fluid/sequence_tagging_for_ner/data/target.txt  +9 -0
fluid/sequence_tagging_for_ner/data/test  +128 -0
fluid/sequence_tagging_for_ner/data/train  +139 -0
fluid/sequence_tagging_for_ner/reader.py  +66 -0
fluid/sequence_tagging_for_ner/train.py  +44 -19
fluid/sequence_tagging_for_ner/utils.py  +47 -0
fluid/sequence_tagging_for_ner/.run_ce.sh
0 → 100755
###!/bin/bash
####This file is only used for continuous evaluation.

export CE_MODE_X=1
python train.py | python _ce.py
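The script only exports `CE_MODE_X`; the train.py changes in this commit then test for the variable's presence to skip shuffling and fix the random seed. A minimal sketch of that kind of gating in plain Python follows. `make_reader` and its `env` parameter are hypothetical names for illustration and are not part of the commit.

```python
import os
import random


def make_reader(samples, env=os.environ):
    """Return a reader creator that shuffles unless CE_MODE_X is set in env."""

    def reader():
        items = list(samples)
        # Only the presence of the variable matters, not its value.
        if "CE_MODE_X" not in env:
            random.shuffle(items)
        for item in items:
            yield item

    return reader


# With the variable set, the samples come back in their original order.
ce_reader = make_reader(range(5), env={"CE_MODE_X": "1"})
print(list(ce_reader()))
```

Passing `env` explicitly keeps the sketch testable without touching the real process environment.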
fluid/sequence_tagging_for_ner/README.md
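...
@@ -22,11 +22,7 @@
 ## Data acquisition

-Please refer to the data acquisition instructions in the PaddlePaddle v2 [Named Entity Recognition](https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/README.md) example: copy that example's data folder into this example's directory and run the download.sh script in it to obtain the training and test data.
-
-## Obtaining the common scripts
-
-Copy the data-reading file [reader.py](https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/reader.py) and the file [utils.py](https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/utils.py), which contains common functions such as dictionary loading, from the PaddlePaddle v2 [Named Entity Recognition](https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/README.md) example into this directory. This example uses both scripts.
+To obtain the complete dataset, follow the instructions in the PaddlePaddle v2 [Named Entity Recognition](https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/README.md) example. The sample data for this example can also be obtained by running data/download.sh.

 ## Training
...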
fluid/sequence_tagging_for_ner/_ce.py
0 → 100644
####this file is only used for continuous evaluation test!

import os
import sys
sys.path.append(os.environ['ceroot'])
from kpi import CostKpi, DurationKpi, AccKpi

#### NOTE kpi.py should shared in models in some way!!!!

train_acc_kpi = AccKpi('train_precision', 0.005, actived=True)
test_acc_kpi = CostKpi('test_precision', 0.005, actived=True)
train_duration_kpi = DurationKpi('train_duration', 0.05, actived=True)

tracking_kpis = [
    train_acc_kpi,
    test_acc_kpi,
    train_duration_kpi,
]


def parse_log(log):
    for line in log.split('\n'):
        fs = line.strip().split('\t')
        print(fs)
        if len(fs) == 3 and fs[0] == 'kpis':
            print("-----%s" % fs)
            kpi_name = fs[1]
            kpi_value = float(fs[2])
            yield kpi_name, kpi_value


def log_to_ce(log):
    kpi_tracker = {}
    for kpi in tracking_kpis:
        kpi_tracker[kpi.name] = kpi

    for (kpi_name, kpi_value) in parse_log(log):
        print(kpi_name, kpi_value)
        kpi_tracker[kpi_name].add_record(kpi_value)
        kpi_tracker[kpi_name].persist()


if __name__ == '__main__':
    log = sys.stdin.read()
    print("*****")
    print(log)
    print("****")
    log_to_ce(log)
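To make the log contract concrete: `parse_log` keeps only lines with exactly three tab-separated fields whose first field is `kpis`. The sketch below is a trimmed, hypothetical variant (`parse_kpi_lines`) without the debug prints and without the `kpi` module. The sample log and its numbers are invented for illustration.

```python
def parse_kpi_lines(log):
    """Yield (name, value) for every tab-separated 'kpis' record in log."""
    for line in log.split("\n"):
        fields = line.strip().split("\t")
        if len(fields) == 3 and fields[0] == "kpis":
            yield fields[1], float(fields[2])


# Hypothetical log excerpt; ordinary training output is ignored.
sample_log = "\n".join([
    "Pass 50, Batch 0, Cost 1.25",
    "kpis\ttrain_precision\t0.5",
    "kpis\ttest_precision\t0.25",
    "kpis\ttrain_duration\t12.0",
])

records = list(parse_kpi_lines(sample_log))
print(records)
```

A record separated by spaces rather than tabs would be skipped silently, so the producer side has to emit real tab characters.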
fluid/sequence_tagging_for_ner/data/download.sh
0 → 100644
if [ -f assignment2.zip ]; then
    echo "data exist"
    exit 0
else
    wget http://cs224d.stanford.edu/assignment2/assignment2.zip
fi

if [ $? -eq 0 ]; then
    unzip assignment2.zip
    cp assignment2_release/data/ner/wordVectors.txt ./data
    cp assignment2_release/data/ner/vocab.txt ./data
    rm -rf assignment2_release
else
    echo "download data error!" >> /dev/stderr
    exit 1
fi
fluid/sequence_tagging_for_ner/data/target.txt
0 → 100644
B-LOC
I-LOC
B-MISC
I-MISC
B-ORG
I-ORG
B-PER
I-PER
O
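For orientation, the nine labels map to integer ids by line position, which is how `load_dict` in utils.py reads this file. The sketch below rebuilds that mapping in memory rather than from disk. The `chunk_types` helper line is an illustrative assumption and does not appear in the commit.

```python
# Labels in the same order as data/target.txt.
labels = [
    "B-LOC", "I-LOC", "B-MISC", "I-MISC",
    "B-ORG", "I-ORG", "B-PER", "I-PER", "O",
]

# Line index becomes the label id.
label_dict = {label: idx for idx, label in enumerate(labels)}

# Every entity type has a B- and an I- variant; "O" marks non-entities.
chunk_types = sorted({label[2:] for label in labels if label != "O"})

print(label_dict["O"], chunk_types)
```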
fluid/sequence_tagging_for_ner/data/test
0 → 100644
CRICKET NNP I-NP O
- : O O
LEICESTERSHIRE NNP I-NP I-ORG
TAKE NNP I-NP O
OVER IN I-PP O
AT NNP I-NP O
TOP NNP I-NP O
AFTER NNP I-NP O
INNINGS NNP I-NP O
VICTORY NN I-NP O
. . O O

LONDON NNP I-NP I-LOC
1996-08-30 CD I-NP O

West NNP I-NP I-MISC
Indian NNP I-NP I-MISC
all-rounder NN I-NP O
Phil NNP I-NP I-PER
Simmons NNP I-NP I-PER
took VBD I-VP O
four CD I-NP O
for IN I-PP O
38 CD I-NP O
on IN I-PP O
Friday NNP I-NP O
as IN I-PP O
Leicestershire NNP I-NP I-ORG
beat VBD I-VP O
Somerset NNP I-NP I-ORG
by IN I-PP O
an DT I-NP O
innings NN I-NP O
and CC O O
39 CD I-NP O
runs NNS I-NP O
in IN I-PP O
two CD I-NP O
days NNS I-NP O
to TO I-VP O
take VB I-VP O
over IN I-PP O
at IN B-PP O
the DT I-NP O
head NN I-NP O
of IN I-PP O
the DT I-NP O
county NN I-NP O
championship NN I-NP O
. . O O

Their PRP$ I-NP O
stay NN I-NP O
on IN I-PP O
top NN I-NP O
, , O O
though RB I-ADVP O
, , O O
may MD I-VP O
be VB I-VP O
short-lived JJ I-ADJP O
as IN I-PP O
title NN I-NP O
rivals NNS I-NP O
Essex NNP I-NP I-ORG
, , O O
Derbyshire NNP I-NP I-ORG
and CC I-NP O
Surrey NNP I-NP I-ORG
all DT O O
closed VBD I-VP O
in RP I-PRT O
on IN I-PP O
victory NN I-NP O
while IN I-SBAR O
Kent NNP I-NP I-ORG
made VBD I-VP O
up RP I-PRT O
for IN I-PP O
lost VBN I-NP O
time NN I-NP O
in IN I-PP O
their PRP$ I-NP O
rain-affected JJ I-NP O
match NN I-NP O
against IN I-PP O
Nottinghamshire NNP I-NP I-ORG
. . O O

After IN I-PP O
bowling VBG I-NP O
Somerset NNP I-NP I-ORG
out RP I-PRT O
for IN I-PP O
83 CD I-NP O
on IN I-PP O
the DT I-NP O
opening NN I-NP O
morning NN I-NP O
at IN I-PP O
Grace NNP I-NP I-LOC
Road NNP I-NP I-LOC
, , O O
Leicestershire NNP I-NP I-ORG
extended VBD I-VP O
their PRP$ I-NP O
first JJ I-NP O
innings NN I-NP O
by IN I-PP O
94 CD I-NP O
runs VBZ I-VP O
before IN I-PP O
being VBG I-VP O
bowled VBD I-VP O
out RP I-PRT O
for IN I-PP O
296 CD I-NP O
with IN I-PP O
England NNP I-NP I-LOC
discard VBP I-VP O
Andy NNP I-NP I-PER
Caddick NNP I-NP I-PER
taking VBG I-VP O
three CD I-NP O
for IN I-PP O
83 CD I-NP O
. . O O
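Each data line follows the CoNLL-2003 layout of four whitespace-separated columns: word, part-of-speech tag, chunk tag, and named-entity tag. reader.py in this commit uses only the first and last columns. A small sketch of splitting such lines follows; `split_columns` is a hypothetical helper, not part of the commit.

```python
def split_columns(line):
    """Split one data line into (word, pos_tag, chunk_tag, ner_tag)."""
    word, pos_tag, chunk_tag, ner_tag = line.split()
    return word, pos_tag, chunk_tag, ner_tag


# Two lines taken from data/test.
rows = [
    split_columns("LEICESTERSHIRE NNP I-NP I-ORG"),
    split_columns("TAKE NNP I-NP O"),
]

# Words whose entity tag is not "O" belong to a named entity.
entities = [word for word, _, _, ner_tag in rows if ner_tag != "O"]
print(entities)
```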
fluid/sequence_tagging_for_ner/data/train
0 → 100644
EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O
to TO I-VP O
boycott VB I-VP O
British JJ I-NP I-MISC
lamb NN I-NP O
. . O O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER

BRUSSELS NNP I-NP I-LOC
1996-08-22 CD I-NP O

The DT I-NP O
European NNP I-NP I-ORG
Commission NNP I-NP I-ORG
said VBD I-VP O
on IN I-PP O
Thursday NNP I-NP O
it PRP B-NP O
disagreed VBD I-VP O
with IN I-PP O
German JJ I-NP I-MISC
advice NN I-NP O
to TO I-PP O
consumers NNS I-NP O
to TO I-VP O
shun VB I-VP O
British JJ I-NP I-MISC
lamb NN I-NP O
until IN I-SBAR O
scientists NNS I-NP O
determine VBP I-VP O
whether IN I-SBAR O
mad JJ I-NP O
cow NN I-NP O
disease NN I-NP O
can MD I-VP O
be VB I-VP O
transmitted VBN I-VP O
to TO I-PP O
sheep NN I-NP O
. . O O

Germany NNP I-NP I-LOC
's POS B-NP O
representative NN I-NP O
to TO I-PP O
the DT I-NP O
European NNP I-NP I-ORG
Union NNP I-NP I-ORG
's POS B-NP O
veterinary JJ I-NP O
committee NN I-NP O
Werner NNP I-NP I-PER
Zwingmann NNP I-NP I-PER
said VBD I-VP O
on IN I-PP O
Wednesday NNP I-NP O
consumers NNS I-NP O
should MD I-VP O
buy VB I-VP O
sheepmeat NN I-NP O
from IN I-PP O
countries NNS I-NP O
other JJ I-ADJP O
than IN I-PP O
Britain NNP I-NP I-LOC
until IN I-SBAR O
the DT I-NP O
scientific JJ I-NP O
advice NN I-NP O
was VBD I-VP O
clearer JJR I-ADJP O
. . O O

" " O O
We PRP I-NP O
do VBP I-VP O
n't RB I-VP O
support VB I-VP O
any DT I-NP O
such JJ I-NP O
recommendation NN I-NP O
because IN I-SBAR O
we PRP I-NP O
do VBP I-VP O
n't RB I-VP O
see VB I-VP O
any DT I-NP O
grounds NNS I-NP O
for IN I-PP O
it PRP I-NP O
, , O O
" " O O
the DT I-NP O
Commission NNP I-NP I-ORG
's POS B-NP O
chief JJ I-NP O
spokesman NN I-NP O
Nikolaus NNP I-NP I-PER
van NNP I-NP I-PER
der FW I-NP I-PER
Pas NNP I-NP I-PER
told VBD I-VP O
a DT I-NP O
news NN I-NP O
briefing NN I-NP O
. . O O

He PRP I-NP O
said VBD I-VP O
further JJ I-NP O
scientific JJ I-NP O
study NN I-NP O
was VBD I-VP O
required VBN I-VP O
and CC O O
if IN I-SBAR O
it PRP I-NP O
was VBD I-VP O
found VBN I-VP O
that IN I-SBAR O
action NN I-NP O
was VBD I-VP O
needed VBN I-VP O
it PRP I-NP O
should MD I-VP O
be VB I-VP O
taken VBN I-VP O
by IN I-PP O
the DT I-NP O
European NNP I-NP I-ORG
Union NNP I-NP I-ORG
. . O O
fluid/sequence_tagging_for_ner/reader.py
0 → 100644
"""
Conll03 dataset.
"""

from utils import *

__all__ = ["data_reader"]


def canonicalize_digits(word):
    if any([c.isalpha() for c in word]): return word
    word = re.sub("\d", "DG", word)
    if word.startswith("DG"):
        word = word.replace(",", "")  # remove thousands separator
    return word


def canonicalize_word(word, wordset=None, digits=True):
    word = word.lower()
    if digits:
        if (wordset != None) and (word in wordset): return word
        word = canonicalize_digits(word)  # try to canonicalize numbers
    if (wordset == None) or (word in wordset):
        return word
    else:
        return "UUUNKKK"  # unknown token


def data_reader(data_file, word_dict, label_dict):
    """
    The dataset can be obtained according to http://www.clips.uantwerpen.be/conll2003/ner/.
    It returns a reader creator, each sample in the reader includes:
    word id sequence, label id sequence and raw sentence.

    :return: reader creator
    :rtype: callable
    """

    def reader():
        UNK_IDX = word_dict["UUUNKKK"]

        sentence = []
        labels = []
        with open(data_file, "r") as f:
            for line in f:
                if len(line.strip()) == 0:
                    if len(sentence) > 0:
                        word_idx = [
                            word_dict.get(
                                canonicalize_word(w, word_dict), UNK_IDX)
                            for w in sentence
                        ]
                        mark = [1 if w[0].isupper() else 0 for w in sentence]
                        label_idx = [label_dict[l] for l in labels]
                        yield word_idx, mark, label_idx
                    sentence = []
                    labels = []
                else:
                    segs = line.strip().split()
                    sentence.append(segs[0])
                    # transform I-TYPE to BIO schema
                    if segs[-1] != "O" and (len(labels) == 0 or
                                            labels[-1][1:] != segs[-1][1:]):
                        labels.append("B" + segs[-1][1:])
                    else:
                        labels.append(segs[-1])

    return reader
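The "transform I-TYPE to BIO schema" step is easy to misread, so here is the same rule in isolation. The data files tag every entity token as `I-TYPE`, and the reader rewrites the first token of each entity to `B-TYPE`. `to_bio` is a hypothetical helper written for this note and is not part of the commit.

```python
def to_bio(tags):
    """Convert I-TYPE style tags to BIO using the same rule as reader.py."""
    labels = []
    for tag in tags:
        # An entity starts when the tag is not "O" and the previous label
        # is missing or has a different type suffix.
        starts_entity = tag != "O" and (
            len(labels) == 0 or labels[-1][1:] != tag[1:])
        labels.append("B" + tag[1:] if starts_entity else tag)
    return labels


# Tags of "West Indian all-rounder Phil Simmons" from data/test.
tags = ["I-MISC", "I-MISC", "O", "I-PER", "I-PER"]
bio = to_bio(tags)
print(bio)
```

Under this rule, two adjacent entities of the same type with no `O` between them are merged into one span, because the input carries no boundary information.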
fluid/sequence_tagging_for_ner/train.py
 import os
 import math
+import time
 import numpy as np
-import paddle.v2 as paddle
+import paddle
 import paddle.fluid as fluid
 import reader
...
@@ -24,12 +25,19 @@ def test(exe, chunk_evaluator, inference_program, test_data, place):
     return chunk_evaluator.eval(exe)


-def main(train_data_file, test_data_file, vocab_file, target_file, emb_file,
-         model_save_dir, num_passes, use_gpu, parallel):
+def main(train_data_file,
+         test_data_file,
+         vocab_file,
+         target_file,
+         emb_file,
+         model_save_dir,
+         num_passes,
+         use_gpu,
+         parallel,
+         batch_size=200):
     if not os.path.exists(model_save_dir):
         os.mkdir(model_save_dir)

-    BATCH_SIZE = 200
     word_dict = load_dict(vocab_file)
     label_dict = load_dict(target_file)
...
@@ -58,55 +66,71 @@ def main(train_data_file, test_data_file, vocab_file, target_file, emb_file,
     test_target = chunk_evaluator.metrics + chunk_evaluator.states
     inference_program = fluid.io.get_inference_program(test_target)

-    train_reader = paddle.batch(
-        paddle.reader.shuffle(
+    if "CE_MODE_X" not in os.environ:
+        train_reader = paddle.batch(
+            paddle.reader.shuffle(
+                reader.data_reader(train_data_file, word_dict, label_dict),
+                buf_size=20000),
+            batch_size=batch_size)
+        test_reader = paddle.batch(
+            paddle.reader.shuffle(
+                reader.data_reader(test_data_file, word_dict, label_dict),
+                buf_size=20000),
+            batch_size=batch_size)
+    else:
+        train_reader = paddle.batch(
             reader.data_reader(train_data_file, word_dict, label_dict),
-            buf_size=20000),
-        batch_size=BATCH_SIZE)
-    test_reader = paddle.batch(
-        paddle.reader.shuffle(
+            batch_size=batch_size)
+        test_reader = paddle.batch(
             reader.data_reader(test_data_file, word_dict, label_dict),
-            buf_size=20000),
-        batch_size=BATCH_SIZE)
+            batch_size=batch_size)

     place = fluid.CUDAPlace(0) if use_gpu else fluid.CPUPlace()
     feeder = fluid.DataFeeder(feed_list=[word, mark, target], place=place)
     exe = fluid.Executor(place)

+    if "CE_MODE_X" in os.environ:
+        fluid.default_startup_program().random_seed = 110
     exe.run(fluid.default_startup_program())

     embedding_name = 'emb'
     embedding_param = fluid.global_scope().find_var(embedding_name).get_tensor()
     embedding_param.set(word_vector_values, place)

-    batch_id = 0
     for pass_id in xrange(num_passes):
         chunk_evaluator.reset(exe)
-        for data in train_reader():
+        for batch_id, data in enumerate(train_reader()):
             cost, batch_precision, batch_recall, batch_f1_score = exe.run(
                 fluid.default_main_program(),
                 feed=feeder.feed(data),
                 fetch_list=[avg_cost] + chunk_evaluator.metrics)
             if batch_id % 5 == 0:
+                print(cost)
                 print("Pass " + str(pass_id) + ", Batch " + str(batch_id) +
                       ", Cost " + str(cost[0]) + ", Precision " +
                       str(batch_precision[0]) + ", Recall " +
                       str(batch_recall[0]) + ", F1_score" +
                       str(batch_f1_score[0]))
-            batch_id = batch_id + 1

         pass_precision, pass_recall, pass_f1_score = chunk_evaluator.eval(exe)
         print("[TrainSet] pass_id:" + str(pass_id) + " pass_precision:" + str(
             pass_precision) + " pass_recall:" + str(pass_recall) +
               " pass_f1_score:" + str(pass_f1_score))

-        pass_precision, pass_recall, pass_f1_score = test(
+        test_pass_precision, test_pass_recall, test_pass_f1_score = test(
             exe, chunk_evaluator, inference_program, test_reader, place)
         print("[TestSet] pass_id:" + str(pass_id) + " pass_precision:" + str(
-            pass_precision) + " pass_recall:" + str(pass_recall) +
-              " pass_f1_score:" + str(pass_f1_score))
+            test_pass_precision) + " pass_recall:" + str(test_pass_recall) +
+              " pass_f1_score:" + str(test_pass_f1_score))

         save_dirname = os.path.join(model_save_dir, "params_pass_%d" % pass_id)
         fluid.io.save_inference_model(save_dirname, ['word', 'mark', 'target'],
-                                      [crf_decode], exe)
+                                      crf_decode, exe)
+
+        if ("CE_MODE_X" in os.environ) and (pass_id % 50 == 0):
+            if pass_id > 0:
+                print("kpis\ttrain_precision\t%f" % pass_precision)
+                print("kpis\ttest_precision\t%f" % test_pass_precision)
+                print("kpis\ttrain_duration\t%f" % (time.time() - time_begin))
+            time_begin = time.time()


 if __name__ == "__main__":
...
@@ -118,5 +142,6 @@ if __name__ == "__main__":
         emb_file="data/wordVectors.txt",
         model_save_dir="models",
         num_passes=1000,
+        batch_size=1,
         use_gpu=False,
         parallel=False)
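The KPI block added to the training loop starts a timer at pass 0 and then reports once every 50 passes, measuring the time since the previous report. The sketch below isolates that cadence without Paddle. `run_passes`, its `clock` parameter, and the fake clock are hypothetical names introduced so the behaviour can be checked deterministically.

```python
import time


def run_passes(num_passes, interval=50, clock=time.time):
    """Mimic the reporting cadence: start timing at pass 0, then emit one
    tab-separated duration record every `interval` passes."""
    records = []
    time_begin = None
    for pass_id in range(num_passes):
        # ... training and evaluation for this pass would happen here ...
        if pass_id % interval == 0:
            if pass_id > 0:
                records.append(
                    "kpis\ttrain_duration\t%f" % (clock() - time_begin))
            # Restart the timer so each record covers one interval.
            time_begin = clock()
    return records


# A fake clock that advances by one second per call keeps the demo stable.
ticks = iter(range(100))
records = run_passes(151, clock=lambda: float(next(ticks)))
print(records)
```

With 151 passes and an interval of 50, records are emitted at passes 50, 100 and 150, and nothing is reported for pass 0 because it only starts the timer.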
fluid/sequence_tagging_for_ner/utils.py
0 → 100644
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging
import os
import re
import argparse
import numpy as np
from collections import defaultdict

logger = logging.getLogger("paddle")
logger.setLevel(logging.INFO)


def get_embedding(emb_file='data/wordVectors.txt'):
    """
    Get the trained word vector.
    """
    return np.loadtxt(emb_file, dtype=float)


def load_dict(dict_path):
    """
    Load the word dictionary from the given file.
    Each line of the given file is a word, which can include multiple columns
    seperated by tab.

    This function takes the first column (columns in a line are seperated by
    tab) as key and takes line number of a line as the key (index of the word
    in the dictionary).
    """

    return dict((line.strip().split("\t")[0], idx)
                for idx, line in enumerate(open(dict_path, "r").readlines()))


def load_reverse_dict(dict_path):
    """
    Load the word dictionary from the given file.
    Each line of the given file is a word, which can include multiple columns
    seperated by tab.

    This function takes line number of a line as the key (index of the word in
    the dictionary) and the first column (columns in a line are seperated by
    tab) as the value.
    """
    return dict((idx, line.strip().split("\t")[0])
                for idx, line in enumerate(open(dict_path, "r").readlines()))
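A quick way to see what the two loaders return: the first tab-separated column of each line maps to its zero-based line index, and the reverse loader inverts that. The sketch below is an assumed rewrite, not the committed code. It uses `with` blocks so the file handle is closed and iterates the file directly instead of calling `readlines()`. The three-line dictionary file is made up for the demo.

```python
import os
import tempfile


def load_dict(dict_path):
    """Map the first tab-separated column of each line to its line index."""
    with open(dict_path, "r") as f:
        return {line.strip().split("\t")[0]: idx for idx, line in enumerate(f)}


def load_reverse_dict(dict_path):
    """Map each line index to the first tab-separated column of that line."""
    with open(dict_path, "r") as f:
        return {idx: line.strip().split("\t")[0] for idx, line in enumerate(f)}


# Write a tiny dictionary file; the second line has an extra column.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("B-LOC\nI-LOC\textra\nO\n")
    path = tmp.name

forward = load_dict(path)
backward = load_reverse_dict(path)
os.remove(path)

print(forward, backward)
```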