Unverified commit ac5b7971, authored by bbking, committed by GitHub

Update PaddleNLP emotion-detection for new codestyle (#3444)

* update code for argparser

* add inference_model

* fix inference_model

* update create model for inference

* fix model nets conflict with Senta

* update readme
Parent a2beb8ae
# Emotion Detection (EmoTect)
* [Model Introduction](#model-introduction)
* [Quick Start](#quick-start)
* [Advanced Usage](#advanced-usage)
* [Release Notes](#release-notes)
* [Authors](#authors)
* [How to Contribute](#how-to-contribute)
## Model Introduction
Emotion Detection (EmoTect) focuses on identifying user emotion in intelligent dialogue scenarios: given user text from a conversation, it automatically classifies the emotion category and outputs a confidence score. The emotion categories are positive, negative, and neutral.
Emotion detection applies to chat, customer service, and many other scenarios. It helps businesses monitor dialogue quality and improve the user experience of their products, and it can also be used to analyze customer-service quality and reduce the cost of manual quality inspection. You can try it online at [AI开放平台-对话情绪识别](http://ai.baidu.com/tech/nlp_apply/emotion_detection).
For evaluation, we tested on Baidu's in-house test sets (covering chitchat and customer service) and the NLPCC2014 Weibo emotion dataset; the results are shown in the table below. We have also open-sourced a model trained by Baidu on massive data; after finetuning on dialogue corpora, it achieves even better results.
| Model | Chitchat | Customer Service | Weibo |
| :------| :------ | :------ | :------ |
@@ -19,32 +28,148 @@
## Quick Start
### Installation
1. Install PaddlePaddle
This project requires PaddlePaddle Fluid 1.3.2 or later; see the [installation guide](http://www.paddlepaddle.org/#quick-start).
2. Install the code
Clone the repository to your local machine:
```shell
git clone https://github.com/PaddlePaddle/models.git
cd models/PaddleNLP/emotion_detection
```
3. Environment dependencies
See the PaddlePaddle [installation notes](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/beginners_guide/install/index_cn.html) for environment requirements.
### Code Structure
The main files of this project are organized as follows:
```text
.
├── config.json # configuration file
├── config.py # configuration loading interface
├── inference_model.py # script that saves the inference model
├── reader.py # data reading interface
├── run_classifier.py # main entry point: training, inference, evaluation
├── run.sh # script to run training, inference, evaluation
├── run_ernie_classifier.py # main entry point for the ERNIE-based model
├── run_ernie.sh # script for ERNIE-based training, inference, evaluation
├── utils.py # utility functions
```
### Data Preparation
#### Custom Data
Each line of the data has two columns separated by a tab ('\t'): the first is the emotion label (0 = negative, 1 = neutral, 2 = positive) and the second is the space-tokenized Chinese text. Files are UTF-8 encoded. Example:
```text
label text_a
0 谁 骂人 了 ? 我 从来 不 骂人 , 我 骂 的 都 不是 人 , 你 是 人 吗 ?
1 我 有事 等会儿 就 回来 和 你 聊
2 我 见到 你 很高兴 谢谢 你 帮 我
```
Note: the PaddleNLP project provides a tokenization preprocessing script (under the preprocess directory). Usage:
```shell
python tokenizer.py --test_data_dir ./test.txt.utf8 --batch_size 1 > test.txt.utf8.seg
```
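For reference, here is a minimal sketch of reading this format in Python. It is illustrative only; ```reader.py``` in this project is the authoritative implementation, and the function name ```load_examples``` is ours:
```python
import io

def load_examples(path):
    """Parse the two-column TSV format described above."""
    examples = []
    with io.open(path, "r", encoding="utf8") as fin:
        next(fin)  # skip the "label\ttext_a" header line
        for line in fin:
            cols = line.rstrip("\n").split("\t")
            if len(cols) != 2:
                continue  # skip malformed lines
            label, tokens = int(cols[0]), cols[1].split(" ")
            examples.append((label, tokens))
    return examples
```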
#### Public Dataset
We provide a labeled, pre-tokenized chatbot dataset. Run the download script ```sh download_data.sh```; when it succeeds it creates a ```data``` directory with the following layout:
```text
.
├── train.tsv # training set
├── dev.tsv # validation set
├── test.tsv # test set
├── infer.tsv # data to predict
├── vocab.txt # vocabulary
```
### Single-Machine Training
With the sample dataset, run the command below to train on the training set (train.tsv) and validate on the dev set (dev.tsv).
```shell
# TextCNN model
sh run.sh train
```
The ```run.sh``` script trains the TextCNN model by default. To select a different model, pass a ```model_type``` argument to ```run.sh``` or change ```model_type``` in ```config.json```. Run the following to see all options and their descriptions:
```shell
python run_classifier.py -h
"""
# sample output
Running type options:
--do_train DO_TRAIN Whether to perform training. Default: False.
...
Model config options:
--model_type {bow_net,cnn_net,lstm_net,bilstm_net,gru_net,textcnn_net}
Model type to run the task. Default: textcnn_net.
--init_checkpoint INIT_CHECKPOINT
Init checkpoint to resume training from. Default: .
--save_checkpoint_dir SAVE_CHECKPOINT_DIR
Directory path to save checkpoints Default: .
...
"""
```
Parameter precedence in this project: command-line arguments > ```config.json``` > built-in defaults. After training finishes, model directories named ```step_xxx``` are created under ```./save_models/textcnn```.
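Concretely, this precedence is implemented by the ```PDConfig``` class in ```config.py```: JSON values override the argparse defaults, and command-line flags override both. A sketch of the flow:
```python
from config import PDConfig

config = PDConfig(json_file="./config.json")  # config.json overrides built-in defaults
config.build()                                # command-line flags override config.json
config.print_arguments()
print(config.model_type)                      # e.g. "textcnn_net" unless overridden
```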
### Model Evaluation
Based on a trained model, run the command below to evaluate it on the test set (test.tsv):
```shell
# TextCNN model
sh run.sh eval
"""
# sample output
Load model from ./save_models/textcnn/step_756
Final test result:
[test evaluation] avg loss: 0.339021, avg acc: 0.869691, elapsed time: 0.123983 s
"""
```
The default model is ```./save_models/textcnn/step_756```; change the init_checkpoint parameter in ```run.sh``` to evaluate the model from a different step.
### Model Inference
With an existing model, you can predict on unlabeled data (infer.tsv) to obtain the predicted label and the per-label probabilities.
```shell
# TextCNN model
sh run.sh infer
"""
# sample output
Load model from ./save_models/textcnn/step_756
1 0.000776 0.998341 0.000883
0 0.987223 0.003371 0.009406
1 0.000365 0.998635 0.001001
1 0.000455 0.998125 0.001420
"""
```
### Pretrained Models
We have open-sourced emotion detection models trained on massive data (based on TextCNN, ERNIE, etc.) that can be used directly. Two download options are provided.
**Option 1**: via the PaddleHub command-line tool (see the PaddleHub [installation guide](https://github.com/PaddlePaddle/PaddleHub))
```shell
mkdir pretrain_models && cd pretrain_models
hub download emotion_detection_textcnn --output_path ./
hub download emotion_detection_ernie_finetune --output_path ./
tar xvf emotion_detection_textcnn-1.0.0.tar.gz
@@ -52,48 +177,39 @@ tar xvf emotion_detection_ernie_finetune-1.0.0.tar.gz
```
**Option 2**: run the download script directly
```shell
sh download_model.sh
```
Both options save the pretrained TextCNN and ERNIE models under the ```pretrain_models``` directory; point the ```init_checkpoint``` parameter in ```run.sh``` at them for evaluation or inference.
### Service Deployment
To deploy the model online, you can use the ```inference_model.py``` script to prune the model, saving only the network parameters and the pruned inference program. Run:
```shell
sh run.sh save_inference_model
```
Usage of the pruned model is detailed in ```inference_model.py```; test it with the following command:
```shell
python inference_model.py
```
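Once saved, the pruned model can be loaded back for prediction. A sketch, mirroring ```test_inference_model``` in ```inference_model.py``` and using the project's default paths and filenames:
```python
import paddle.fluid as fluid

exe = fluid.Executor(fluid.CPUPlace())
# Load the pruned inference program and its parameters.
program, feed_names, fetch_targets = fluid.io.load_inference_model(
    dirname="./inference_model",
    executor=exe,
    model_filename="model.pdmodel",
    params_filename="params.pdparams")
# feed_names holds the input variable names (word ids, sequence lengths);
# fetch_targets holds the per-label probability tensor.
```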
#### Server Deployment
See PaddlePaddle's official [server-side deployment](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/advanced_usage/deploy/inference/index_cn.html) documentation for bringing the model online.
## Advanced Usage
### Background
The input to the emotion detection task is a piece of user text; the output is the detected emotion category (negative, neutral, or positive). This is a classic three-way short-text classification task.
### Model Overview
For this task, the project open-sources a series of classification models that can be selected via configuration:
@@ -104,102 +220,127 @@ sh run_ernie.sh infer
+ BI-LSTM: a single-layer bidirectional LSTM that better captures sentence-level semantic features;
+ ERNIE: Baidu's general-purpose text semantic representation model, trained on massive data and prior knowledge, finetuned on the emotion classification dataset.
### Custom Models
You can build a custom model to suit your needs, as follows (a sketch of a minimal custom network follows the list):
1. Define your own network
   You can define your own model in ```models/classification/nets.py``` by simply adding a new function; suppose the new function is named ```user_net```.
2. Update the model configuration
   Change ```model_type``` in ```config.json``` to your ```user_net```.
3. Train the model
   To run training, evaluation, and inference, update the model, data, and vocabulary paths in ```run.sh``` and ```run_ernie.sh```, then launch them through the ```run.sh``` script.
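For illustration, a minimal ```user_net``` might look like the sketch below. The BOW-style body is ours, but the signature mirrors ```textcnn_net``` in ```models/classification/nets.py```:
```python
import paddle.fluid as fluid

def user_net(data, seq_len, label, dict_dim, emb_dim=128, hid_dim=128,
             class_dim=3, is_prediction=False):
    """Illustrative custom net: embedding -> sum pooling -> fc -> softmax."""
    emb = fluid.layers.embedding(input=data, size=[dict_dim, emb_dim])
    emb = fluid.layers.sequence_unpad(emb, length=seq_len)  # drop padding
    pooled = fluid.layers.sequence_pool(input=emb, pool_type='sum')
    fc = fluid.layers.fc(input=pooled, size=hid_dim, act="tanh")
    prediction = fluid.layers.fc(input=fc, size=class_dim, act="softmax")
    if is_prediction:
        return prediction
    cost = fluid.layers.cross_entropy(input=prediction, label=label)
    avg_cost = fluid.layers.mean(x=cost)
    return avg_cost, prediction
```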
### Finetuning with ERNIE
ERNIE is Baidu's general-purpose text semantic representation model, trained on massive data and prior knowledge; finetuning on top of ERNIE improves emotion detection accuracy.
#### Model Training
First download the ERNIE model:
```shell
mkdir -p pretrain_models/ernie
cd pretrain_models/ernie
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/ERNIE_stable-1.0.1.tar.gz -O ERNIE_stable-1.0.1.tar.gz
tar -zxvf ERNIE_stable-1.0.1.tar.gz
```
Then set the ```init_checkpoint``` parameter of the train function in ```run_ernie.sh```, and run:
```shell
#--init_checkpoint ./pretrain_models/ernie
sh run_ernie.sh train
```
Training uses GPU by default; trained models are saved under ```./save_models/ernie/```, named ```step_xxx```.
#### Model Evaluation
Based on the training results, choose the best step, set the ```init_checkpoint``` parameter of the eval function in ```run_ernie.sh```, and run:
```shell
#--init_checkpoint ./save_models/ernie/step_907
sh run_ernie.sh eval
'''
# sample output
W0820 14:59:47.811139 334 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0820 14:59:47.815557 334 device_context.cc:267] device: 0, cuDNN Version: 7.3.
Load model from ./save_models/ernie/step_907
Final validation result:
[test evaluation] avg loss: 0.260597, ave acc: 0.907336, elapsed time: 2.383077 s
'''
```
#### Model Inference
Set the ```init_checkpoint``` parameter of the infer function in ```run_ernie.sh```, and run:
```shell
#--init_checkpoint ./save_models/ernie/step_907
sh run_ernie.sh infer
'''
# sample output
Load model from ./save_models/ernie/step_907
Final test result:
1 0.000803 0.998870 0.000326
0 0.976585 0.021535 0.001880
1 0.000572 0.999153 0.000275
1 0.001113 0.998502 0.000385
'''
```
### Finetuning with ERNIE via PaddleHub
We also provide the option of loading ERNIE through PaddleHub, PaddlePaddle's pretrained-model management tool, which can load a pretrained model in a single line of code and simplifies model use and transfer learning. See [PaddleHub](https://github.com/PaddlePaddle/PaddleHub) for more details.
Note: this option requires PaddleHub, installed with
```shell
pip install paddlehub
```
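The one-line loading referred to above is roughly the following (a sketch against the PaddleHub 1.x API; the module name is what this project uses for ERNIE):
```python
import paddlehub as hub

# Downloads (on first use) and loads the pretrained ERNIE module.
module = hub.Module(name="ernie")
```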
To use this option, update the configuration in ```run_ernie.sh``` as follows:
```shell
# In the train() function, set the --use_paddle_hub option
--use_paddle_hub true
```
Run the following to finetune:
```shell
sh run_ernie.sh train
```
After finetuning, to run eval or infer you need to update the configuration in ```run_ernie.sh``` as follows:
```shell
# In the eval() and infer() functions, set the --use_paddle_hub option
--use_paddle_hub true
```
Run the following for eval and infer:
```shell
sh run_ernie.sh eval
sh run_ernie.sh infer
```
## Release Notes
2019/08/26: standardized configuration handling, refactored the module's data-processing code, and restructured the README for usability.
2019/06/13: added the option of loading ERNIE via PaddleHub.
## Authors
- [chenbingjin](https://github.com/chenbjin)
- [wuzewu](https://github.com/nepeplwu)
## How to Contribute
If you can fix an issue or add a new feature, feel free to submit a PR. If the PR is accepted, we will score the contribution by quality and difficulty (0-5, higher is better); once you accumulate 10 points, you can contact us for an interview opportunity or a recommendation letter.
{
"task_name": "emotion_detection",
"model_type": "textcnn_net",
"vocab_size": 240465
"num_labels": 3,
"vocab_size": 240465,
"vocab_path": "./data/vocab.txt",
"data_dir": "./data",
"inference_model_dir": "./inference_model",
"save_checkpoint_dir": "",
"init_checkpoint": "",
"lr": 0.02,
"epoch": 10,
"batch_size": 64
}
@@ -19,35 +19,172 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import six
import json
import argparse
def str2bool(value):
"""
String to Boolean
"""
# argparse does not parse strings like "true"/"False" into Python
# booleans directly, so map them explicitly
return value.lower() in ("true", "t", "1")
class ArgumentGroup(object):
"""
Argument Class
"""
def __init__(self, parser, title, des):
self._group = parser.add_argument_group(title=title, description=des)
def add_arg(self, name, dtype, default, help, **kwargs):
"""
Add argument
"""
dtype = str2bool if dtype == bool else dtype
self._group.add_argument(
"--" + name,
default=default,
type=dtype,
help=help + ' Default: %(default)s.',
**kwargs)
class PDConfig(object):
"""
A high-level api for handling argument configs.
"""
def __init__(self, json_file=""):
"""
Init function for PDConfig.
json_file: the path to the json configure file.
"""
assert isinstance(json_file, str)
self.args = None
self.arg_config = {}
parser = argparse.ArgumentParser()
run_type_g = ArgumentGroup(parser, "Running type options", "")
run_type_g.add_arg("do_train", bool, False, "Whether to perform training.")
run_type_g.add_arg("do_val", bool, False, "Whether to perform evaluation.")
run_type_g.add_arg("do_infer", bool, False, "Whether to perform inference.")
run_type_g.add_arg("do_save_inference_model", bool, False, "Whether to perform save inference model.")
model_g = ArgumentGroup(parser, "Model config options", "")
model_g.add_arg("model_type", str, "cnn_net", "Model type to run the task.",
choices=["bow_net","cnn_net", "lstm_net", "bilstm_net", "gru_net", "textcnn_net"])
model_g.add_arg("num_labels", int, 3 , "Number of labels for classification")
model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
model_g.add_arg("save_checkpoint_dir", str, None, "Directory path to save checkpoints")
model_g.add_arg("inference_model_dir", str, None, "Directory path to save inference model")
data_g = ArgumentGroup(parser, "Data config options", "")
data_g.add_arg("data_dir", str, None, "Directory path to training data.")
data_g.add_arg("vocab_path", str, None, "Vocabulary path.")
data_g.add_arg("vocab_size", str, None, "Vocabulary size.")
data_g.add_arg("max_seq_len", int, 128, "Number of words of the longest sequence.")
train_g = ArgumentGroup(parser, "Training config options", "")
train_g.add_arg("lr", float, 0.002, "The Learning rate value for training.")
train_g.add_arg("epoch", int, 10, "Number of epoches for training.")
train_g.add_arg("use_cuda", bool, False, "If set, use GPU for training.")
train_g.add_arg("batch_size", int, 256, "Total examples' number in batch for training.")
train_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
train_g.add_arg("save_steps", int, 1000, "The steps interval to save checkpoints.")
train_g.add_arg("validation_steps", int, 1000, "The steps interval to evaluate model performance.")
train_g.add_arg("random_seed", int, 0, "Random seed.")
log_g = ArgumentGroup(parser, "Logging options", "")
log_g.add_arg("verbose", bool, False, "Whether to output verbose log")
log_g.add_arg("task_name", str, None, "The name of task to perform emotion detection")
log_g.add_arg('enable_ce', bool, False, 'If set, run the task with continuous evaluation logs.')
custom_g = ArgumentGroup(parser, "Customize options", "")
self.custom_g = custom_g
self.parser = parser
self.arglist = [a.dest for a in self.parser._actions]
self.json_config = None
if json_file != "":
self.load_json(json_file)
def load_json(self, file_path):
"""load json config """
if not os.path.exists(file_path):
raise IOError("the json file %s does not exist." % file_path)
try:
with open(file_path, "r") as fin:
self.json_config = json.load(fin)
except Exception as e:
raise IOError("Error in parsing json config file '%s'" % file_path)
for name in self.json_config:
# use `six.string_types` but not `str` for compatible with python2 and python3
if not isinstance(self.json_config[name], (int, float, bool, six.string_types)):
continue
if name in self.arglist:
self.set_default(name, self.json_config[name])
else:
self.custom_g.add_arg(name,
type(self.json_config[name]),
self.json_config[name],
"customized options")
def set_default(self, name, value):
for arg in self.parser._actions:
if arg.dest == name:
arg.default = value
def build(self):
self.args = self.parser.parse_args()
self.arg_config = vars(self.args)
def print_arguments(self):
print('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(self.arg_config)):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
def add_arg(self, name, dtype, default, descrip):
self.custom_g.add_arg(name, dtype, default, descrip)
def __add__(self, new_arg):
assert isinstance(new_arg, list) or isinstance(new_arg, tuple)
assert len(new_arg) >= 3
assert self.args is None
name = new_arg[0]
dtype = new_arg[1]
dvalue = new_arg[2]
desc = new_arg[3] if len(new_arg) == 4 else "Description is not provided."
self.add_arg(name, dtype, dvalue, desc)
return self
def __getattr__(self, name):
if name in self.arg_config:
return self.arg_config[name]
if self.json_config and name in self.json_config:
return self.json_config[name]
raise AttributeError("The argument %s is not defined." % name)
if __name__ == '__main__':
pd_config = PDConfig('config.json')
pd_config += ("my_age", int, 18, "I am forever 18.")
pd_config.build()
pd_config.print_arguments()
print(pd_config.use_cuda)
print(pd_config.model_type)
# -*- encoding: utf8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import sys
sys.path.append("../")
import paddle
import paddle.fluid as fluid
import numpy as np
from models.model_check import check_cuda
from config import PDConfig
from run_classifier import create_model
import utils
def do_save_inference_model(args):
if args.use_cuda:
dev_count = fluid.core.get_cuda_device_count()
place = fluid.CUDAPlace(0)
else:
dev_count = int(os.environ.get('CPU_NUM', 1))
place = fluid.CPUPlace()
test_prog = fluid.default_main_program()
startup_prog = fluid.default_startup_program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
infer_pyreader, probs, feed_target_names = create_model(
args,
pyreader_name='infer_reader',
num_labels=args.num_labels,
is_prediction=True)
test_prog = test_prog.clone(for_test=True)
exe = fluid.Executor(place)
exe.run(startup_prog)
assert args.init_checkpoint, "init_checkpoint is required to save an inference model"
utils.init_checkpoint(exe, args.init_checkpoint, test_prog)
fluid.io.save_inference_model(
args.inference_model_dir,
feeded_var_names=feed_target_names,
target_vars=[probs],
executor=exe,
main_program=test_prog,
model_filename="model.pdmodel",
params_filename="params.pdparams")
print("save inference model at %s" % (args.inference_model_dir))
def test_inference_model(args, texts):
if args.use_cuda:
dev_count = fluid.core.get_cuda_device_count()
place = fluid.CUDAPlace(0)
else:
dev_count = int(os.environ.get('CPU_NUM', 1))
place = fluid.CPUPlace()
test_prog = fluid.default_main_program()
startup_prog = fluid.default_startup_program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
infer_pyreader, probs, feed_target_names = create_model(
args,
pyreader_name='infer_reader',
num_labels=args.num_labels,
is_prediction=True)
test_prog = test_prog.clone(for_test=True)
exe = fluid.Executor(place)
exe.run(startup_prog)
assert (args.inference_model_dir)
infer_program, feed_names, fetch_targets = fluid.io.load_inference_model(
dirname=args.inference_model_dir,
executor=exe,
model_filename="model.pdmodel",
params_filename="params.pdparams")
data = []
seq_lens = []
for query in texts:
wids = utils.query2ids(args.vocab_path, query)
wids, seq_len = utils.pad_wid(wids, args.max_seq_len)
data.append(wids)
seq_lens.append(seq_len)
batch_size = len(data)
data = np.array(data).reshape((batch_size, args.max_seq_len, 1))
seq_lens = np.array(seq_lens).reshape((batch_size, 1))
pred = exe.run(infer_program,
feed={
feed_names[0]:data,
feed_names[1]:seq_lens},
fetch_list=fetch_targets,
return_numpy=True)
for probs in pred[0]:
print("%d\t%f\t%f\t%f" % (np.argmax(probs), probs[0], probs[1], probs[2]))
if __name__ == "__main__":
args = PDConfig(json_file="./config.json")
args.build()
args.print_arguments()
check_cuda(args.use_cuda)
if args.do_save_inference_model:
do_save_inference_model(args)
else:
texts = [u"我 讨厌 你 , 哼哼 哼 。 。", u"我 喜欢 你 , 爱 你 哟"]
test_inference_model(args, texts)
@@ -29,43 +29,45 @@ class EmoTectProcessor(object):
Processor class for data convertors for EmoTect.
"""
def __init__(self, data_dir, vocab_path, random_seed=None, max_seq_len=128):
self.data_dir = data_dir
self.vocab = load_vocab(vocab_path)
self.num_examples = {"train": -1, "dev": -1, "test": -1, "infer": -1}
np.random.seed(random_seed)
self.max_seq_len = max_seq_len
def get_train_examples(self, data_dir, epoch, max_seq_len):
"""
Load training examples
"""
return data_reader(
os.path.join(self.data_dir, "train.tsv"), self.vocab,
self.num_examples, "train", epoch)
self.num_examples, "train", epoch, max_seq_len)
def get_dev_examples(self, data_dir, epoch, max_seq_len):
"""
Load dev examples
"""
return data_reader(
os.path.join(self.data_dir, "dev.tsv"), self.vocab,
self.num_examples, "dev")
self.num_examples, "dev", epoch, max_seq_len)
def get_test_examples(self, data_dir, epoch, max_seq_len):
"""
Load test examples
"""
return data_reader(
os.path.join(self.data_dir, "test.tsv"), self.vocab,
self.num_examples, "test")
self.num_examples, "test", epoch, max_seq_len)
def get_infer_examples(self, data_dir, epoch, max_seq_len):
"""
Load infer querys
"""
return data_reader(
os.path.join(self.data_dir, "infer.tsv"), self.vocab,
self.num_examples, "infer")
self.num_examples, "infer", epoch, max_seq_len)
def get_labels(self):
"""
@@ -95,16 +97,16 @@
"""
if phase == "train":
return paddle.batch(
self.get_train_examples(self.data_dir, epoch, self.max_seq_len), batch_size)
elif phase == "dev":
return paddle.batch(
self.get_dev_examples(self.data_dir, epoch, self.max_seq_len), batch_size)
elif phase == "test":
return paddle.batch(
self.get_test_examples(self.data_dir, epoch, self.max_seq_len), batch_size)
elif phase == "infer":
return paddle.batch(
self.get_infer_examples(self.data_dir, epoch, self.max_seq_len), batch_size)
else:
raise ValueError(
"Unknown phase, which should be in ['train', 'dev', 'test', 'infer']."
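For reference, a sketch of how this processor is typically wired up (it mirrors what run_classifier.py does; the paths are the project defaults):
```python
import reader

# Build a batched reader over the downloaded dataset.
processor = reader.EmoTectProcessor(data_dir="./data",
                                    vocab_path="./data/vocab.txt",
                                    random_seed=0,
                                    max_seq_len=128)
train_generator = processor.data_generator(batch_size=64, phase='train', epoch=5)
for batch in train_generator():
    pass  # each element is a (word_ids, label, seq_len) tuple
```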
#!/bin/bash
export FLAGS_enable_parallel_graph=1
export FLAGS_sync_nccl_allreduce=1
export CUDA_VISIBLE_DEVICES=0
export FLAGS_fraction_of_gpu_memory_to_use=0.95
TASK_NAME='emotion_detection'
DATA_PATH=./data/
VOCAB_PATH=./data/vocab.txt
CKPT_PATH=./save_models/textcnn
MODEL_PATH=./save_models/textcnn/step_756
# run_train on train.tsv and do_val on dev.tsv
train() {
python run_classifier.py \
--task_name ${TASK_NAME} \
--use_cuda false \
--do_train true \
--do_val true \
--batch_size 64 \
--data_dir ${DATA_PATH} \
--vocab_path ${VOCAB_PATH} \
--save_checkpoint_dir ${CKPT_PATH} \
--save_steps 200 \
--validation_steps 200 \
--epoch 5 \
--lr 0.002 \
--skip_steps 200
}
# run_eval on test.tsv
evaluate() {
python run_classifier.py \
--task_name ${TASK_NAME} \
--use_cuda false \
--do_val true \
--batch_size 128 \
--data_dir ${DATA_PATH} \
--vocab_path ${VOCAB_PATH} \
--init_checkpoint ${MODEL_PATH}
}
# run_infer on infer.tsv
infer() {
python run_classifier.py \
--task_name ${TASK_NAME} \
--use_cuda false \
--do_infer true \
--batch_size 32 \
--data_dir ${DATA_PATH} \
--vocab_path ${VOCAB_PATH} \
--init_checkpoint ${MODEL_PATH}
}
# run_save_inference_model
save_inference_model() {
python inference_model.py \
--use_cuda false \
--do_save_inference_model true \
--init_checkpoint ${MODEL_PATH} \
--inference_model_dir ./inference_model
}
main() {
@@ -64,13 +58,16 @@ main() {
infer)
infer "$@";
;;
save_inference_model)
save_inference_model "$@";
;;
help)
echo "Usage: ${BASH_SOURCE} {train|eval|infer}";
echo "Usage: ${BASH_SOURCE} {train|eval|infer|save_inference_model}";
return 0;
;;
*)
echo "unsupport command [${cmd}]";
echo "Usage: ${BASH_SOURCE} {train|eval|infer}";
echo "Usage: ${BASH_SOURCE} {train|eval|infer|save_inference_model}";
return 1;
;;
esac
@@ -11,6 +11,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Emotion Detection Task
"""
@@ -21,7 +22,6 @@ from __future__ import print_function
import os
import time
import multiprocessing
import sys
sys.path.append("../")
@@ -32,107 +32,55 @@ import numpy as np
from models.classification import nets
from models.model_check import check_cuda
from config import PDConfig
import reader
import utils
def create_model(args,
pyreader_name,
num_labels,
is_prediction=False):
"""
Create Model for Emotion Detection
"""
data = fluid.layers.data(name="words", shape=[-1, args.max_seq_len, 1], dtype="int64")
label = fluid.layers.data(name="label", shape=[-1, 1], dtype="int64")
seq_len = fluid.layers.data(name="seq_len", shape=[-1, 1], dtype="int64")
if is_prediction:
pyreader = fluid.io.PyReader(
feed_list=[data, seq_len],
capacity=16,
iterable=False,
return_list=False)
else:
pyreader = fluid.io.PyReader(
feed_list=[data, label, seq_len],
capacity=16,
iterable=False,
return_list=False)
if emotect_config['model_type'] == "cnn_net":
if args.model_type == "cnn_net":
network = nets.cnn_net
elif emotect_config['model_type'] == "bow_net":
elif args.model_type == "bow_net":
network = nets.bow_net
elif emotect_config['model_type'] == "lstm_net":
elif args.model_type == "lstm_net":
network = nets.lstm_net
elif emotect_config['model_type'] == "bilstm_net":
elif args.model_type == "bilstm_net":
network = nets.bilstm_net
elif emotect_config['model_type'] == "gru_net":
elif args.model_type == "gru_net":
network = nets.gru_net
elif emotect_config['model_type'] == "textcnn_net":
elif args.model_type == "textcnn_net":
network = nets.textcnn_net
else:
raise ValueError("Unknown network type!")
if is_prediction:
probs = network(data, seq_len, None, args.vocab_size, class_dim=num_labels, is_prediction=True)
return pyreader, probs, [data.name, seq_len.name]
avg_loss, probs = network(data, seq_len, label, args.vocab_size, class_dim=num_labels)
num_seqs = fluid.layers.create_tensor(dtype='int64')
accuracy = fluid.layers.accuracy(input=probs, label=label, total=num_seqs)
return pyreader, avg_loss, accuracy, num_seqs
@@ -187,8 +135,6 @@ def main(args):
"""
Main Function
"""
if args.use_cuda:
place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
else:
@@ -196,11 +142,11 @@
exe = fluid.Executor(place)
task_name = args.task_name.lower()
processor = reader.EmoTectProcessor(data_dir=args.data_dir,
vocab_path=args.vocab_path,
random_seed=args.random_seed)
#num_labels = len(processor.get_labels())
num_labels = args.num_labels
if not (args.do_train or args.do_val or args.do_infer):
raise ValueError("For args `do_train`, `do_val` and `do_infer`, at "
@@ -229,9 +175,8 @@
train_pyreader, loss, accuracy, num_seqs = create_model(
args,
pyreader_name='train_reader',
num_labels=num_labels,
is_prediction=False)
sgd_optimizer = fluid.optimizer.Adagrad(learning_rate=args.lr)
sgd_optimizer.minimize(loss)
@@ -243,27 +188,41 @@
(lower_mem, upper_mem, unit))
if args.do_val:
if args.do_train:
test_data_generator = processor.data_generator(
batch_size=args.batch_size,
phase='dev',
epoch=1)
else:
test_data_generator = processor.data_generator(
batch_size=args.batch_size,
phase='test',
epoch=1)
test_prog = fluid.Program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, loss, accuracy, num_seqs = create_model(
args,
pyreader_name='test_reader',
num_labels=num_labels,
is_prediction=False)
test_prog = test_prog.clone(for_test=True)
if args.do_infer:
infer_data_generator = processor.data_generator(
batch_size=args.batch_size,
phase='infer',
epoch=1)
test_prog = fluid.Program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
infer_pyreader, probs, _ = create_model(
args,
pyreader_name='infer_reader',
num_labels=num_labels,
is_prediction=True)
test_prog = test_prog.clone(for_test=True)
exe.run(startup_prog)
@@ -280,11 +239,15 @@
if args.do_train:
train_exe = exe
train_pyreader.decorate_sample_list_generator(train_data_generator)
else:
train_exe = None
if args.do_val:
test_exe = exe
test_pyreader.decorate_sample_list_generator(test_data_generator)
if args.do_infer:
test_exe = exe
infer_pyreader.decorate_sample_list_generator(infer_data_generator)
if args.do_train:
train_pyreader.start()
@@ -332,24 +295,24 @@
time_begin = time.time()
if steps % args.save_steps == 0:
save_path = os.path.join(args.save_checkpoint_dir, "step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
if steps % args.validation_steps == 0:
# evaluate on dev set
if args.do_val:
evaluate(test_exe, test_prog, test_pyreader,
[loss.name, accuracy.name, num_seqs.name],
"dev")
except fluid.core.EOFException:
save_path = os.path.join(args.output_dir, "step_" + str(steps))
print("final step: %d " % steps)
if args.do_val:
evaluate(test_exe, test_prog, test_pyreader,
[loss.name, accuracy.name, num_seqs.name],
"dev")
save_path = os.path.join(args.save_checkpoint_dir, "step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
train_pyreader.reset()
break
@@ -372,19 +335,17 @@
# evaluate on test set
if not args.do_train and args.do_val:
print("Final test result:")
evaluate(test_exe, test_prog, test_pyreader,
[loss.name, accuracy.name, num_seqs.name], "test")
[loss.name, accuracy.name, num_seqs.name],
"test")
# infer
if args.do_infer:
print("Final infer result:")
infer(test_exe, test_prog, infer_pyreader,
[probs.name],
"infer")
def get_cards():
@@ -396,6 +357,8 @@ def get_cards():
if __name__ == "__main__":
args = PDConfig('config.json')
args.build()
args.print_arguments()
check_cuda(args.use_cuda)
main(args)
#!/bin/bash
export FLAGS_sync_nccl_allreduce=1
export CUDA_VISIBLE_DEVICES=0
MODEL_PATH=./pretrain_models/ernie
TASK_DATA_PATH=./data
CKPT_PATH=./save_models/ernie
@@ -18,7 +18,7 @@ train() {
--train_set ${TASK_DATA_PATH}/train.tsv \
--dev_set ${TASK_DATA_PATH}/dev.tsv \
--vocab_path ${MODEL_PATH}/vocab.txt \
--save_checkpoint_dir ${CKPT_PATH} \
--save_steps 500 \
--validation_steps 50 \
--epoch 3 \
@@ -38,7 +38,7 @@ evaluate() {
--do_val true \
--use_paddle_hub false \
--batch_size 32 \
--init_checkpoint ${CKPT_PATH}/step_907 \
--test_set ${TASK_DATA_PATH}/test.tsv \
--vocab_path ${MODEL_PATH}/vocab.txt \
--max_seq_len 64 \
@@ -54,7 +54,7 @@ infer() {
--do_infer true \
--use_paddle_hub false \
--batch_size 32 \
--init_checkpoint ${CKPT_PATH}/step_907 \
--infer_set ${TASK_DATA_PATH}/infer.tsv \
--vocab_path ${MODEL_PATH}/vocab.txt \
--max_seq_len 64 \
@@ -11,6 +11,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Emotion Detection Task, based on ERNIE
"""
@@ -34,27 +35,27 @@ from preprocess.ernie import task_reader
from models.representation import ernie
from models.model_check import check_cuda
import utils
import config
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
model_g = config.ArgumentGroup(parser, "model", "model configuration and paths.")
model_g.add_arg("ernie_config_path", str, None, "Path to the json file for ernie model config.")
model_g.add_arg("senta_config_path", str, None, "Path to the json file for senta model config.")
model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
model_g.add_arg("output_dir", str, "checkpoints", "Path to save checkpoints")
model_g.add_arg("save_checkpoint_dir", str, "checkpoints", "Path to save checkpoints")
model_g.add_arg("use_paddle_hub", bool, False, "Whether to load ERNIE using PaddleHub")
train_g = config.ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 10, "Number of epoches for training.")
train_g.add_arg("save_steps", int, 10000, "The steps interval to save checkpoints.")
train_g.add_arg("validation_steps", int, 1000, "The steps interval to evaluate model performance.")
train_g.add_arg("lr", float, 0.002, "The Learning rate value for training.")
log_g = config.ArgumentGroup(parser, "logging", "logging related")
log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
log_g.add_arg("verbose", bool, False, "Whether to output verbose log")
data_g = config.ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
data_g.add_arg("data_dir", str, None, "Directory path to training data.")
data_g.add_arg("vocab_path", str, None, "Vocabulary path.")
data_g.add_arg("batch_size", int, 256, "Total examples' number in batch for training.")
@@ -69,7 +70,7 @@ data_g.add_arg("label_map_config", str, None, "label_map_path.")
data_g.add_arg("do_lower_case", bool, True,
"Whether to lower case the input text. Should be True for uncased models and False for cased models.")
run_type_g = config.ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("use_cuda", bool, False, "If set, use GPU for training.")
run_type_g.add_arg("task_name", str, None, "The name of task to perform sentiment classification.")
run_type_g.add_arg("do_train", bool, False, "Whether to perform training.")
@@ -348,7 +349,7 @@ def main(args):
time_begin = time.time()
if steps % args.save_steps == 0:
save_path = os.path.join(args.output_dir, "step_" + str(steps))
save_path = os.path.join(args.save_checkpoint_dir, "step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
if steps % args.validation_steps == 0:
@@ -367,7 +368,7 @@
"dev")
except fluid.core.EOFException:
save_path = os.path.join(args.output_dir, "step_" + str(steps))
save_path = os.path.join(args.save_checkpoint_dir, "step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
train_pyreader.reset()
break
@@ -11,6 +11,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
EmoTect utilities.
"""
@@ -20,56 +21,14 @@ from __future__ import print_function
import io
import os
import sys
import six
import random
import paddle
import paddle.fluid as fluid
import numpy as np
def init_checkpoint(exe, init_checkpoint_path, main_program):
"""
Init CheckPoint
@@ -93,11 +52,34 @@ def init_checkpoint(exe, init_checkpoint_path, main_program):
print("Load model from {}".format(init_checkpoint_path))
def word2id(word_dict, query):
"""
Convert word sequence into id list
"""
unk_id = len(word_dict)
wids = [word_dict[w] if w in word_dict else unk_id
for w in query.strip().split(" ")]
return wids
def pad_wid(wids, max_seq_len=128, pad_id=0):
"""
Padding data to max_seq_len
"""
seq_len = len(wids)
if seq_len < max_seq_len:
for i in range(max_seq_len - seq_len):
wids.append(pad_id)
else:
wids = wids[:max_seq_len]
seq_len = max_seq_len
return wids, seq_len
def data_reader(file_path, word_dict, num_examples, phrase, epoch, max_seq_len):
"""
Data reader, which convert word sequence into id list
"""
all_data = []
with io.open(file_path, "r", encoding='utf8') as fin:
for line in fin:
@@ -105,24 +87,20 @@ def data_reader(file_path, word_dict, num_examples, phrase, epoch=1):
continue
if phrase == "infer":
cols = line.strip().split("\t")
query = cols[-1] if len(cols) != 1 else cols[0]
wids = word2id(word_dict, query)
wids, seq_len = pad_wid(wids, max_seq_len)
all_data.append((wids, seq_len))
else:
cols = line.strip().split("\t")
if len(cols) != 2:
sys.stderr.write("[NOTICE] Error Format Line!")
continue
label = int(cols[0])
wids = [
word_dict[x] if x in word_dict else unk_id
for x in cols[1].split(" ")
]
all_data.append((wids, label))
query = cols[1].strip()
wids = word2id(word_dict, query)
wids, seq_len = pad_wid(wids, max_seq_len)
all_data.append((wids, label, seq_len))
num_examples[phrase] = len(all_data)
if phrase == "infer":
@@ -131,8 +109,8 @@
"""
Infer reader function
"""
for wids, seq_len in all_data:
yield wids, seq_len
return reader
@@ -143,8 +121,8 @@
for idx in range(epoch):
if phrase == "train":
random.shuffle(all_data)
for wids, label, seq_len in all_data:
yield wids, label, seq_len
return reader
@@ -162,3 +140,22 @@
wid += 1
vocab["<unk>"] = len(vocab)
return vocab
def print_arguments(args):
"""
print arguments
"""
print('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
def query2ids(vocab_path, query):
"""
Convert query to id list according to the given vocab
"""
vocab = load_vocab(vocab_path)
wids = word2id(vocab, query)
return wids
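A small usage sketch of these helpers, as ```test_inference_model``` in ```inference_model.py``` uses them (the query string is just an example):
```python
# Convert a space-tokenized query into a fixed-length id sequence.
wids = query2ids("./data/vocab.txt", u"我 见到 你 很高兴")
wids, seq_len = pad_wid(wids, max_seq_len=128, pad_id=0)
# wids now holds exactly 128 ids; seq_len is the original token count.
```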
@@ -133,7 +133,7 @@ def bilstm_net(data,
input=data,
size=[dict_dim, emb_dim],
param_attr=fluid.ParamAttr(learning_rate=emb_lr))
emb = fluid.layers.sequence_unpad(emb, length=seq_len)
fc0 = fluid.layers.fc(input=emb, size=hid_dim * 4)
@@ -200,15 +200,15 @@ def gru_net(data,
def textcnn_net(data,
seq_len,
label,
dict_dim,
emb_dim=128,
hid_dim=128,
hid_dim2=96,
class_dim=2,
win_sizes=None,
is_prediction=False):
"""
Textcnn_net
"""