提交 202a06a2 编写于 作者: Y Yibing Liu

Merge branch 'develop' of https://github.com/PaddlePaddle/models into ctc_decoder_deploy

.DS_Store .DS_Store
*.pyc
group: deprecated-2017Q2
language: cpp language: cpp
cache: ccache cache: ccache
sudo: required sudo: required
......
...@@ -10,6 +10,7 @@ unittest(){ ...@@ -10,6 +10,7 @@ unittest(){
cd $1 > /dev/null cd $1 > /dev/null
if [ -f "setup.sh" ]; then if [ -f "setup.sh" ]; then
sh setup.sh sh setup.sh
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
fi fi
if [ $? != 0 ]; then if [ $? != 0 ]; then
exit 1 exit 1
......
...@@ -47,29 +47,37 @@ PaddlePaddle提供了丰富的运算单元,帮助大家以模块化的方式 ...@@ -47,29 +47,37 @@ PaddlePaddle提供了丰富的运算单元,帮助大家以模块化的方式
- 5.1 [基于 Pairwise 和 Listwise 的排序学习](https://github.com/PaddlePaddle/models/tree/develop/ltr) - 5.1 [基于 Pairwise 和 Listwise 的排序学习](https://github.com/PaddlePaddle/models/tree/develop/ltr)
## 6. 序列标注 ## 6. 深度结构化语义模型
深度结构化语义模型使用DNN模型在一个连续的语义空间中学习文本低纬的向量表示,最终建模两个句子间的语义相似度。
本例中我们演示如何使用 PaddlePaddle实现一个通用的深度结构化语义模型来建模两个字符串间的语义相似度。
模型支持CNN(卷积网络)、FC(全连接网络)、RNN(递归神经网络)等不同的网络结构,以及分类、回归、排序等不同损失函数,采用了比较通用的数据格式,用户替换数据便可以在真实场景中使用。
- 6.1 [深度结构化语义模型](https://github.com/PaddlePaddle/models/tree/develop/dssm)
## 7. 序列标注
给定输入序列,序列标注模型为序列中每一个元素贴上一个类别标签,是自然语言处理领域最基础的任务之一。随着深度学习的不断探索和发展,利用循环神经网络学习输入序列的特征表示,条件随机场(Conditional Random Field, CRF)在特征基础上完成序列标注任务,逐渐成为解决序列标注问题的标配解决方案。 给定输入序列,序列标注模型为序列中每一个元素贴上一个类别标签,是自然语言处理领域最基础的任务之一。随着深度学习的不断探索和发展,利用循环神经网络学习输入序列的特征表示,条件随机场(Conditional Random Field, CRF)在特征基础上完成序列标注任务,逐渐成为解决序列标注问题的标配解决方案。
在序列标注的例子中,我们以命名实体识别(Named Entity Recognition,NER)任务为例,介绍如何训练一个端到端的序列标注模型。 在序列标注的例子中,我们以命名实体识别(Named Entity Recognition,NER)任务为例,介绍如何训练一个端到端的序列标注模型。
- 6.1 [命名实体识别](https://github.com/PaddlePaddle/models/tree/develop/sequence_tagging_for_ner) - 7.1 [命名实体识别](https://github.com/PaddlePaddle/models/tree/develop/sequence_tagging_for_ner)
## 7. 序列到序列学习 ## 8. 序列到序列学习
序列到序列学习实现两个甚至是多个不定长模型之间的映射,有着广泛的应用,包括:机器翻译、智能对话与问答、广告创意语料生成、自动编码(如金融画像编码)、判断多个文本串之间的语义相关性等。 序列到序列学习实现两个甚至是多个不定长模型之间的映射,有着广泛的应用,包括:机器翻译、智能对话与问答、广告创意语料生成、自动编码(如金融画像编码)、判断多个文本串之间的语义相关性等。
在序列到序列学习的例子中,我们以机器翻译任务为例,提供了多种改进模型,供大家学习和使用。包括:不带注意力机制的序列到序列映射模型,这一模型是所有序列到序列学习模型的基础;使用 scheduled sampling 改善 RNN 模型在生成任务中的错误累积问题;带外部记忆机制的神经机器翻译,通过增强神经网络的记忆能力,来完成复杂的序列到序列学习任务。 在序列到序列学习的例子中,我们以机器翻译任务为例,提供了多种改进模型,供大家学习和使用。包括:不带注意力机制的序列到序列映射模型,这一模型是所有序列到序列学习模型的基础;使用 scheduled sampling 改善 RNN 模型在生成任务中的错误累积问题;带外部记忆机制的神经机器翻译,通过增强神经网络的记忆能力,来完成复杂的序列到序列学习任务。
- 7.1 [无注意力机制的编码器解码器模型](https://github.com/PaddlePaddle/models/tree/develop/nmt_without_attention) - 8.1 [无注意力机制的编码器解码器模型](https://github.com/PaddlePaddle/models/tree/develop/nmt_without_attention)
## 8. 图像分类 ## 9. 图像分类
图像相比文字能够提供更加生动、容易理解及更具艺术感的信息,是人们转递与交换信息的重要来源。在图像分类的例子中,我们向大家介绍如何在PaddlePaddle中训练AlexNet、VGG、GoogLeNet和ResNet模型。同时还提供了一个模型转换工具,能够将Caffe训练好的模型文件,转换为PaddlePaddle的模型文件。 图像相比文字能够提供更加生动、容易理解及更具艺术感的信息,是人们转递与交换信息的重要来源。在图像分类的例子中,我们向大家介绍如何在PaddlePaddle中训练AlexNet、VGG、GoogLeNet和ResNet模型。同时还提供了一个模型转换工具,能够将Caffe训练好的模型文件,转换为PaddlePaddle的模型文件。
- 8.1 [将Caffe模型文件转换为PaddlePaddle模型文件](https://github.com/PaddlePaddle/models/tree/develop/image_classification/caffe2paddle) - 9.1 [将Caffe模型文件转换为PaddlePaddle模型文件](https://github.com/PaddlePaddle/models/tree/develop/image_classification/caffe2paddle)
- 8.2 [AlexNet](https://github.com/PaddlePaddle/models/tree/develop/image_classification) - 9.2 [AlexNet](https://github.com/PaddlePaddle/models/tree/develop/image_classification)
- 8.3 [VGG](https://github.com/PaddlePaddle/models/tree/develop/image_classification) - 9.3 [VGG](https://github.com/PaddlePaddle/models/tree/develop/image_classification)
- 8.4 [Residual Network](https://github.com/PaddlePaddle/models/tree/develop/image_classification) - 9.4 [Residual Network](https://github.com/PaddlePaddle/models/tree/develop/image_classification)
## Copyright and License ## Copyright and License
......
# 点击率预估 # 点击率预估
以下是本例目录包含的文件以及对应说明:
```
├── README.md # 本教程markdown 文档
├── dataset.md # 数据集处理教程
├── images # 本教程图片目录
│   ├── lr_vs_dnn.jpg
│   └── wide_deep.png
├── infer.py # 预测脚本
├── network_conf.py # 模型网络配置
├── reader.py # data reader
├── train.py # 训练脚本
└── utils.py # helper functions
└── avazu_data_processer.py # 示例数据预处理脚本
```
## 背景介绍 ## 背景介绍
CTR(Click-Through Rate,点击率预估)\[[1](https://en.wikipedia.org/wiki/Click-through_rate)\] 是用来表示用户点击一个特定链接的概率, CTR(Click-Through Rate,点击率预估)\[[1](https://en.wikipedia.org/wiki/Click-through_rate)\]
通常被用来衡量一个在线广告系统的有效性 是对用户点击一个特定链接的概率做出预测,是广告投放过程中的一个重要环节。精准的点击率预估对在线广告系统收益最大化具有重要意义
当有多个广告位时,CTR 预估一般会作为排序的基准。 当有多个广告位时,CTR 预估一般会作为排序的基准,比如在搜索引擎的广告系统里,当用户输入一个带商业价值的搜索词(query)时,系统大体上会执行下列步骤来展示广告:
比如在搜索引擎的广告系统里,当用户输入一个带商业价值的搜索词(query)时,系统大体上会执行下列步骤来展示广告:
1. 召回满足 query 的广告集合 1. 获取与用户搜索词相关的广告集合
2. 业务规则和相关性过滤 2. 业务规则和相关性过滤
3. 根据拍卖机制和 CTR 排序 3. 根据拍卖机制和 CTR 排序
4. 展出广告 4. 展出广告
...@@ -36,13 +51,11 @@ Figure 1. LR 和 DNN 模型结构对比 ...@@ -36,13 +51,11 @@ Figure 1. LR 和 DNN 模型结构对比
</p> </p>
LR 的蓝色箭头部分可以直接类比到 DNN 中对应的结构,可以看到 LR 和 DNN 有一些共通之处(比如权重累加), LR 的蓝色箭头部分可以直接类比到 DNN 中对应的结构,可以看到 LR 和 DNN 有一些共通之处(比如权重累加),
但前者的模型复杂度在相同输入维度下比后者可能低很多(从某方面讲,模型越复杂,越有潜力学习到更复杂的信息)。 但前者的模型复杂度在相同输入维度下比后者可能低很多(从某方面讲,模型越复杂,越有潜力学习到更复杂的信息);
如果 LR 要达到匹敌 DNN 的学习能力,必须增加输入的维度,也就是增加特征的数量, 如果 LR 要达到匹敌 DNN 的学习能力,必须增加输入的维度,也就是增加特征的数量,
这也就是为何 LR 和大规模的特征工程必须绑定在一起的原因。 这也就是为何 LR 和大规模的特征工程必须绑定在一起的原因。
LR 对于 DNN 模型的优势是对大规模稀疏特征的容纳能力,包括内存和计算量等方面,工业界都有非常成熟的优化方法。 LR 对于 DNN 模型的优势是对大规模稀疏特征的容纳能力,包括内存和计算量等方面,工业界都有非常成熟的优化方法;
而 DNN 模型具有自己学习新特征的能力,一定程度上能够提升特征使用的效率, 而 DNN 模型具有自己学习新特征的能力,一定程度上能够提升特征使用的效率,
这使得 DNN 模型在同样规模特征的情况下,更有可能达到更好的学习效果。 这使得 DNN 模型在同样规模特征的情况下,更有可能达到更好的学习效果。
...@@ -59,10 +72,62 @@ LR 对于 DNN 模型的优势是对大规模稀疏特征的容纳能力,包括 ...@@ -59,10 +72,62 @@ LR 对于 DNN 模型的优势是对大规模稀疏特征的容纳能力,包括
我们直接使用第一种方法做分类任务。 我们直接使用第一种方法做分类任务。
我们使用 Kaggle 上 `Click-through rate prediction` 任务的数据集\[[2](https://www.kaggle.com/c/avazu-ctr-prediction/data)\] 来演示模型。 我们使用 Kaggle 上 `Click-through rate prediction` 任务的数据集\[[2](https://www.kaggle.com/c/avazu-ctr-prediction/data)\] 来演示本例中的模型。
具体的特征处理方法参看 [data process](./dataset.md)
本教程中演示模型的输入格式如下:
```
# <dnn input ids> \t <lr input sparse values> \t click
1 23 190 \t 230:0.12 3421:0.9 23451:0.12 \t 0
23 231 \t 1230:0.12 13421:0.9 \t 1
```
详细的格式描述如下:
- `dnn input ids` 采用 one-hot 表示,只需要填写值为1的ID(注意这里不是变长输入)
- `lr input sparse values` 使用了 `ID:VALUE` 的表示,值部分最好规约到值域 `[-1, 1]`
此外,模型训练时需要传入一个文件描述 dnn 和 lr两个子模型的输入维度,文件的格式如下:
具体的特征处理方法参看 [data process](./dataset.md) ```
dnn_input_dim: <int>
lr_input_dim: <int>
```
其中, `<int>` 表示一个整型数值。
本目录下的 `avazu_data_processor.py` 可以对下载的演示数据集\[[2](#参考文档)\] 进行处理,具体使用方法参考如下说明:
```
usage: avazu_data_processer.py [-h] --data_path DATA_PATH --output_dir
OUTPUT_DIR
[--num_lines_to_detect NUM_LINES_TO_DETECT]
[--test_set_size TEST_SET_SIZE]
[--train_size TRAIN_SIZE]
PaddlePaddle CTR example
optional arguments:
-h, --help show this help message and exit
--data_path DATA_PATH
path of the Avazu dataset
--output_dir OUTPUT_DIR
directory to output
--num_lines_to_detect NUM_LINES_TO_DETECT
number of records to detect dataset's meta info
--test_set_size TEST_SET_SIZE
size of the validation dataset(default: 10000)
--train_size TRAIN_SIZE
size of the trainset (default: 100000)
```
- `data_path` 是待处理的数据路径
- `output_dir` 生成数据的输出路径
- `num_lines_to_detect` 预先扫描数据生成ID的个数,这里是扫描的文件行数
- `test_set_size` 生成测试集的行数
- `train_size` 生成训练姐的行数
## Wide & Deep Learning Model ## Wide & Deep Learning Model
...@@ -201,18 +266,20 @@ trainer.train( ...@@ -201,18 +266,20 @@ trainer.train(
## 运行训练和测试 ## 运行训练和测试
训练模型需要如下步骤: 训练模型需要如下步骤:
1. 下载训练数据,可以使用 Kaggle 上 CTR 比赛的数据\[[2](#参考文献)\] 1. 准备训练数据
1.[Kaggle CTR](https://www.kaggle.com/c/avazu-ctr-prediction/data) 下载 train.gz 1.[Kaggle CTR](https://www.kaggle.com/c/avazu-ctr-prediction/data) 下载 train.gz
2. 解压 train.gz 得到 train.txt 2. 解压 train.gz 得到 train.txt
2. 执行 `python train.py --train_data_path train.txt` ,开始训练 3. `mkdir -p output; python avazu_data_processer.py --data_path train.txt --output_dir output --num_lines_to_detect 1000 --test_set_size 100` 生成演示数据
2. 执行 `python train.py --train_data_path ./output/train.txt --test_data_path ./output/test.txt --data_meta_file ./output/data.meta.txt --model_type=0` 开始训练
上面第2个步骤可以为 `train.py` 填充命令行参数来定制模型的训练过程,具体的命令行参数及用法如下 上面第2个步骤可以为 `train.py` 填充命令行参数来定制模型的训练过程,具体的命令行参数及用法如下
``` ```
usage: train.py [-h] --train_data_path TRAIN_DATA_PATH usage: train.py [-h] --train_data_path TRAIN_DATA_PATH
[--batch_size BATCH_SIZE] [--test_set_size TEST_SET_SIZE] [--test_data_path TEST_DATA_PATH] [--batch_size BATCH_SIZE]
[--num_passes NUM_PASSES] [--num_passes NUM_PASSES]
[--num_lines_to_detact NUM_LINES_TO_DETACT] [--model_output_prefix MODEL_OUTPUT_PREFIX] --data_meta_file
DATA_META_FILE --model_type MODEL_TYPE
PaddlePaddle CTR example PaddlePaddle CTR example
...@@ -220,16 +287,78 @@ optional arguments: ...@@ -220,16 +287,78 @@ optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--train_data_path TRAIN_DATA_PATH --train_data_path TRAIN_DATA_PATH
path of training dataset path of training dataset
--test_data_path TEST_DATA_PATH
path of testing dataset
--batch_size BATCH_SIZE --batch_size BATCH_SIZE
size of mini-batch (default:10000) size of mini-batch (default:10000)
--test_set_size TEST_SET_SIZE
size of the validation dataset(default: 10000)
--num_passes NUM_PASSES --num_passes NUM_PASSES
number of passes to train number of passes to train
--num_lines_to_detact NUM_LINES_TO_DETACT --model_output_prefix MODEL_OUTPUT_PREFIX
number of records to detect dataset's meta info prefix of path for model to store (default:
./ctr_models)
--data_meta_file DATA_META_FILE
path of data meta info file
--model_type MODEL_TYPE
model type, classification: 0, regression 1 (default
classification)
```
- `train_data_path` : 训练集的路径
- `test_data_path` : 测试集的路径
- `num_passes`: 模型训练多少轮
- `data_meta_file`: 参考[数据和任务抽象](### 数据和任务抽象)的描述。
- `model_type`: 模型分类或回归
## 用训好的模型做预测
训好的模型可以用来预测新的数据, 预测数据的格式为
```
# <dnn input ids> \t <lr input sparse values>
1 23 190 \t 230:0.12 3421:0.9 23451:0.12
23 231 \t 1230:0.12 13421:0.9
``` ```
这里与训练数据的格式唯一不同的地方,就是没有标签,也就是训练数据中第3列 `click` 对应的数值。
`infer.py` 的使用方法如下
```
usage: infer.py [-h] --model_gz_path MODEL_GZ_PATH --data_path DATA_PATH
--prediction_output_path PREDICTION_OUTPUT_PATH
[--data_meta_path DATA_META_PATH] --model_type MODEL_TYPE
PaddlePaddle CTR example
optional arguments:
-h, --help show this help message and exit
--model_gz_path MODEL_GZ_PATH
path of model parameters gz file
--data_path DATA_PATH
path of the dataset to infer
--prediction_output_path PREDICTION_OUTPUT_PATH
path to output the prediction
--data_meta_path DATA_META_PATH
path of trainset's meta info, default is ./data.meta
--model_type MODEL_TYPE
model type, classification: 0, regression 1 (default
classification)
```
- `model_gz_path_model`:用 `gz` 压缩过的模型路径
- `data_path` : 需要预测的数据路径
- `prediction_output_paht`:预测输出的路径
- `data_meta_file` :参考[数据和任务抽象](### 数据和任务抽象)的描述。
- `model_type` :分类或回归
示例数据可以用如下命令预测
```
python infer.py --model_gz_path <model_path> --data_path output/infer.txt --prediction_output_path predictions.txt --data_meta_path data.meta.txt
```
最终的预测结果位于 `predictions.txt`
## 参考文献 ## 参考文献
1. <https://en.wikipedia.org/wiki/Click-through_rate> 1. <https://en.wikipedia.org/wiki/Click-through_rate>
2. <https://www.kaggle.com/c/avazu-ctr-prediction/data> 2. <https://www.kaggle.com/c/avazu-ctr-prediction/data>
......
#!/usr/bin/env python
# -*- coding: utf-8 -*-import os
import sys import sys
import csv import csv
import cPickle
import argparse
import numpy as np import numpy as np
from utils import logger, TaskMode
parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
parser.add_argument(
'--data_path', type=str, required=True, help="path of the Avazu dataset")
parser.add_argument(
'--output_dir', type=str, required=True, help="directory to output")
parser.add_argument(
'--num_lines_to_detect',
type=int,
default=500000,
help="number of records to detect dataset's meta info")
parser.add_argument(
'--test_set_size',
type=int,
default=10000,
help="size of the validation dataset(default: 10000)")
parser.add_argument(
'--train_size',
type=int,
default=100000,
help="size of the trainset (default: 100000)")
args = parser.parse_args()
''' '''
The fields of the dataset are: The fields of the dataset are:
...@@ -22,7 +50,7 @@ The fields of the dataset are: ...@@ -22,7 +50,7 @@ The fields of the dataset are:
15. device_conn_type 15. device_conn_type
16. C14-C21 -- anonymized categorical variables 16. C14-C21 -- anonymized categorical variables
We will treat following fields as categorical features: We will treat the following fields as categorical features:
- C1 - C1
- banner_pos - banner_pos
...@@ -40,6 +68,14 @@ and some other features as id features: ...@@ -40,6 +68,14 @@ and some other features as id features:
The `hour` field will be treated as a continuous feature and will be transformed The `hour` field will be treated as a continuous feature and will be transformed
to one-hot representation which has 24 bits. to one-hot representation which has 24 bits.
This script will output 3 files:
1. train.txt
2. test.txt
3. infer.txt
all the files are for demo.
''' '''
feature_dims = {} feature_dims = {}
...@@ -161,6 +197,7 @@ def detect_dataset(path, topn, id_fea_space=10000): ...@@ -161,6 +197,7 @@ def detect_dataset(path, topn, id_fea_space=10000):
NOTE the records should be randomly shuffled first. NOTE the records should be randomly shuffled first.
''' '''
# create categorical statis objects. # create categorical statis objects.
logger.warning('detecting dataset')
with open(path, 'rb') as csvfile: with open(path, 'rb') as csvfile:
reader = csv.DictReader(csvfile) reader = csv.DictReader(csvfile)
...@@ -174,9 +211,6 @@ def detect_dataset(path, topn, id_fea_space=10000): ...@@ -174,9 +211,6 @@ def detect_dataset(path, topn, id_fea_space=10000):
for key, item in fields.items(): for key, item in fields.items():
feature_dims[key] = item.size() feature_dims[key] = item.size()
#for key in id_features:
#feature_dims[key] = id_fea_space
feature_dims['hour'] = 24 feature_dims['hour'] = 24
feature_dims['click'] = 1 feature_dims['click'] = 1
...@@ -184,10 +218,17 @@ def detect_dataset(path, topn, id_fea_space=10000): ...@@ -184,10 +218,17 @@ def detect_dataset(path, topn, id_fea_space=10000):
feature_dims[key] for key in categorial_features + ['hour']) + 1 feature_dims[key] for key in categorial_features + ['hour']) + 1
feature_dims['lr_input'] = np.sum(feature_dims[key] feature_dims['lr_input'] = np.sum(feature_dims[key]
for key in id_features) + 1 for key in id_features) + 1
return feature_dims return feature_dims
def load_data_meta(meta_path):
'''
Load dataset's meta infomation.
'''
feature_dims, fields = cPickle.load(open(meta_path, 'rb'))
return feature_dims, fields
def concat_sparse_vectors(inputs, dims): def concat_sparse_vectors(inputs, dims):
''' '''
Concaterate more than one sparse vectors into one. Concaterate more than one sparse vectors into one.
...@@ -211,67 +252,162 @@ class AvazuDataset(object): ...@@ -211,67 +252,162 @@ class AvazuDataset(object):
''' '''
Load AVAZU dataset as train set. Load AVAZU dataset as train set.
''' '''
TRAIN_MODE = 0
TEST_MODE = 1
def __init__(self, train_path, n_records_as_test=-1): def __init__(self,
train_path,
n_records_as_test=-1,
fields=None,
feature_dims=None):
self.train_path = train_path self.train_path = train_path
self.n_records_as_test = n_records_as_test self.n_records_as_test = n_records_as_test
# task model: 0 train, 1 test self.fields = fields
self.mode = 0 # default is train mode.
self.mode = TaskMode.create_train()
self.categorial_dims = [
feature_dims[key] for key in categorial_features + ['hour']
]
self.id_dims = [feature_dims[key] for key in id_features]
def train(self): def train(self):
self.mode = self.TRAIN_MODE '''
return self._parse(self.train_path, skip_n_lines=self.n_records_as_test) Load trainset.
'''
logger.info("load trainset from %s" % self.train_path)
self.mode = TaskMode.create_train()
with open(self.train_path) as f:
reader = csv.DictReader(f)
def test(self): for row_id, row in enumerate(reader):
self.mode = self.TEST_MODE # skip top n lines
return self._parse(self.train_path, top_n_lines=self.n_records_as_test) if self.n_records_as_test > 0 and row_id < self.n_records_as_test:
continue
def _parse(self, path, skip_n_lines=-1, top_n_lines=-1): rcd = self._parse_record(row)
with open(path, 'rb') as csvfile: if rcd:
reader = csv.DictReader(csvfile) yield rcd
categorial_dims = [ def test(self):
feature_dims[key] for key in categorial_features + ['hour'] '''
] Load testset.
id_dims = [feature_dims[key] for key in id_features] '''
logger.info("load testset from %s" % self.train_path)
self.mode = TaskMode.create_test()
with open(self.train_path) as f:
reader = csv.DictReader(f)
for row_id, row in enumerate(reader): for row_id, row in enumerate(reader):
if skip_n_lines > 0 and row_id < skip_n_lines: # skip top n lines
continue if self.n_records_as_test > 0 and row_id > self.n_records_as_test:
if top_n_lines > 0 and row_id > top_n_lines:
break break
rcd = self._parse_record(row)
if rcd:
yield rcd
def infer(self):
'''
Load inferset.
'''
logger.info("load inferset from %s" % self.train_path)
self.mode = TaskMode.create_infer()
with open(self.train_path) as f:
reader = csv.DictReader(f)
for row_id, row in enumerate(reader):
rcd = self._parse_record(row)
if rcd:
yield rcd
def _parse_record(self, row):
'''
Parse a CSV row and get a record.
'''
record = [] record = []
for key in categorial_features: for key in categorial_features:
record.append(fields[key].gen(row[key])) record.append(self.fields[key].gen(row[key]))
record.append([int(row['hour'][-2:])]) record.append([int(row['hour'][-2:])])
dense_input = concat_sparse_vectors(record, categorial_dims) dense_input = concat_sparse_vectors(record, self.categorial_dims)
record = [] record = []
for key in id_features: for key in id_features:
if 'cross' not in key: if 'cross' not in key:
record.append(fields[key].gen(row[key])) record.append(self.fields[key].gen(row[key]))
else: else:
fea0 = fields[key].cross_fea0 fea0 = self.fields[key].cross_fea0
fea1 = fields[key].cross_fea1 fea1 = self.fields[key].cross_fea1
record.append( record.append(
fields[key].gen_cross_fea(row[fea0], row[fea1])) self.fields[key].gen_cross_fea(row[fea0], row[fea1]))
sparse_input = concat_sparse_vectors(record, id_dims) sparse_input = concat_sparse_vectors(record, self.id_dims)
record = [dense_input, sparse_input] record = [dense_input, sparse_input]
if not self.mode.is_infer():
record.append(list((int(row['click']), ))) record.append(list((int(row['click']), )))
yield record return record
def ids2dense(vec, dim):
return vec
def ids2sparse(vec):
return ["%d:1" % x for x in vec]
if __name__ == '__main__': detect_dataset(args.data_path, args.num_lines_to_detect)
path = 'train.txt' dataset = AvazuDataset(
print detect_dataset(path, 400000) args.data_path,
args.test_set_size,
fields=fields,
feature_dims=feature_dims)
filereader = AvazuDataset(path) output_trainset_path = os.path.join(args.output_dir, 'train.txt')
for no, rcd in enumerate(filereader.train()): output_testset_path = os.path.join(args.output_dir, 'test.txt')
print no, rcd output_infer_path = os.path.join(args.output_dir, 'infer.txt')
if no > 1000: break output_meta_path = os.path.join(args.output_dir, 'data.meta.txt')
with open(output_trainset_path, 'w') as f:
for id, record in enumerate(dataset.train()):
if id and id % 10000 == 0:
logger.info("load %d records" % id)
if id > args.train_size:
break
dnn_input, lr_input, click = record
dnn_input = ids2dense(dnn_input, feature_dims['dnn_input'])
lr_input = ids2sparse(lr_input)
line = "%s\t%s\t%d\n" % (' '.join(map(str, dnn_input)),
' '.join(map(str, lr_input)), click[0])
f.write(line)
logger.info('write to %s' % output_trainset_path)
with open(output_testset_path, 'w') as f:
for id, record in enumerate(dataset.test()):
dnn_input, lr_input, click = record
dnn_input = ids2dense(dnn_input, feature_dims['dnn_input'])
lr_input = ids2sparse(lr_input)
line = "%s\t%s\t%d\n" % (' '.join(map(str, dnn_input)),
' '.join(map(str, lr_input)), click[0])
f.write(line)
logger.info('write to %s' % output_testset_path)
with open(output_infer_path, 'w') as f:
for id, record in enumerate(dataset.infer()):
dnn_input, lr_input = record
dnn_input = ids2dense(dnn_input, feature_dims['dnn_input'])
lr_input = ids2sparse(lr_input)
line = "%s\t%s\n" % (' '.join(map(str, dnn_input)),
' '.join(map(str, lr_input)), )
f.write(line)
if id > args.test_set_size:
break
logger.info('write to %s' % output_infer_path)
with open(output_meta_path, 'w') as f:
lines = [
"dnn_input_dim: %d" % feature_dims['dnn_input'],
"lr_input_dim: %d" % feature_dims['lr_input']
]
f.write('\n'.join(lines))
logger.info('write data meta into %s' % output_meta_path)
# 数据及处理 # 数据及处理
## 数据集介绍 ## 数据集介绍
本教程演示使用Kaggle上CTR任务的数据集\[[3](#参考文献)\]的预处理方法,最终产生本模型需要的格式,详细的数据格式参考[README.md](./README.md)
Wide && Deep Model\[[2](#参考文献)\]的优势是融合稠密特征和大规模稀疏特征,
因此特征处理方面也针对稠密和稀疏两种特征作处理,
其中Deep部分的稠密值全部转化为ID类特征,
通过embedding 来转化为稠密的向量输入;Wide部分主要通过ID的叉乘提升维度。
数据集使用 `csv` 格式存储,其中各个字段内容如下: 数据集使用 `csv` 格式存储,其中各个字段内容如下:
- `id` : ad identifier - `id` : ad identifier
......
...@@ -42,15 +42,30 @@ ...@@ -42,15 +42,30 @@
<div id="markdown" style='display:none'> <div id="markdown" style='display:none'>
# 点击率预估 # 点击率预估
以下是本例目录包含的文件以及对应说明:
```
├── README.md # 本教程markdown 文档
├── dataset.md # 数据集处理教程
├── images # 本教程图片目录
│   ├── lr_vs_dnn.jpg
│   └── wide_deep.png
├── infer.py # 预测脚本
├── network_conf.py # 模型网络配置
├── reader.py # data reader
├── train.py # 训练脚本
└── utils.py # helper functions
└── avazu_data_processer.py # 示例数据预处理脚本
```
## 背景介绍 ## 背景介绍
CTR(Click-Through Rate,点击率预估)\[[1](https://en.wikipedia.org/wiki/Click-through_rate)\] 是用来表示用户点击一个特定链接的概率, CTR(Click-Through Rate,点击率预估)\[[1](https://en.wikipedia.org/wiki/Click-through_rate)\]
通常被用来衡量一个在线广告系统的有效性 是对用户点击一个特定链接的概率做出预测,是广告投放过程中的一个重要环节。精准的点击率预估对在线广告系统收益最大化具有重要意义
当有多个广告位时,CTR 预估一般会作为排序的基准。 当有多个广告位时,CTR 预估一般会作为排序的基准,比如在搜索引擎的广告系统里,当用户输入一个带商业价值的搜索词(query)时,系统大体上会执行下列步骤来展示广告:
比如在搜索引擎的广告系统里,当用户输入一个带商业价值的搜索词(query)时,系统大体上会执行下列步骤来展示广告:
1. 召回满足 query 的广告集合 1. 获取与用户搜索词相关的广告集合
2. 业务规则和相关性过滤 2. 业务规则和相关性过滤
3. 根据拍卖机制和 CTR 排序 3. 根据拍卖机制和 CTR 排序
4. 展出广告 4. 展出广告
...@@ -78,13 +93,11 @@ Figure 1. LR 和 DNN 模型结构对比 ...@@ -78,13 +93,11 @@ Figure 1. LR 和 DNN 模型结构对比
</p> </p>
LR 的蓝色箭头部分可以直接类比到 DNN 中对应的结构,可以看到 LR 和 DNN 有一些共通之处(比如权重累加), LR 的蓝色箭头部分可以直接类比到 DNN 中对应的结构,可以看到 LR 和 DNN 有一些共通之处(比如权重累加),
但前者的模型复杂度在相同输入维度下比后者可能低很多(从某方面讲,模型越复杂,越有潜力学习到更复杂的信息)。 但前者的模型复杂度在相同输入维度下比后者可能低很多(从某方面讲,模型越复杂,越有潜力学习到更复杂的信息);
如果 LR 要达到匹敌 DNN 的学习能力,必须增加输入的维度,也就是增加特征的数量, 如果 LR 要达到匹敌 DNN 的学习能力,必须增加输入的维度,也就是增加特征的数量,
这也就是为何 LR 和大规模的特征工程必须绑定在一起的原因。 这也就是为何 LR 和大规模的特征工程必须绑定在一起的原因。
LR 对于 DNN 模型的优势是对大规模稀疏特征的容纳能力,包括内存和计算量等方面,工业界都有非常成熟的优化方法。 LR 对于 DNN 模型的优势是对大规模稀疏特征的容纳能力,包括内存和计算量等方面,工业界都有非常成熟的优化方法;
而 DNN 模型具有自己学习新特征的能力,一定程度上能够提升特征使用的效率, 而 DNN 模型具有自己学习新特征的能力,一定程度上能够提升特征使用的效率,
这使得 DNN 模型在同样规模特征的情况下,更有可能达到更好的学习效果。 这使得 DNN 模型在同样规模特征的情况下,更有可能达到更好的学习效果。
...@@ -101,10 +114,62 @@ LR 对于 DNN 模型的优势是对大规模稀疏特征的容纳能力,包括 ...@@ -101,10 +114,62 @@ LR 对于 DNN 模型的优势是对大规模稀疏特征的容纳能力,包括
我们直接使用第一种方法做分类任务。 我们直接使用第一种方法做分类任务。
我们使用 Kaggle 上 `Click-through rate prediction` 任务的数据集\[[2](https://www.kaggle.com/c/avazu-ctr-prediction/data)\] 来演示模型。 我们使用 Kaggle 上 `Click-through rate prediction` 任务的数据集\[[2](https://www.kaggle.com/c/avazu-ctr-prediction/data)\] 来演示本例中的模型。
具体的特征处理方法参看 [data process](./dataset.md)。
本教程中演示模型的输入格式如下:
```
# <dnn input ids> \t <lr input sparse values> \t click
1 23 190 \t 230:0.12 3421:0.9 23451:0.12 \t 0
23 231 \t 1230:0.12 13421:0.9 \t 1
```
详细的格式描述如下:
- `dnn input ids` 采用 one-hot 表示,只需要填写值为1的ID(注意这里不是变长输入)
- `lr input sparse values` 使用了 `ID:VALUE` 的表示,值部分最好规约到值域 `[-1, 1]`。
此外,模型训练时需要传入一个文件描述 dnn 和 lr两个子模型的输入维度,文件的格式如下:
具体的特征处理方法参看 [data process](./dataset.md) ```
dnn_input_dim: <int>
lr_input_dim: <int>
```
其中, `<int>` 表示一个整型数值。
本目录下的 `avazu_data_processor.py` 可以对下载的演示数据集\[[2](#参考文档)\] 进行处理,具体使用方法参考如下说明:
```
usage: avazu_data_processer.py [-h] --data_path DATA_PATH --output_dir
OUTPUT_DIR
[--num_lines_to_detect NUM_LINES_TO_DETECT]
[--test_set_size TEST_SET_SIZE]
[--train_size TRAIN_SIZE]
PaddlePaddle CTR example
optional arguments:
-h, --help show this help message and exit
--data_path DATA_PATH
path of the Avazu dataset
--output_dir OUTPUT_DIR
directory to output
--num_lines_to_detect NUM_LINES_TO_DETECT
number of records to detect dataset's meta info
--test_set_size TEST_SET_SIZE
size of the validation dataset(default: 10000)
--train_size TRAIN_SIZE
size of the trainset (default: 100000)
```
- `data_path` 是待处理的数据路径
- `output_dir` 生成数据的输出路径
- `num_lines_to_detect` 预先扫描数据生成ID的个数,这里是扫描的文件行数
- `test_set_size` 生成测试集的行数
- `train_size` 生成训练姐的行数
## Wide & Deep Learning Model ## Wide & Deep Learning Model
...@@ -243,18 +308,20 @@ trainer.train( ...@@ -243,18 +308,20 @@ trainer.train(
## 运行训练和测试 ## 运行训练和测试
训练模型需要如下步骤: 训练模型需要如下步骤:
1. 下载训练数据,可以使用 Kaggle 上 CTR 比赛的数据\[[2](#参考文献)\] 1. 准备训练数据
1. 从 [Kaggle CTR](https://www.kaggle.com/c/avazu-ctr-prediction/data) 下载 train.gz 1. 从 [Kaggle CTR](https://www.kaggle.com/c/avazu-ctr-prediction/data) 下载 train.gz
2. 解压 train.gz 得到 train.txt 2. 解压 train.gz 得到 train.txt
2. 执行 `python train.py --train_data_path train.txt` ,开始训练 3. `mkdir -p output; python avazu_data_processer.py --data_path train.txt --output_dir output --num_lines_to_detect 1000 --test_set_size 100` 生成演示数据
2. 执行 `python train.py --train_data_path ./output/train.txt --test_data_path ./output/test.txt --data_meta_file ./output/data.meta.txt --model_type=0` 开始训练
上面第2个步骤可以为 `train.py` 填充命令行参数来定制模型的训练过程,具体的命令行参数及用法如下 上面第2个步骤可以为 `train.py` 填充命令行参数来定制模型的训练过程,具体的命令行参数及用法如下
``` ```
usage: train.py [-h] --train_data_path TRAIN_DATA_PATH usage: train.py [-h] --train_data_path TRAIN_DATA_PATH
[--batch_size BATCH_SIZE] [--test_set_size TEST_SET_SIZE] [--test_data_path TEST_DATA_PATH] [--batch_size BATCH_SIZE]
[--num_passes NUM_PASSES] [--num_passes NUM_PASSES]
[--num_lines_to_detact NUM_LINES_TO_DETACT] [--model_output_prefix MODEL_OUTPUT_PREFIX] --data_meta_file
DATA_META_FILE --model_type MODEL_TYPE
PaddlePaddle CTR example PaddlePaddle CTR example
...@@ -262,16 +329,78 @@ optional arguments: ...@@ -262,16 +329,78 @@ optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--train_data_path TRAIN_DATA_PATH --train_data_path TRAIN_DATA_PATH
path of training dataset path of training dataset
--test_data_path TEST_DATA_PATH
path of testing dataset
--batch_size BATCH_SIZE --batch_size BATCH_SIZE
size of mini-batch (default:10000) size of mini-batch (default:10000)
--test_set_size TEST_SET_SIZE
size of the validation dataset(default: 10000)
--num_passes NUM_PASSES --num_passes NUM_PASSES
number of passes to train number of passes to train
--num_lines_to_detact NUM_LINES_TO_DETACT --model_output_prefix MODEL_OUTPUT_PREFIX
number of records to detect dataset's meta info prefix of path for model to store (default:
./ctr_models)
--data_meta_file DATA_META_FILE
path of data meta info file
--model_type MODEL_TYPE
model type, classification: 0, regression 1 (default
classification)
```
- `train_data_path` : 训练集的路径
- `test_data_path` : 测试集的路径
- `num_passes`: 模型训练多少轮
- `data_meta_file`: 参考[数据和任务抽象](### 数据和任务抽象)的描述。
- `model_type`: 模型分类或回归
## 用训好的模型做预测
训好的模型可以用来预测新的数据, 预测数据的格式为
```
# <dnn input ids> \t <lr input sparse values>
1 23 190 \t 230:0.12 3421:0.9 23451:0.12
23 231 \t 1230:0.12 13421:0.9
``` ```
这里与训练数据的格式唯一不同的地方,就是没有标签,也就是训练数据中第3列 `click` 对应的数值。
`infer.py` 的使用方法如下
```
usage: infer.py [-h] --model_gz_path MODEL_GZ_PATH --data_path DATA_PATH
--prediction_output_path PREDICTION_OUTPUT_PATH
[--data_meta_path DATA_META_PATH] --model_type MODEL_TYPE
PaddlePaddle CTR example
optional arguments:
-h, --help show this help message and exit
--model_gz_path MODEL_GZ_PATH
path of model parameters gz file
--data_path DATA_PATH
path of the dataset to infer
--prediction_output_path PREDICTION_OUTPUT_PATH
path to output the prediction
--data_meta_path DATA_META_PATH
path of trainset's meta info, default is ./data.meta
--model_type MODEL_TYPE
model type, classification: 0, regression 1 (default
classification)
```
- `model_gz_path_model`:用 `gz` 压缩过的模型路径
- `data_path` : 需要预测的数据路径
- `prediction_output_paht`:预测输出的路径
- `data_meta_file` :参考[数据和任务抽象](### 数据和任务抽象)的描述。
- `model_type` :分类或回归
示例数据可以用如下命令预测
```
python infer.py --model_gz_path <model_path> --data_path output/infer.txt --prediction_output_path predictions.txt --data_meta_path data.meta.txt
```
最终的预测结果位于 `predictions.txt`。
## 参考文献 ## 参考文献
1. <https://en.wikipedia.org/wiki/Click-through_rate> 1. <https://en.wikipedia.org/wiki/Click-through_rate>
2. <https://www.kaggle.com/c/avazu-ctr-prediction/data> 2. <https://www.kaggle.com/c/avazu-ctr-prediction/data>
......
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import gzip
import argparse
import itertools
import paddle.v2 as paddle
import network_conf
from train import dnn_layer_dims
import reader
from utils import logger, ModelType
parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
parser.add_argument(
'--model_gz_path',
type=str,
required=True,
help="path of model parameters gz file")
parser.add_argument(
'--data_path', type=str, required=True, help="path of the dataset to infer")
parser.add_argument(
'--prediction_output_path',
type=str,
required=True,
help="path to output the prediction")
parser.add_argument(
'--data_meta_path',
type=str,
default="./data.meta",
help="path of trainset's meta info, default is ./data.meta")
parser.add_argument(
'--model_type',
type=int,
required=True,
default=ModelType.CLASSIFICATION,
help='model type, classification: %d, regression %d (default classification)'
% (ModelType.CLASSIFICATION, ModelType.REGRESSION))
args = parser.parse_args()
paddle.init(use_gpu=False, trainer_count=1)
class CTRInferer(object):
def __init__(self, param_path):
logger.info("create CTR model")
dnn_input_dim, lr_input_dim = reader.load_data_meta(args.data_meta_path)
# create the mdoel
self.ctr_model = network_conf.CTRmodel(
dnn_layer_dims,
dnn_input_dim,
lr_input_dim,
model_type=ModelType(args.model_type),
is_infer=True)
# load parameter
logger.info("load model parameters from %s" % param_path)
self.parameters = paddle.parameters.Parameters.from_tar(
gzip.open(param_path, 'r'))
self.inferer = paddle.inference.Inference(
output_layer=self.ctr_model.model,
parameters=self.parameters, )
def infer(self, data_path):
logger.info("infer data...")
dataset = reader.Dataset()
infer_reader = paddle.batch(
dataset.infer(args.data_path), batch_size=1000)
logger.warning('write predictions to %s' % args.prediction_output_path)
output_f = open(args.prediction_output_path, 'w')
for id, batch in enumerate(infer_reader()):
res = self.inferer.infer(input=batch)
predictions = [x for x in itertools.chain.from_iterable(res)]
assert len(batch) == len(
predictions), "predict error, %d inputs, but %d predictions" % (
len(batch), len(predictions))
output_f.write('\n'.join(map(str, predictions)) + '\n')
if __name__ == '__main__':
ctr_inferer = CTRInferer(args.model_gz_path)
ctr_inferer.infer(args.data_path)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import paddle.v2 as paddle
from paddle.v2 import layer
from paddle.v2 import data_type as dtype
from utils import logger, ModelType
class CTRmodel(object):
'''
A CTR model which implements wide && deep learning model.
'''
def __init__(self,
dnn_layer_dims,
dnn_input_dim,
lr_input_dim,
model_type=ModelType.create_classification(),
is_infer=False):
'''
@dnn_layer_dims: list of integer
dims of each layer in dnn
@dnn_input_dim: int
size of dnn's input layer
@lr_input_dim: int
size of lr's input layer
@is_infer: bool
whether to build a infer model
'''
self.dnn_layer_dims = dnn_layer_dims
self.dnn_input_dim = dnn_input_dim
self.lr_input_dim = lr_input_dim
self.model_type = model_type
self.is_infer = is_infer
self._declare_input_layers()
self.dnn = self._build_dnn_submodel_(self.dnn_layer_dims)
self.lr = self._build_lr_submodel_()
# model's prediction
# TODO(superjom) rename it to prediction
if self.model_type.is_classification():
self.model = self._build_classification_model(self.dnn, self.lr)
if self.model_type.is_regression():
self.model = self._build_regression_model(self.dnn, self.lr)
def _declare_input_layers(self):
self.dnn_merged_input = layer.data(
name='dnn_input',
type=paddle.data_type.sparse_binary_vector(self.dnn_input_dim))
self.lr_merged_input = layer.data(
name='lr_input',
type=paddle.data_type.sparse_vector(self.lr_input_dim))
if not self.is_infer:
self.click = paddle.layer.data(
name='click', type=dtype.dense_vector(1))
def _build_dnn_submodel_(self, dnn_layer_dims):
'''
build DNN submodel.
'''
dnn_embedding = layer.fc(
input=self.dnn_merged_input, size=dnn_layer_dims[0])
_input_layer = dnn_embedding
for i, dim in enumerate(dnn_layer_dims[1:]):
fc = layer.fc(
input=_input_layer,
size=dim,
act=paddle.activation.Relu(),
name='dnn-fc-%d' % i)
_input_layer = fc
return _input_layer
def _build_lr_submodel_(self):
'''
config LR submodel
'''
fc = layer.fc(
input=self.lr_merged_input, size=1, act=paddle.activation.Relu())
return fc
def _build_classification_model(self, dnn, lr):
merge_layer = layer.concat(input=[dnn, lr])
self.output = layer.fc(
input=merge_layer,
size=1,
# use sigmoid function to approximate ctr rate, a float value between 0 and 1.
act=paddle.activation.Sigmoid())
if not self.is_infer:
self.train_cost = paddle.layer.multi_binary_label_cross_entropy_cost(
input=self.output, label=self.click)
return self.output
def _build_regression_model(self, dnn, lr):
merge_layer = layer.concat(input=[dnn, lr])
self.output = layer.fc(
input=merge_layer, size=1, act=paddle.activation.Sigmoid())
if not self.is_infer:
self.train_cost = paddle.layer.mse_cost(
input=self.output, label=self.click)
return self.output
from utils import logger, TaskMode, load_dnn_input_record, load_lr_input_record
feeding_index = {'dnn_input': 0, 'lr_input': 1, 'click': 2}
class Dataset(object):
def __init__(self):
self.mode = TaskMode.create_train()
def train(self, path):
'''
Load trainset.
'''
logger.info("load trainset from %s" % path)
self.mode = TaskMode.create_train()
self.path = path
return self._parse
def test(self, path):
'''
Load testset.
'''
logger.info("load testset from %s" % path)
self.path = path
self.mode = TaskMode.create_test()
return self._parse
def infer(self, path):
'''
Load infer set.
'''
logger.info("load inferset from %s" % path)
self.path = path
self.mode = TaskMode.create_infer()
return self._parse
def _parse(self):
'''
Parse dataset.
'''
with open(self.path) as f:
for line_id, line in enumerate(f):
fs = line.strip().split('\t')
dnn_input = load_dnn_input_record(fs[0])
lr_input = load_lr_input_record(fs[1])
if not self.mode.is_infer():
click = [int(fs[2])]
yield dnn_input, lr_input, click
else:
yield dnn_input, lr_input
def load_data_meta(path):
'''
load data meta info from path, return (dnn_input_dim, lr_input_dim)
'''
with open(path) as f:
lines = f.read().split('\n')
err_info = "wrong meta format"
assert len(lines) == 2, err_info
assert 'dnn_input_dim:' in lines[0] and 'lr_input_dim:' in lines[
1], err_info
res = map(int, [_.split(':')[1] for _ in lines])
logger.info('dnn input dim: %d' % res[0])
logger.info('lr input dim: %d' % res[1])
return res
#!/usr/bin/env python #!/usr/bin/env python
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-import os
import argparse import argparse
import logging import gzip
import reader
import paddle.v2 as paddle import paddle.v2 as paddle
from paddle.v2 import layer from utils import logger, ModelType
from paddle.v2 import data_type as dtype from network_conf import CTRmodel
from data_provider import field_index, detect_dataset, AvazuDataset
parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
parser.add_argument( def parse_args():
parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
parser.add_argument(
'--train_data_path', '--train_data_path',
type=str, type=str,
required=True, required=True,
help="path of training dataset") help="path of training dataset")
parser.add_argument( parser.add_argument(
'--test_data_path', type=str, help='path of testing dataset')
parser.add_argument(
'--batch_size', '--batch_size',
type=int, type=int,
default=10000, default=10000,
help="size of mini-batch (default:10000)") help="size of mini-batch (default:10000)")
parser.add_argument( parser.add_argument(
'--test_set_size',
type=int,
default=10000,
help="size of the validation dataset(default: 10000)")
parser.add_argument(
'--num_passes', type=int, default=10, help="number of passes to train") '--num_passes', type=int, default=10, help="number of passes to train")
parser.add_argument( parser.add_argument(
'--num_lines_to_detact', '--model_output_prefix',
type=str,
default='./ctr_models',
help='prefix of path for model to store (default: ./ctr_models)')
parser.add_argument(
'--data_meta_file',
type=str,
required=True,
help='path of data meta info file', )
parser.add_argument(
'--model_type',
type=int, type=int,
default=500000, required=True,
help="number of records to detect dataset's meta info") default=ModelType.CLASSIFICATION,
help='model type, classification: %d, regression %d (default classification)'
args = parser.parse_args() % (ModelType.CLASSIFICATION, ModelType.REGRESSION))
dnn_layer_dims = [128, 64, 32, 1]
data_meta_info = detect_dataset(args.train_data_path, args.num_lines_to_detact)
logging.warning('detect categorical fields in dataset %s' %
args.train_data_path)
for key, item in data_meta_info.items():
logging.warning(' - {}\t{}'.format(key, item))
paddle.init(use_gpu=False, trainer_count=1)
# ==============================================================================
# input layers
# ==============================================================================
dnn_merged_input = layer.data(
name='dnn_input',
type=paddle.data_type.sparse_binary_vector(data_meta_info['dnn_input']))
lr_merged_input = layer.data(
name='lr_input',
type=paddle.data_type.sparse_binary_vector(data_meta_info['lr_input']))
click = paddle.layer.data(name='click', type=dtype.dense_vector(1)) return parser.parse_args()
# ============================================================================== dnn_layer_dims = [128, 64, 32, 1]
# network structure
# ==============================================================================
def build_dnn_submodel(dnn_layer_dims):
dnn_embedding = layer.fc(input=dnn_merged_input, size=dnn_layer_dims[0])
_input_layer = dnn_embedding
for i, dim in enumerate(dnn_layer_dims[1:]):
fc = layer.fc(
input=_input_layer,
size=dim,
act=paddle.activation.Relu(),
name='dnn-fc-%d' % i)
_input_layer = fc
return _input_layer
# config LR submodel
def build_lr_submodel():
fc = layer.fc(
input=lr_merged_input, size=1, name='lr', act=paddle.activation.Relu())
return fc
# conbine DNN and LR submodels
def combine_submodels(dnn, lr):
merge_layer = layer.concat(input=[dnn, lr])
fc = layer.fc(
input=merge_layer,
size=1,
name='output',
# use sigmoid function to approximate ctr rate, a float value between 0 and 1.
act=paddle.activation.Sigmoid())
return fc
dnn = build_dnn_submodel(dnn_layer_dims)
lr = build_lr_submodel()
output = combine_submodels(dnn, lr)
# ============================================================================== # ==============================================================================
# cost and train period # cost and train period
# ============================================================================== # ==============================================================================
classification_cost = paddle.layer.multi_binary_label_cross_entropy_cost(
input=output, label=click)
params = paddle.parameters.create(classification_cost)
optimizer = paddle.optimizer.Momentum(momentum=0.01) def train():
args = parse_args()
args.model_type = ModelType(args.model_type)
paddle.init(use_gpu=False, trainer_count=1)
dnn_input_dim, lr_input_dim = reader.load_data_meta(args.data_meta_file)
trainer = paddle.trainer.SGD( # create ctr model.
cost=classification_cost, parameters=params, update_equation=optimizer) model = CTRmodel(
dnn_layer_dims,
dnn_input_dim,
lr_input_dim,
model_type=args.model_type,
is_infer=False)
dataset = AvazuDataset( params = paddle.parameters.create(model.train_cost)
args.train_data_path, n_records_as_test=args.test_set_size) optimizer = paddle.optimizer.AdaGrad()
trainer = paddle.trainer.SGD(
cost=model.train_cost, parameters=params, update_equation=optimizer)
def event_handler(event): dataset = reader.Dataset()
def __event_handler__(event):
if isinstance(event, paddle.event.EndIteration): if isinstance(event, paddle.event.EndIteration):
num_samples = event.batch_id * args.batch_size num_samples = event.batch_id * args.batch_size
if event.batch_id % 100 == 0: if event.batch_id % 100 == 0:
logging.warning("Pass %d, Samples %d, Cost %f" % logger.warning("Pass %d, Samples %d, Cost %f, %s" % (
(event.pass_id, num_samples, event.cost)) event.pass_id, num_samples, event.cost, event.metrics))
if event.batch_id % 1000 == 0: if event.batch_id % 1000 == 0:
if args.test_data_path:
result = trainer.test( result = trainer.test(
reader=paddle.batch(dataset.test, batch_size=args.batch_size),
feeding=field_index)
logging.warning("Test %d-%d, Cost %f" %
(event.pass_id, event.batch_id, result.cost))
trainer.train(
reader=paddle.batch( reader=paddle.batch(
paddle.reader.shuffle(dataset.train, buf_size=500), dataset.test(args.test_data_path),
batch_size=args.batch_size), batch_size=args.batch_size),
feeding=field_index, feeding=reader.feeding_index)
event_handler=event_handler, logger.warning("Test %d-%d, Cost %f, %s" %
(event.pass_id, event.batch_id, result.cost,
result.metrics))
path = "{}-pass-{}-batch-{}-test-{}.tar.gz".format(
args.model_output_prefix, event.pass_id, event.batch_id,
result.cost)
with gzip.open(path, 'w') as f:
params.to_tar(f)
trainer.train(
reader=paddle.batch(
paddle.reader.shuffle(
dataset.train(args.train_data_path), buf_size=500),
batch_size=args.batch_size),
feeding=reader.feeding_index,
event_handler=__event_handler__,
num_passes=args.num_passes) num_passes=args.num_passes)
if __name__ == '__main__':
train()
import logging
logging.basicConfig()
logger = logging.getLogger("paddle")
logger.setLevel(logging.INFO)
class TaskMode:
TRAIN_MODE = 0
TEST_MODE = 1
INFER_MODE = 2
def __init__(self, mode):
self.mode = mode
def is_train(self):
return self.mode == self.TRAIN_MODE
def is_test(self):
return self.mode == self.TEST_MODE
def is_infer(self):
return self.mode == self.INFER_MODE
@staticmethod
def create_train():
return TaskMode(TaskMode.TRAIN_MODE)
@staticmethod
def create_test():
return TaskMode(TaskMode.TEST_MODE)
@staticmethod
def create_infer():
return TaskMode(TaskMode.INFER_MODE)
class ModelType:
CLASSIFICATION = 0
REGRESSION = 1
def __init__(self, mode):
self.mode = mode
def is_classification(self):
return self.mode == self.CLASSIFICATION
def is_regression(self):
return self.mode == self.REGRESSION
@staticmethod
def create_classification():
return ModelType(ModelType.CLASSIFICATION)
@staticmethod
def create_regression():
return ModelType(ModelType.REGRESSION)
def load_dnn_input_record(sent):
return map(int, sent.split())
def load_lr_input_record(sent):
res = []
for _ in [x.split(':') for x in sent.split()]:
res.append((int(_[0]), float(_[1]), ))
return res
manifest*
mean_std.npz
thirdparty/
# Deep Speech 2 on PaddlePaddle # DeepSpeech2 on PaddlePaddle
## Installation ## Installation
Please replace `$PADDLE_INSTALL_DIR` with your own paddle installation directory.
``` ```
sh setup.sh sh setup.sh
export LD_LIBRARY_PATH=$PADDLE_INSTALL_DIR/Paddle/third_party/install/warpctc/lib:$LD_LIBRARY_PATH
``` ```
For some machines, we also need to install libsndfile1. Details to be added. Please replace `$PADDLE_INSTALL_DIR` with your own paddle installation directory.
## Usage ## Usage
...@@ -35,15 +32,21 @@ python datasets/librispeech/librispeech.py --help ...@@ -35,15 +32,21 @@ python datasets/librispeech/librispeech.py --help
### Preparing for Training ### Preparing for Training
``` ```
python compute_mean_std.py python tools/compute_mean_std.py
```
It will compute mean and stdandard deviation for audio features, and save them to a file with a default name `./mean_std.npz`. This file will be used in both training and inferencing. The default feature of audio data is power spectrum, and the mfcc feature is also supported. To train and infer based on mfcc feature, please generate this file by
```
python tools/compute_mean_std.py --specgram_type mfcc
``` ```
`python compute_mean_std.py` computes mean and stdandard deviation for audio features, and save them to a file with a default name `./mean_std.npz`. This file will be used in both training and inferencing. and specify ```--specgram_type mfcc``` when running train.py, infer.py, evaluator.py or tune.py.
More help for arguments: More help for arguments:
``` ```
python compute_mean_std.py --help python tools/compute_mean_std.py --help
``` ```
### Training ### Training
...@@ -66,12 +69,31 @@ More help for arguments: ...@@ -66,12 +69,31 @@ More help for arguments:
python train.py --help python train.py --help
``` ```
### Inferencing ### Preparing language model
The following steps, inference, parameters tuning and evaluating, will require a language model during decoding.
A compressed language model is provided and can be accessed by
```
cd ./lm
sh run.sh
cd ..
```
### Inference
For GPU inference
``` ```
CUDA_VISIBLE_DEVICES=0 python infer.py CUDA_VISIBLE_DEVICES=0 python infer.py
``` ```
For CPU inference
```
python infer.py --use_gpu=False
```
More help for arguments: More help for arguments:
``` ```
...@@ -92,14 +114,55 @@ python evaluate.py --help ...@@ -92,14 +114,55 @@ python evaluate.py --help
### Parameters tuning ### Parameters tuning
Parameters tuning for the CTC beam search decoder Usually, the parameters $\alpha$ and $\beta$ for the CTC [prefix beam search](https://arxiv.org/abs/1408.2873) decoder need to be tuned after retraining the acoustic model.
For GPU tuning
``` ```
CUDA_VISIBLE_DEVICES=0 python tune.py CUDA_VISIBLE_DEVICES=0 python tune.py
``` ```
For CPU tuning
```
python tune.py --use_gpu=False
```
More help for arguments: More help for arguments:
``` ```
python tune.py --help python tune.py --help
``` ```
Then reset parameters with the tuning result before inference or evaluating.
### Playing with the ASR Demo
A real-time ASR demo is built for users to try out the ASR model with their own voice. Please do the following installation on the machine you'd like to run the demo's client (no need for the machine running the demo's server).
For example, on MAC OS X:
```
brew install portaudio
pip install pyaudio
pip install pynput
```
After a model and language model is prepared, we can first start the demo's server:
```
CUDA_VISIBLE_DEVICES=0 python demo_server.py
```
And then in another console, start the demo's client:
```
python demo_client.py
```
On the client console, press and hold the "white-space" key on the keyboard to start talking, until you finish your speech and then release the "white-space" key. The decoding results (infered transcription) will be displayed.
It could be possible to start the server and the client in two seperate machines, e.g. `demo_client.py` is usually started in a machine with a microphone hardware, while `demo_server.py` is usually started in a remote server with powerful GPUs. Please first make sure that these two machines have network access to each other, and then use `--host_ip` and `--host_port` to indicate the server machine's actual IP address (instead of the `localhost` as default) and TCP port, in both `demo_server.py` and `demo_client.py`.
## PaddleCloud Training
If you wish to train DeepSpeech2 on PaddleCloud, please refer to
[Train DeepSpeech2 on PaddleCloud](https://github.com/PaddlePaddle/models/tree/develop/deep_speech_2/cloud).
# Train DeepSpeech2 on PaddleCloud
>Note:
>Please make sure [PaddleCloud Client](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#%E4%B8%8B%E8%BD%BD%E5%B9%B6%E9%85%8D%E7%BD%AEpaddlecloud) has be installed and current directory is `deep_speech_2/cloud/`
## Step 1: Upload Data
Provided with several input manifests, `pcloud_upload_data.sh` will pack and upload all the containing audio files to PaddleCloud filesystem, and also generate some corresponding manifest files with updated cloud paths.
Please modify the following arguments in `pcloud_upload_data.sh`:
- `IN_MANIFESTS`: Paths (in local filesystem) of manifest files containing the audio files to be uploaded. Multiple paths can be concatenated with a whitespace delimeter.
- `OUT_MANIFESTS`: Paths (in local filesystem) to write the updated output manifest files to. Multiple paths can be concatenated with a whitespace delimeter. The values of `audio_filepath` in the output manifests are updated with cloud filesystem paths.
- `CLOUD_DATA_DIR`: Directory (in PaddleCloud filesystem) to upload the data to. Don't forget to replace `USERNAME` in the default directory and make sure that you have the permission to write it.
- `NUM_SHARDS`: Number of data shards / parts (in tar files) to be generated when packing and uploading data. Smaller `num_shards` requires larger temoporal local disk space for packing data.
By running:
```
sh pcloud_upload_data.sh
```
all the audio files will be uploaded to PaddleCloud filesystem, and you will get modified manifests files in `OUT_MANIFESTS`.
You have to take this step only once, in the very first time you do the cloud training. Later on, the data is persisitent on the cloud filesystem and reusable for further job submissions.
## Step 2: Configure Training
Configure cloud training arguments in `pcloud_submit.sh`, with the following arguments:
- `TRAIN_MANIFEST`: Manifest filepath (in local filesystem) for training. Notice that the`audio_filepath` should be in cloud filesystem, like those generated by `pcloud_upload_data.sh`.
- `DEV_MANIFEST`: Manifest filepath (in local filesystem) for validation.
- `CLOUD_MODEL_DIR`: Directory (in PaddleCloud filesystem) to save the model parameters (checkpoints). Don't forget to replace `USERNAME` in the default directory and make sure that you have the permission to write it.
- `BATCH_SIZE`: Training batch size for a single node.
- `NUM_GPU`: Number of GPUs allocated for a single node.
- `NUM_NODE`: Number of nodes (machines) allocated for this job.
- `IS_LOCAL`: Set to False to enable parameter server, if using multiple nodes.
Configure other training hyper-parameters in `pcloud_train.sh` as you wish, just as what you can do in local training.
By running:
```
sh pcloud_submit.sh
```
you submit a training job to PaddleCloud. And you will see the job name when the submission is done.
## Step 3 Get Job Logs
Run this to list all the jobs you have submitted, as well as their running status:
```
paddlecloud get jobs
```
Run this, the corresponding job's logs will be printed.
```
paddlecloud logs -n 10000 $REPLACED_WITH_YOUR_ACTUAL_JOB_NAME
```
## More Help
For more information about the usage of PaddleCloud, please refer to [PaddleCloud Usage](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#提交任务).
"""Set up paths for DS2"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os.path
import sys
def add_path(path):
if path not in sys.path:
sys.path.insert(0, path)
this_dir = os.path.dirname(__file__)
proj_path = os.path.join(this_dir, '..')
add_path(proj_path)
TRAIN_MANIFEST="cloud/cloud.manifest.train"
DEV_MANIFEST="cloud/cloud.manifest.dev"
CLOUD_MODEL_DIR="/pfs/dlnel/home/USERNAME/deepspeech2/model"
BATCH_SIZE=256
NUM_GPU=8
NUM_NODE=1
IS_LOCAL="True"
JOB_NAME=deepspeech-`date +%Y%m%d%H%M%S`
DS2_PATH=${PWD%/*}
cp -f pcloud_train.sh ${DS2_PATH}
paddlecloud submit \
-image bootstrapper:5000/wanghaoshuang/pcloud_ds2:latest \
-jobname ${JOB_NAME} \
-cpu ${NUM_GPU} \
-gpu ${NUM_GPU} \
-memory 64Gi \
-parallelism ${NUM_NODE} \
-pscpu 1 \
-pservers 1 \
-psmemory 64Gi \
-passes 1 \
-entry "sh pcloud_train.sh ${TRAIN_MANIFEST} ${DEV_MANIFEST} ${CLOUD_MODEL_DIR} ${NUM_GPU} ${BATCH_SIZE} ${IS_LOCAL}" \
${DS2_PATH}
rm ${DS2_PATH}/pcloud_train.sh
TRAIN_MANIFEST=$1
DEV_MANIFEST=$2
MODEL_PATH=$3
NUM_GPU=$4
BATCH_SIZE=$5
IS_LOCAL=$6
python ./cloud/split_data.py \
--in_manifest_path=${TRAIN_MANIFEST} \
--out_manifest_path='/local.manifest.train'
python ./cloud/split_data.py \
--in_manifest_path=${DEV_MANIFEST} \
--out_manifest_path='/local.manifest.dev'
python train.py \
--batch_size=$BATCH_SIZE \
--use_gpu=1 \
--trainer_count=${NUM_GPU} \
--num_threads_data=${NUM_GPU} \
--is_local=${IS_LOCAL} \
--train_manifest_path='/local.manifest.train' \
--dev_manifest_path='/local.manifest.dev' \
--output_model_dir=${MODEL_PATH} \
IN_MANIFESTS="../datasets/manifest.train ../datasets/manifest.dev ../datasets/manifest.test"
OUT_MANIFESTS="./cloud.manifest.train ./cloud.manifest.dev ./cloud.manifest.test"
CLOUD_DATA_DIR="/pfs/dlnel/home/USERNAME/deepspeech2/data/librispeech"
NUM_SHARDS=50
python upload_data.py \
--in_manifest_paths ${IN_MANIFESTS} \
--out_manifest_paths ${OUT_MANIFESTS} \
--cloud_data_dir ${CLOUD_DATA_DIR} \
--num_shards ${NUM_SHARDS}
if [ $? -ne 0 ]
then
echo "Upload Data Failed!"
exit 1
fi
echo "All Done."
"""This tool is used for splitting data into each node of
paddlecloud. This script should be called in paddlecloud.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import json
import argparse
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--in_manifest_path",
type=str,
required=True,
help="Input manifest path for all nodes.")
parser.add_argument(
"--out_manifest_path",
type=str,
required=True,
help="Output manifest file path for current node.")
args = parser.parse_args()
def split_data(in_manifest_path, out_manifest_path):
with open("/trainer_id", "r") as f:
trainer_id = int(f.readline()[:-1])
with open("/trainer_count", "r") as f:
trainer_count = int(f.readline()[:-1])
out_manifest = []
for index, json_line in enumerate(open(in_manifest_path, 'r')):
if (index % trainer_count) == trainer_id:
out_manifest.append("%s\n" % json_line.strip())
with open(out_manifest_path, 'w') as f:
f.writelines(out_manifest)
if __name__ == '__main__':
split_data(args.in_manifest_path, args.out_manifest_path)
"""This script is for uploading data for DeepSpeech2 training on paddlecloud.
Steps:
1. Read original manifests and extract local sound files.
2. Tar all local sound files into multiple tar files and upload them.
3. Modify original manifests with updated paths in cloud filesystem.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import json
import os
import tarfile
import sys
import argparse
import shutil
from subprocess import call
import _init_paths
from data_utils.utils import read_manifest
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--in_manifest_paths",
default=[
"../datasets/manifest.train", "../datasets/manifest.dev",
"../datasets/manifest.test"
],
type=str,
nargs='+',
help="Local filepaths of input manifests to load, pack and upload."
"(default: %(default)s)")
parser.add_argument(
"--out_manifest_paths",
default=[
"./cloud.manifest.train", "./cloud.manifest.dev",
"./cloud.manifest.test"
],
type=str,
nargs='+',
help="Local filepaths of modified manifests to write to. "
"(default: %(default)s)")
parser.add_argument(
"--cloud_data_dir",
required=True,
type=str,
help="Destination directory on paddlecloud to upload data to.")
parser.add_argument(
"--num_shards",
default=10,
type=int,
help="Number of parts to split data to. (default: %(default)s)")
parser.add_argument(
"--local_tmp_dir",
default="./tmp/",
type=str,
help="Local directory for storing temporary data. (default: %(default)s)")
args = parser.parse_args()
def upload_data(in_manifest_path_list, out_manifest_path_list, local_tmp_dir,
upload_tar_dir, num_shards):
"""Extract and pack sound files listed in the manifest files into multple
tar files and upload them to padldecloud. Besides, generate new manifest
files with updated paths in paddlecloud.
"""
# compute total audio number
total_line = 0
for manifest_path in in_manifest_path_list:
with open(manifest_path, 'r') as f:
total_line += len(f.readlines())
line_per_tar = (total_line // num_shards) + 1
# pack and upload shard by shard
line_count, tar_file = 0, None
for manifest_path, out_manifest_path in zip(in_manifest_path_list,
out_manifest_path_list):
manifest = read_manifest(manifest_path)
out_manifest = []
for json_data in manifest:
sound_filepath = json_data['audio_filepath']
sound_filename = os.path.basename(sound_filepath)
if line_count % line_per_tar == 0:
if tar_file != None:
tar_file.close()
pcloud_cp(tar_path, upload_tar_dir)
os.remove(tar_path)
tar_name = 'part-%s-of-%s.tar' % (
str(line_count // line_per_tar).zfill(5),
str(num_shards).zfill(5))
tar_path = os.path.join(local_tmp_dir, tar_name)
tar_file = tarfile.open(tar_path, 'w')
tar_file.add(sound_filepath, arcname=sound_filename)
line_count += 1
json_data['audio_filepath'] = "tar:%s#%s" % (
os.path.join(upload_tar_dir, tar_name), sound_filename)
out_manifest.append("%s\n" % json.dumps(json_data))
with open(out_manifest_path, 'w') as f:
f.writelines(out_manifest)
pcloud_cp(out_manifest_path, upload_tar_dir)
tar_file.close()
pcloud_cp(tar_path, upload_tar_dir)
os.remove(tar_path)
def pcloud_mkdir(dir):
"""Make directory in PaddleCloud filesystem.
"""
if call(['paddlecloud', 'mkdir', dir]) != 0:
raise IOError("PaddleCloud mkdir failed: %s." % dir)
def pcloud_cp(src, dst):
"""Copy src from local filesytem to dst in PaddleCloud filesystem,
or downlowd src from PaddleCloud filesystem to dst in local filesystem.
"""
if call(['paddlecloud', 'cp', src, dst]) != 0:
raise IOError("PaddleCloud cp failed: from [%s] to [%s]." % (src, dst))
if __name__ == '__main__':
if not os.path.exists(args.local_tmp_dir):
os.makedirs(args.local_tmp_dir)
pcloud_mkdir(args.cloud_data_dir)
upload_data(args.in_manifest_paths, args.out_manifest_paths,
args.local_tmp_dir, args.cloud_data_dir, args.num_shards)
shutil.rmtree(args.local_tmp_dir)
[
{
"type": "shift",
"params": {"min_shift_ms": -5,
"max_shift_ms": 5},
"prob": 1.0
}
]
[
{
"type": "noise",
"params": {"min_snr_dB": 40,
"max_snr_dB": 50,
"noise_manifest_path": "datasets/manifest.noise"},
"prob": 0.6
},
{
"type": "impulse",
"params": {"impulse_manifest_path": "datasets/manifest.impulse"},
"prob": 0.5
},
{
"type": "speed",
"params": {"min_speed_rate": 0.95,
"max_speed_rate": 1.05},
"prob": 0.5
},
{
"type": "shift",
"params": {"min_shift_ms": -5,
"max_shift_ms": 5},
"prob": 1.0
},
{
"type": "volume",
"params": {"min_gain_dBFS": -10,
"max_gain_dBFS": 10},
"prob": 0.0
},
{
"type": "bayesian_normal",
"params": {"target_db": -20,
"prior_db": -20,
"prior_samples": 100},
"prob": 0.0
}
]
...@@ -204,7 +204,7 @@ class AudioSegment(object): ...@@ -204,7 +204,7 @@ class AudioSegment(object):
:raise ValueError: If the sample rates of the two segments are not :raise ValueError: If the sample rates of the two segments are not
equal, or if the lengths of segments don't match. equal, or if the lengths of segments don't match.
""" """
if type(self) != type(other): if isinstance(other, type(self)):
raise TypeError("Cannot add segments of different types: %s " raise TypeError("Cannot add segments of different types: %s "
"and %s." % (type(self), type(other))) "and %s." % (type(self), type(other)))
if self._sample_rate != other._sample_rate: if self._sample_rate != other._sample_rate:
...@@ -231,7 +231,7 @@ class AudioSegment(object): ...@@ -231,7 +231,7 @@ class AudioSegment(object):
Note that this is an in-place transformation. Note that this is an in-place transformation.
:param gain: Gain in decibels to apply to samples. :param gain: Gain in decibels to apply to samples.
:type gain: float :type gain: float|1darray
""" """
self._samples *= 10.**(gain / 20.) self._samples *= 10.**(gain / 20.)
...@@ -457,9 +457,9 @@ class AudioSegment(object): ...@@ -457,9 +457,9 @@ class AudioSegment(object):
audio segments when resample is not allowed. audio segments when resample is not allowed.
""" """
if allow_resample and self.sample_rate != impulse_segment.sample_rate: if allow_resample and self.sample_rate != impulse_segment.sample_rate:
impulse_segment = impulse_segment.resample(self.sample_rate) impulse_segment.resample(self.sample_rate)
if self.sample_rate != impulse_segment.sample_rate: if self.sample_rate != impulse_segment.sample_rate:
raise ValueError("Impulse segment's sample rate (%d Hz) is not" raise ValueError("Impulse segment's sample rate (%d Hz) is not "
"equal to base signal sample rate (%d Hz)." % "equal to base signal sample rate (%d Hz)." %
(impulse_segment.sample_rate, self.sample_rate)) (impulse_segment.sample_rate, self.sample_rate))
samples = signal.fftconvolve(self.samples, impulse_segment.samples, samples = signal.fftconvolve(self.samples, impulse_segment.samples,
......
...@@ -8,6 +8,8 @@ import random ...@@ -8,6 +8,8 @@ import random
from data_utils.augmentor.volume_perturb import VolumePerturbAugmentor from data_utils.augmentor.volume_perturb import VolumePerturbAugmentor
from data_utils.augmentor.shift_perturb import ShiftPerturbAugmentor from data_utils.augmentor.shift_perturb import ShiftPerturbAugmentor
from data_utils.augmentor.speed_perturb import SpeedPerturbAugmentor from data_utils.augmentor.speed_perturb import SpeedPerturbAugmentor
from data_utils.augmentor.noise_perturb import NoisePerturbAugmentor
from data_utils.augmentor.impulse_response import ImpulseResponseAugmentor
from data_utils.augmentor.resample import ResampleAugmentor from data_utils.augmentor.resample import ResampleAugmentor
from data_utils.augmentor.online_bayesian_normalization import \ from data_utils.augmentor.online_bayesian_normalization import \
OnlineBayesianNormalizationAugmentor OnlineBayesianNormalizationAugmentor
...@@ -24,20 +26,45 @@ class AugmentationPipeline(object): ...@@ -24,20 +26,45 @@ class AugmentationPipeline(object):
.. code-block:: .. code-block::
'[{"type": "volume", [ {
"params": {"min_gain_dBFS": -15, "type": "noise",
"max_gain_dBFS": 15}, "params": {"min_snr_dB": 10,
"prob": 0.5}, "max_snr_dB": 20,
{"type": "speed", "noise_manifest_path": "datasets/manifest.noise"},
"params": {"min_speed_rate": 0.8, "prob": 0.0
"max_speed_rate": 1.2}, },
"prob": 0.5} {
]' "type": "speed",
"params": {"min_speed_rate": 0.9,
"max_speed_rate": 1.1},
"prob": 1.0
},
{
"type": "shift",
"params": {"min_shift_ms": -5,
"max_shift_ms": 5},
"prob": 1.0
},
{
"type": "volume",
"params": {"min_gain_dBFS": -10,
"max_gain_dBFS": 10},
"prob": 0.0
},
{
"type": "bayesian_normal",
"params": {"target_db": -20,
"prior_db": -20,
"prior_samples": 100},
"prob": 0.0
}
]
This augmentation configuration inserts two augmentation models This augmentation configuration inserts two augmentation models
into the pipeline, with one is VolumePerturbAugmentor and the other into the pipeline, with one is VolumePerturbAugmentor and the other
SpeedPerturbAugmentor. "prob" indicates the probability of the current SpeedPerturbAugmentor. "prob" indicates the probability of the current
augmentor to take effect. augmentor to take effect. If "prob" is zero, the augmentor does not take
effect.
:param augmentation_config: Augmentation configuration in json string. :param augmentation_config: Augmentation configuration in json string.
:type augmentation_config: str :type augmentation_config: str
...@@ -60,7 +87,7 @@ class AugmentationPipeline(object): ...@@ -60,7 +87,7 @@ class AugmentationPipeline(object):
:type audio_segment: AudioSegmenet|SpeechSegment :type audio_segment: AudioSegmenet|SpeechSegment
""" """
for augmentor, rate in zip(self._augmentors, self._rates): for augmentor, rate in zip(self._augmentors, self._rates):
if self._rng.uniform(0., 1.) <= rate: if self._rng.uniform(0., 1.) < rate:
augmentor.transform_audio(audio_segment) augmentor.transform_audio(audio_segment)
def _parse_pipeline_from(self, config_json): def _parse_pipeline_from(self, config_json):
...@@ -89,5 +116,9 @@ class AugmentationPipeline(object): ...@@ -89,5 +116,9 @@ class AugmentationPipeline(object):
return ResampleAugmentor(self._rng, **params) return ResampleAugmentor(self._rng, **params)
elif augmentor_type == "bayesian_normal": elif augmentor_type == "bayesian_normal":
return OnlineBayesianNormalizationAugmentor(self._rng, **params) return OnlineBayesianNormalizationAugmentor(self._rng, **params)
elif augmentor_type == "noise":
return NoisePerturbAugmentor(self._rng, **params)
elif augmentor_type == "impulse":
return ImpulseResponseAugmentor(self._rng, **params)
else: else:
raise ValueError("Unknown augmentor type [%s]." % augmentor_type) raise ValueError("Unknown augmentor type [%s]." % augmentor_type)
"""Contains the impulse response augmentation model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from data_utils.augmentor.base import AugmentorBase
from data_utils import utils
from data_utils.audio import AudioSegment
class ImpulseResponseAugmentor(AugmentorBase):
"""Augmentation model for adding impulse response effect.
:param rng: Random generator object.
:type rng: random.Random
:param impulse_manifest_path: Manifest path for impulse audio data.
:type impulse_manifest_path: basestring
"""
def __init__(self, rng, impulse_manifest_path):
self._rng = rng
self._impulse_manifest = utils.read_manifest(
manifest_path=impulse_manifest_path)
def transform_audio(self, audio_segment):
"""Add impulse response effect.
Note that this is an in-place transformation.
:param audio_segment: Audio segment to add effects to.
:type audio_segment: AudioSegmenet|SpeechSegment
"""
impulse_json = self._rng.sample(self._impulse_manifest, 1)[0]
impulse_segment = AudioSegment.from_file(impulse_json['audio_filepath'])
audio_segment.convolve(impulse_segment, allow_resample=True)
"""Contains the noise perturb augmentation model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from data_utils.augmentor.base import AugmentorBase
from data_utils import utils
from data_utils.audio import AudioSegment
class NoisePerturbAugmentor(AugmentorBase):
"""Augmentation model for adding background noise.
:param rng: Random generator object.
:type rng: random.Random
:param min_snr_dB: Minimal signal noise ratio, in decibels.
:type min_snr_dB: float
:param max_snr_dB: Maximal signal noise ratio, in decibels.
:type max_snr_dB: float
:param noise_manifest_path: Manifest path for noise audio data.
:type noise_manifest_path: basestring
"""
def __init__(self, rng, min_snr_dB, max_snr_dB, noise_manifest_path):
self._min_snr_dB = min_snr_dB
self._max_snr_dB = max_snr_dB
self._rng = rng
self._noise_manifest = utils.read_manifest(
manifest_path=noise_manifest_path)
def transform_audio(self, audio_segment):
"""Add background noise audio.
Note that this is an in-place transformation.
:param audio_segment: Audio segment to add effects to.
:type audio_segment: AudioSegmenet|SpeechSegment
"""
noise_json = self._rng.sample(self._noise_manifest, 1)[0]
if noise_json['duration'] < audio_segment.duration:
raise RuntimeError("The duration of sampled noise audio is smaller "
"than the audio segment to add effects to.")
diff_duration = noise_json['duration'] - audio_segment.duration
start = self._rng.uniform(0, diff_duration)
end = start + audio_segment.duration
noise_segment = AudioSegment.slice_from_file(
noise_json['audio_filepath'], start=start, end=end)
snr_dB = self._rng.uniform(self._min_snr_dB, self._max_snr_dB)
audio_segment.add_noise(
noise_segment, snr_dB, allow_downsampling=True, rng=self._rng)
文件模式从 100755 更改为 100644
...@@ -6,9 +6,11 @@ from __future__ import division ...@@ -6,9 +6,11 @@ from __future__ import division
from __future__ import print_function from __future__ import print_function
import random import random
import numpy as np import tarfile
import multiprocessing import multiprocessing
import numpy as np
import paddle.v2 as paddle import paddle.v2 as paddle
from threading import local
from data_utils import utils from data_utils import utils
from data_utils.augmentor.augmentation import AugmentationPipeline from data_utils.augmentor.augmentation import AugmentationPipeline
from data_utils.featurizer.speech_featurizer import SpeechFeaturizer from data_utils.featurizer.speech_featurizer import SpeechFeaturizer
...@@ -65,7 +67,7 @@ class DataGenerator(object): ...@@ -65,7 +67,7 @@ class DataGenerator(object):
max_freq=None, max_freq=None,
specgram_type='linear', specgram_type='linear',
use_dB_normalization=True, use_dB_normalization=True,
num_threads=multiprocessing.cpu_count(), num_threads=multiprocessing.cpu_count() // 2,
random_seed=0): random_seed=0):
self._max_duration = max_duration self._max_duration = max_duration
self._min_duration = min_duration self._min_duration = min_duration
...@@ -82,6 +84,27 @@ class DataGenerator(object): ...@@ -82,6 +84,27 @@ class DataGenerator(object):
self._num_threads = num_threads self._num_threads = num_threads
self._rng = random.Random(random_seed) self._rng = random.Random(random_seed)
self._epoch = 0 self._epoch = 0
# for caching tar files info
self.local_data = local()
self.local_data.tar2info = {}
self.local_data.tar2object = {}
def process_utterance(self, filename, transcript):
"""Load, augment, featurize and normalize for speech data.
:param filename: Audio filepath
:type filename: basestring | file
:param transcript: Transcription text.
:type transcript: basestring
:return: Tuple of audio feature tensor and list of token ids for
transcription.
:rtype: tuple of (2darray, list)
"""
speech_segment = SpeechSegment.from_file(filename, transcript)
self._augmentation_pipeline.transform_audio(speech_segment)
specgram, text_ids = self._speech_featurizer.featurize(speech_segment)
specgram = self._normalizer.apply(specgram)
return specgram, text_ids
def batch_reader_creator(self, def batch_reader_creator(self,
manifest_path, manifest_path,
...@@ -152,7 +175,7 @@ class DataGenerator(object): ...@@ -152,7 +175,7 @@ class DataGenerator(object):
manifest, batch_size, clipped=True) manifest, batch_size, clipped=True)
elif shuffle_method == "instance_shuffle": elif shuffle_method == "instance_shuffle":
self._rng.shuffle(manifest) self._rng.shuffle(manifest)
elif not shuffle_method: elif shuffle_method == None:
pass pass
else: else:
raise ValueError("Unknown shuffle method %s." % raise ValueError("Unknown shuffle method %s." %
...@@ -198,13 +221,37 @@ class DataGenerator(object): ...@@ -198,13 +221,37 @@ class DataGenerator(object):
""" """
return self._speech_featurizer.vocab_list return self._speech_featurizer.vocab_list
def _process_utterance(self, filename, transcript): def _parse_tar(self, file):
"""Load, augment, featurize and normalize for speech data.""" """Parse a tar file to get a tarfile object
speech_segment = SpeechSegment.from_file(filename, transcript) and a map containing tarinfoes
self._augmentation_pipeline.transform_audio(speech_segment) """
specgram, text_ids = self._speech_featurizer.featurize(speech_segment) result = {}
specgram = self._normalizer.apply(specgram) f = tarfile.open(file)
return specgram, text_ids for tarinfo in f.getmembers():
result[tarinfo.name] = tarinfo
return f, result
def _get_file_object(self, file):
"""Get file object by file path.
If file startwith tar, it will return a tar file object
and cached tar file info for next reading request.
It will return file directly, if the type of file is not str.
"""
if file.startswith('tar:'):
tarpath, filename = file.split(':', 1)[1].split('#', 1)
if 'tar2info' not in self.local_data.__dict__:
self.local_data.tar2info = {}
if 'tar2object' not in self.local_data.__dict__:
self.local_data.tar2object = {}
if tarpath not in self.local_data.tar2info:
object, infoes = self._parse_tar(tarpath)
self.local_data.tar2info[tarpath] = infoes
self.local_data.tar2object[tarpath] = object
return self.local_data.tar2object[tarpath].extractfile(
self.local_data.tar2info[tarpath][filename])
else:
return open(file, 'r')
def _instance_reader_creator(self, manifest): def _instance_reader_creator(self, manifest):
""" """
...@@ -220,7 +267,8 @@ class DataGenerator(object): ...@@ -220,7 +267,8 @@ class DataGenerator(object):
yield instance yield instance
def mapper(instance): def mapper(instance):
return self._process_utterance(instance["audio_filepath"], return self.process_utterance(
self._get_file_object(instance["audio_filepath"]),
instance["text"]) instance["text"])
return paddle.reader.xmap_readers( return paddle.reader.xmap_readers(
......
...@@ -6,13 +6,15 @@ from __future__ import print_function ...@@ -6,13 +6,15 @@ from __future__ import print_function
import numpy as np import numpy as np
from data_utils import utils from data_utils import utils
from data_utils.audio import AudioSegment from data_utils.audio import AudioSegment
from python_speech_features import mfcc
from python_speech_features import delta
class AudioFeaturizer(object): class AudioFeaturizer(object):
"""Audio featurizer, for extracting features from audio contents of """Audio featurizer, for extracting features from audio contents of
AudioSegment or SpeechSegment. AudioSegment or SpeechSegment.
Currently, it only supports feature type of linear spectrogram. Currently, it supports feature types of linear spectrogram and mfcc.
:param specgram_type: Specgram feature type. Options: 'linear'. :param specgram_type: Specgram feature type. Options: 'linear'.
:type specgram_type: str :type specgram_type: str
...@@ -20,9 +22,10 @@ class AudioFeaturizer(object): ...@@ -20,9 +22,10 @@ class AudioFeaturizer(object):
:type stride_ms: float :type stride_ms: float
:param window_ms: Window size (in milliseconds) for generating frames. :param window_ms: Window size (in milliseconds) for generating frames.
:type window_ms: float :type window_ms: float
:param max_freq: Used when specgram_type is 'linear', only FFT bins :param max_freq: When specgram_type is 'linear', only FFT bins
corresponding to frequencies between [0, max_freq] are corresponding to frequencies between [0, max_freq] are
returned. returned; when specgram_type is 'mfcc', max_feq is the
highest band edge of mel filters.
:types max_freq: None|float :types max_freq: None|float
:param target_sample_rate: Audio are resampled (if upsampling or :param target_sample_rate: Audio are resampled (if upsampling or
downsampling is allowed) to this before downsampling is allowed) to this before
...@@ -54,7 +57,7 @@ class AudioFeaturizer(object): ...@@ -54,7 +57,7 @@ class AudioFeaturizer(object):
def featurize(self, def featurize(self,
audio_segment, audio_segment,
allow_downsampling=True, allow_downsampling=True,
allow_upsamplling=True): allow_upsampling=True):
"""Extract audio features from AudioSegment or SpeechSegment. """Extract audio features from AudioSegment or SpeechSegment.
:param audio_segment: Audio/speech segment to extract features from. :param audio_segment: Audio/speech segment to extract features from.
...@@ -91,6 +94,9 @@ class AudioFeaturizer(object): ...@@ -91,6 +94,9 @@ class AudioFeaturizer(object):
return self._compute_linear_specgram( return self._compute_linear_specgram(
samples, sample_rate, self._stride_ms, self._window_ms, samples, sample_rate, self._stride_ms, self._window_ms,
self._max_freq) self._max_freq)
elif self._specgram_type == 'mfcc':
return self._compute_mfcc(samples, sample_rate, self._stride_ms,
self._window_ms, self._max_freq)
else: else:
raise ValueError("Unknown specgram_type %s. " raise ValueError("Unknown specgram_type %s. "
"Supported values: linear." % self._specgram_type) "Supported values: linear." % self._specgram_type)
...@@ -142,3 +148,39 @@ class AudioFeaturizer(object): ...@@ -142,3 +148,39 @@ class AudioFeaturizer(object):
# prepare fft frequency list # prepare fft frequency list
freqs = float(sample_rate) / window_size * np.arange(fft.shape[0]) freqs = float(sample_rate) / window_size * np.arange(fft.shape[0])
return fft, freqs return fft, freqs
def _compute_mfcc(self,
samples,
sample_rate,
stride_ms=10.0,
window_ms=20.0,
max_freq=None):
"""Compute mfcc from samples."""
if max_freq is None:
max_freq = sample_rate / 2
if max_freq > sample_rate / 2:
raise ValueError("max_freq must not be greater than half of "
"sample rate.")
if stride_ms > window_ms:
raise ValueError("Stride size must not be greater than "
"window size.")
# compute the 13 cepstral coefficients, and the first one is replaced
# by log(frame energy)
mfcc_feat = mfcc(
signal=samples,
samplerate=sample_rate,
winlen=0.001 * window_ms,
winstep=0.001 * stride_ms,
highfreq=max_freq)
# Deltas
d_mfcc_feat = delta(mfcc_feat, 2)
# Deltas-Deltas
dd_mfcc_feat = delta(d_mfcc_feat, 2)
# transpose
mfcc_feat = np.transpose(mfcc_feat)
d_mfcc_feat = np.transpose(d_mfcc_feat)
dd_mfcc_feat = np.transpose(dd_mfcc_feat)
# concat above three features
concat_mfcc_feat = np.concatenate(
(mfcc_feat, d_mfcc_feat, dd_mfcc_feat))
return concat_mfcc_feat
...@@ -11,23 +11,24 @@ class SpeechFeaturizer(object): ...@@ -11,23 +11,24 @@ class SpeechFeaturizer(object):
"""Speech featurizer, for extracting features from both audio and transcript """Speech featurizer, for extracting features from both audio and transcript
contents of SpeechSegment. contents of SpeechSegment.
Currently, for audio parts, it only supports feature type of linear Currently, for audio parts, it supports feature types of linear
spectrogram; for transcript parts, it only supports char-level tokenizing spectrogram and mfcc; for transcript parts, it only supports char-level
and conversion into a list of token indices. Note that the token indexing tokenizing and conversion into a list of token indices. Note that the
order follows the given vocabulary file. token indexing order follows the given vocabulary file.
:param vocab_filepath: Filepath to load vocabulary for token indices :param vocab_filepath: Filepath to load vocabulary for token indices
conversion. conversion.
:type specgram_type: basestring :type specgram_type: basestring
:param specgram_type: Specgram feature type. Options: 'linear'. :param specgram_type: Specgram feature type. Options: 'linear', 'mfcc'.
:type specgram_type: str :type specgram_type: str
:param stride_ms: Striding size (in milliseconds) for generating frames. :param stride_ms: Striding size (in milliseconds) for generating frames.
:type stride_ms: float :type stride_ms: float
:param window_ms: Window size (in milliseconds) for generating frames. :param window_ms: Window size (in milliseconds) for generating frames.
:type window_ms: float :type window_ms: float
:param max_freq: Used when specgram_type is 'linear', only FFT bins :param max_freq: When specgram_type is 'linear', only FFT bins
corresponding to frequencies between [0, max_freq] are corresponding to frequencies between [0, max_freq] are
returned. returned; when specgram_type is 'mfcc', max_freq is the
highest band edge of mel filters.
:types max_freq: None|float :types max_freq: None|float
:param target_sample_rate: Speech are resampled (if upsampling or :param target_sample_rate: Speech are resampled (if upsampling or
downsampling is allowed) to this before downsampling is allowed) to this before
......
...@@ -4,6 +4,7 @@ from __future__ import division ...@@ -4,6 +4,7 @@ from __future__ import division
from __future__ import print_function from __future__ import print_function
import os import os
import codecs
class TextFeaturizer(object): class TextFeaturizer(object):
...@@ -59,7 +60,7 @@ class TextFeaturizer(object): ...@@ -59,7 +60,7 @@ class TextFeaturizer(object):
def _load_vocabulary_from_file(self, vocab_filepath): def _load_vocabulary_from_file(self, vocab_filepath):
"""Load vocabulary from file.""" """Load vocabulary from file."""
vocab_lines = [] vocab_lines = []
with open(vocab_filepath, 'r') as file: with codecs.open(vocab_filepath, 'r', 'utf-8') as file:
vocab_lines.extend(file.readlines()) vocab_lines.extend(file.readlines())
vocab_list = [line[:-1] for line in vocab_lines] vocab_list = [line[:-1] for line in vocab_lines]
vocab_dict = dict( vocab_dict = dict(
......
...@@ -115,7 +115,7 @@ class SpeechSegment(AudioSegment): ...@@ -115,7 +115,7 @@ class SpeechSegment(AudioSegment):
speech file. speech file.
:rtype: SpeechSegment :rtype: SpeechSegment
""" """
audio = Audiosegment.slice_from_file(filepath, start, end) audio = AudioSegment.slice_from_file(filepath, start, end)
return cls(audio.samples, audio.sample_rate, transcript) return cls(audio.samples, audio.sample_rate, transcript)
@classmethod @classmethod
......
...@@ -4,6 +4,7 @@ from __future__ import division ...@@ -4,6 +4,7 @@ from __future__ import division
from __future__ import print_function from __future__ import print_function
import json import json
import codecs
def read_manifest(manifest_path, max_duration=float('inf'), min_duration=0.0): def read_manifest(manifest_path, max_duration=float('inf'), min_duration=0.0):
...@@ -23,7 +24,7 @@ def read_manifest(manifest_path, max_duration=float('inf'), min_duration=0.0): ...@@ -23,7 +24,7 @@ def read_manifest(manifest_path, max_duration=float('inf'), min_duration=0.0):
:raises IOError: If failed to parse the manifest. :raises IOError: If failed to parse the manifest.
""" """
manifest = [] manifest = []
for json_line in open(manifest_path): for json_line in codecs.open(manifest_path, 'r', 'utf-8'):
try: try:
json_data = json.loads(json_line) json_data = json.loads(json_line)
except Exception as e: except Exception as e:
......
...@@ -11,11 +11,12 @@ from __future__ import print_function ...@@ -11,11 +11,12 @@ from __future__ import print_function
import distutils.util import distutils.util
import os import os
import wget import sys
import tarfile import tarfile
import argparse import argparse
import soundfile import soundfile
import json import json
import codecs
from paddle.v2.dataset.common import md5file from paddle.v2.dataset.common import md5file
DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech') DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech')
...@@ -66,7 +67,7 @@ def download(url, md5sum, target_dir): ...@@ -66,7 +67,7 @@ def download(url, md5sum, target_dir):
filepath = os.path.join(target_dir, url.split("/")[-1]) filepath = os.path.join(target_dir, url.split("/")[-1])
if not (os.path.exists(filepath) and md5file(filepath) == md5sum): if not (os.path.exists(filepath) and md5file(filepath) == md5sum):
print("Downloading %s ..." % url) print("Downloading %s ..." % url)
wget.download(url, target_dir) os.system("wget -c " + url + " -P " + target_dir)
print("\nMD5 Chesksum %s ..." % filepath) print("\nMD5 Chesksum %s ..." % filepath)
if not md5file(filepath) == md5sum: if not md5file(filepath) == md5sum:
raise RuntimeError("MD5 checksum failed.") raise RuntimeError("MD5 checksum failed.")
...@@ -112,7 +113,7 @@ def create_manifest(data_dir, manifest_path): ...@@ -112,7 +113,7 @@ def create_manifest(data_dir, manifest_path):
'duration': duration, 'duration': duration,
'text': text 'text': text
})) }))
with open(manifest_path, 'w') as out_file: with codecs.open(manifest_path, 'w', 'utf-8') as out_file:
for line in json_lines: for line in json_lines:
out_file.write(line + '\n') out_file.write(line + '\n')
......
"""Prepare CHiME3 background data.
Download, unpack and create manifest files.
Manifest file is a json-format file with each line containing the
meta data (i.e. audio filepath, transcript and audio duration)
of each audio file in the data set.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import distutils.util
import os
import wget
import zipfile
import argparse
import soundfile
import json
from paddle.v2.dataset.common import md5file
DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech')
URL = "https://d4s.myairbridge.com/packagev2/AG0Y3DNBE5IWRRTV/?dlid=W19XG7T0NNHB027139H0EQ"
MD5 = "c3ff512618d7a67d4f85566ea1bc39ec"
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--target_dir",
default=DATA_HOME + "/chime3_background",
type=str,
help="Directory to save the dataset. (default: %(default)s)")
parser.add_argument(
"--manifest_filepath",
default="manifest.chime3.background",
type=str,
help="Filepath for output manifests. (default: %(default)s)")
args = parser.parse_args()
def download(url, md5sum, target_dir, filename=None):
"""Download file from url to target_dir, and check md5sum."""
if filename == None:
filename = url.split("/")[-1]
if not os.path.exists(target_dir): os.makedirs(target_dir)
filepath = os.path.join(target_dir, filename)
if not (os.path.exists(filepath) and md5file(filepath) == md5sum):
print("Downloading %s ..." % url)
wget.download(url, target_dir)
print("\nMD5 Chesksum %s ..." % filepath)
if not md5file(filepath) == md5sum:
raise RuntimeError("MD5 checksum failed.")
else:
print("File exists, skip downloading. (%s)" % filepath)
return filepath
def unpack(filepath, target_dir):
"""Unpack the file to the target_dir."""
print("Unpacking %s ..." % filepath)
if filepath.endswith('.zip'):
zip = zipfile.ZipFile(filepath, 'r')
zip.extractall(target_dir)
zip.close()
elif filepath.endswith('.tar') or filepath.endswith('.tar.gz'):
tar = zipfile.open(filepath)
tar.extractall(target_dir)
tar.close()
else:
raise ValueError("File format is not supported for unpacking.")
def create_manifest(data_dir, manifest_path):
"""Create a manifest json file summarizing the data set, with each line
containing the meta data (i.e. audio filepath, transcription text, audio
duration) of each audio file within the data set.
"""
print("Creating manifest %s ..." % manifest_path)
json_lines = []
for subfolder, _, filelist in sorted(os.walk(data_dir)):
for filename in filelist:
if filename.endswith('.wav'):
filepath = os.path.join(data_dir, subfolder, filename)
audio_data, samplerate = soundfile.read(filepath)
duration = float(len(audio_data)) / samplerate
json_lines.append(
json.dumps({
'audio_filepath': filepath,
'duration': duration,
'text': ''
}))
with open(manifest_path, 'w') as out_file:
for line in json_lines:
out_file.write(line + '\n')
def prepare_chime3(url, md5sum, target_dir, manifest_path):
"""Download, unpack and create summmary manifest file."""
if not os.path.exists(os.path.join(target_dir, "CHiME3")):
# download
filepath = download(url, md5sum, target_dir,
"myairbridge-AG0Y3DNBE5IWRRTV.zip")
# unpack
unpack(filepath, target_dir)
unpack(
os.path.join(target_dir, 'CHiME3_background_bus.zip'), target_dir)
unpack(
os.path.join(target_dir, 'CHiME3_background_caf.zip'), target_dir)
unpack(
os.path.join(target_dir, 'CHiME3_background_ped.zip'), target_dir)
unpack(
os.path.join(target_dir, 'CHiME3_background_str.zip'), target_dir)
else:
print("Skip downloading and unpacking. Data already exists in %s." %
target_dir)
# create manifest json file
create_manifest(target_dir, manifest_path)
def main():
prepare_chime3(
url=URL,
md5sum=MD5,
target_dir=args.target_dir,
manifest_path=args.manifest_filepath)
if __name__ == '__main__':
main()
cd noise
python chime3_background.py
if [ $? -ne 0 ]; then
echo "Prepare CHiME3 background noise failed. Terminated."
exit 1
fi
cd -
cat noise/manifest.* > manifest.noise
echo "All done."
...@@ -205,9 +205,9 @@ def ctc_beam_search_decoder_batch(probs_split, ...@@ -205,9 +205,9 @@ def ctc_beam_search_decoder_batch(probs_split,
:type num_processes: int :type num_processes: int
:param cutoff_prob: Cutoff probability in pruning, :param cutoff_prob: Cutoff probability in pruning,
default 1.0, no pruning. default 1.0, no pruning.
:type cutoff_prob: float
:param num_processes: Number of parallel processes. :param num_processes: Number of parallel processes.
:type num_processes: int :type num_processes: int
:type cutoff_prob: float
:param ext_scoring_func: External scoring function for :param ext_scoring_func: External scoring function for
partially decoded sentence, e.g. word count partially decoded sentence, e.g. word count
or language model. or language model.
......
"""Client-end for the ASR demo."""
from pynput import keyboard
import struct
import socket
import sys
import argparse
import pyaudio
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--host_ip",
default="localhost",
type=str,
help="Server IP address. (default: %(default)s)")
parser.add_argument(
"--host_port",
default=8086,
type=int,
help="Server Port. (default: %(default)s)")
args = parser.parse_args()
is_recording = False
enable_trigger_record = True
def on_press(key):
"""On-press keyboard callback function."""
global is_recording, enable_trigger_record
if key == keyboard.Key.space:
if (not is_recording) and enable_trigger_record:
sys.stdout.write("Start Recording ... ")
sys.stdout.flush()
is_recording = True
def on_release(key):
"""On-release keyboard callback function."""
global is_recording, enable_trigger_record
if key == keyboard.Key.esc:
return False
elif key == keyboard.Key.space:
if is_recording == True:
is_recording = False
data_list = []
def callback(in_data, frame_count, time_info, status):
"""Audio recorder's stream callback function."""
global data_list, is_recording, enable_trigger_record
if is_recording:
data_list.append(in_data)
enable_trigger_record = False
elif len(data_list) > 0:
# Connect to server and send data
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((args.host_ip, args.host_port))
sent = ''.join(data_list)
sock.sendall(struct.pack('>i', len(sent)) + sent)
print('Speech[length=%d] Sent.' % len(sent))
# Receive data from the server and shut down
received = sock.recv(1024)
print "Recognition Results: {}".format(received)
sock.close()
data_list = []
enable_trigger_record = True
return (in_data, pyaudio.paContinue)
def main():
# prepare audio recorder
p = pyaudio.PyAudio()
stream = p.open(
format=pyaudio.paInt32,
channels=1,
rate=16000,
input=True,
stream_callback=callback)
stream.start_stream()
# prepare keyboard listener
with keyboard.Listener(
on_press=on_press, on_release=on_release) as listener:
listener.join()
# close up
stream.stop_stream()
stream.close()
p.terminate()
if __name__ == "__main__":
main()
"""Server-end for the ASR demo."""
import os
import time
import random
import argparse
import distutils.util
from time import gmtime, strftime
import SocketServer
import struct
import wave
import paddle.v2 as paddle
from utils import print_arguments
from data_utils.data import DataGenerator
from model import DeepSpeech2Model
from data_utils.utils import read_manifest
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--host_ip",
default="localhost",
type=str,
help="Server IP address. (default: %(default)s)")
parser.add_argument(
"--host_port",
default=8086,
type=int,
help="Server Port. (default: %(default)s)")
parser.add_argument(
"--speech_save_dir",
default="demo_cache",
type=str,
help="Directory for saving demo speech. (default: %(default)s)")
parser.add_argument(
"--vocab_filepath",
default='datasets/vocab/eng_vocab.txt',
type=str,
help="Vocabulary filepath. (default: %(default)s)")
parser.add_argument(
"--mean_std_filepath",
default='mean_std.npz',
type=str,
help="Manifest path for normalizer. (default: %(default)s)")
parser.add_argument(
"--warmup_manifest_path",
default='datasets/manifest.test',
type=str,
help="Manifest path for warmup test. (default: %(default)s)")
parser.add_argument(
"--specgram_type",
default='linear',
type=str,
help="Feature type of audio data: 'linear' (power spectrum)"
" or 'mfcc'. (default: %(default)s)")
parser.add_argument(
"--num_conv_layers",
default=2,
type=int,
help="Convolution layer number. (default: %(default)s)")
parser.add_argument(
"--num_rnn_layers",
default=3,
type=int,
help="RNN layer number. (default: %(default)s)")
parser.add_argument(
"--rnn_layer_size",
default=512,
type=int,
help="RNN layer cell number. (default: %(default)s)")
parser.add_argument(
"--use_gpu",
default=True,
type=distutils.util.strtobool,
help="Use gpu or not. (default: %(default)s)")
parser.add_argument(
"--model_filepath",
default='checkpoints/params.latest.tar.gz',
type=str,
help="Model filepath. (default: %(default)s)")
parser.add_argument(
"--decode_method",
default='beam_search',
type=str,
help="Method for ctc decoding: best_path or beam_search. "
"(default: %(default)s)")
parser.add_argument(
"--beam_size",
default=100,
type=int,
help="Width for beam search decoding. (default: %(default)d)")
parser.add_argument(
"--language_model_path",
default="lm/data/common_crawl_00.prune01111.trie.klm",
type=str,
help="Path for language model. (default: %(default)s)")
parser.add_argument(
"--alpha",
default=0.36,
type=float,
help="Parameter associated with language model. (default: %(default)f)")
parser.add_argument(
"--beta",
default=0.25,
type=float,
help="Parameter associated with word count. (default: %(default)f)")
parser.add_argument(
"--cutoff_prob",
default=0.99,
type=float,
help="The cutoff probability of pruning"
"in beam search. (default: %(default)f)")
args = parser.parse_args()
class AsrTCPServer(SocketServer.TCPServer):
"""The ASR TCP Server."""
def __init__(self,
server_address,
RequestHandlerClass,
speech_save_dir,
audio_process_handler,
bind_and_activate=True):
self.speech_save_dir = speech_save_dir
self.audio_process_handler = audio_process_handler
SocketServer.TCPServer.__init__(
self, server_address, RequestHandlerClass, bind_and_activate=True)
class AsrRequestHandler(SocketServer.BaseRequestHandler):
"""The ASR request handler."""
def handle(self):
# receive data through TCP socket
chunk = self.request.recv(1024)
target_len = struct.unpack('>i', chunk[:4])[0]
data = chunk[4:]
while len(data) < target_len:
chunk = self.request.recv(1024)
data += chunk
# write to file
filename = self._write_to_file(data)
print("Received utterance[length=%d] from %s, saved to %s." %
(len(data), self.client_address[0], filename))
start_time = time.time()
transcript = self.server.audio_process_handler(filename)
finish_time = time.time()
print("Response Time: %f, Transcript: %s" %
(finish_time - start_time, transcript))
self.request.sendall(transcript)
def _write_to_file(self, data):
# prepare save dir and filename
if not os.path.exists(self.server.speech_save_dir):
os.mkdir(self.server.speech_save_dir)
timestamp = strftime("%Y%m%d%H%M%S", gmtime())
out_filename = os.path.join(
self.server.speech_save_dir,
timestamp + "_" + self.client_address[0] + ".wav")
# write to wav file
file = wave.open(out_filename, 'wb')
file.setnchannels(1)
file.setsampwidth(4)
file.setframerate(16000)
file.writeframes(data)
file.close()
return out_filename
def warm_up_test(audio_process_handler,
manifest_path,
num_test_cases,
random_seed=0):
"""Warming-up test."""
manifest = read_manifest(manifest_path)
rng = random.Random(random_seed)
samples = rng.sample(manifest, num_test_cases)
for idx, sample in enumerate(samples):
print("Warm-up Test Case %d: %s", idx, sample['audio_filepath'])
start_time = time.time()
transcript = audio_process_handler(sample['audio_filepath'])
finish_time = time.time()
print("Response Time: %f, Transcript: %s" %
(finish_time - start_time, transcript))
def start_server():
"""Start the ASR server"""
# prepare data generator
data_generator = DataGenerator(
vocab_filepath=args.vocab_filepath,
mean_std_filepath=args.mean_std_filepath,
augmentation_config='{}',
specgram_type=args.specgram_type,
num_threads=1)
# prepare ASR model
ds2_model = DeepSpeech2Model(
vocab_size=data_generator.vocab_size,
num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers,
rnn_layer_size=args.rnn_layer_size,
pretrained_model_path=args.model_filepath)
# prepare ASR inference handler
def file_to_transcript(filename):
feature = data_generator.process_utterance(filename, "")
result_transcript = ds2_model.infer_batch(
infer_data=[feature],
decode_method=args.decode_method,
beam_alpha=args.alpha,
beam_beta=args.beta,
beam_size=args.beam_size,
cutoff_prob=args.cutoff_prob,
vocab_list=data_generator.vocab_list,
language_model_path=args.language_model_path,
num_processes=1)
return result_transcript[0]
# warming up with utterrances sampled from Librispeech
print('-----------------------------------------------------------')
print('Warming up ...')
warm_up_test(
audio_process_handler=file_to_transcript,
manifest_path=args.warmup_manifest_path,
num_test_cases=3)
print('-----------------------------------------------------------')
# start the server
server = AsrTCPServer(
server_address=(args.host_ip, args.host_port),
RequestHandlerClass=AsrRequestHandler,
speech_save_dir=args.speech_save_dir,
audio_process_handler=file_to_transcript)
print("ASR Server Started.")
server.serve_forever()
def main():
print_arguments(args)
paddle.init(use_gpu=args.use_gpu, trainer_count=1)
start_server()
if __name__ == "__main__":
main()
...@@ -10,43 +10,50 @@ import numpy as np ...@@ -10,43 +10,50 @@ import numpy as np
def _levenshtein_distance(ref, hyp): def _levenshtein_distance(ref, hyp):
"""Levenshtein distance is a string metric for measuring the difference between """Levenshtein distance is a string metric for measuring the difference
two sequences. Informally, the levenshtein disctance is defined as the minimum between two sequences. Informally, the levenshtein disctance is defined as
number of single-character edits (substitutions, insertions or deletions) the minimum number of single-character edits (substitutions, insertions or
required to change one word into the other. We can naturally extend the edits to deletions) required to change one word into the other. We can naturally
word level when calculate levenshtein disctance for two sentences. extend the edits to word level when calculate levenshtein disctance for
two sentences.
""" """
ref_len = len(ref) m = len(ref)
hyp_len = len(hyp) n = len(hyp)
# special case # special case
if ref == hyp: if ref == hyp:
return 0 return 0
if ref_len == 0: if m == 0:
return hyp_len return n
if hyp_len == 0: if n == 0:
return ref_len return m
distance = np.zeros((ref_len + 1, hyp_len + 1), dtype=np.int32) if m < n:
ref, hyp = hyp, ref
m, n = n, m
# use O(min(m, n)) space
distance = np.zeros((2, n + 1), dtype=np.int32)
# initialize distance matrix # initialize distance matrix
for j in xrange(hyp_len + 1): for j in xrange(n + 1):
distance[0][j] = j distance[0][j] = j
for i in xrange(ref_len + 1):
distance[i][0] = i
# calculate levenshtein distance # calculate levenshtein distance
for i in xrange(1, ref_len + 1): for i in xrange(1, m + 1):
for j in xrange(1, hyp_len + 1): prev_row_idx = (i - 1) % 2
cur_row_idx = i % 2
distance[cur_row_idx][0] = i
for j in xrange(1, n + 1):
if ref[i - 1] == hyp[j - 1]: if ref[i - 1] == hyp[j - 1]:
distance[i][j] = distance[i - 1][j - 1] distance[cur_row_idx][j] = distance[prev_row_idx][j - 1]
else: else:
s_num = distance[i - 1][j - 1] + 1 s_num = distance[prev_row_idx][j - 1] + 1
i_num = distance[i][j - 1] + 1 i_num = distance[cur_row_idx][j - 1] + 1
d_num = distance[i - 1][j] + 1 d_num = distance[prev_row_idx][j] + 1
distance[i][j] = min(s_num, i_num, d_num) distance[cur_row_idx][j] = min(s_num, i_num, d_num)
return distance[ref_len][hyp_len] return distance[m % 2][n]
def wer(reference, hypothesis, ignore_case=False, delimiter=' '): def wer(reference, hypothesis, ignore_case=False, delimiter=' '):
...@@ -65,8 +72,8 @@ def wer(reference, hypothesis, ignore_case=False, delimiter=' '): ...@@ -65,8 +72,8 @@ def wer(reference, hypothesis, ignore_case=False, delimiter=' '):
Iw is the number of words inserted, Iw is the number of words inserted,
Nw is the number of words in the reference Nw is the number of words in the reference
We can use levenshtein distance to calculate WER. Please draw an attention that We can use levenshtein distance to calculate WER. Please draw an attention
empty items will be removed when splitting sentences by delimiter. that empty items will be removed when splitting sentences by delimiter.
:param reference: The reference sentence. :param reference: The reference sentence.
:type reference: basestring :type reference: basestring
...@@ -95,7 +102,7 @@ def wer(reference, hypothesis, ignore_case=False, delimiter=' '): ...@@ -95,7 +102,7 @@ def wer(reference, hypothesis, ignore_case=False, delimiter=' '):
return wer return wer
def cer(reference, hypothesis, ignore_case=False): def cer(reference, hypothesis, ignore_case=False, remove_space=False):
"""Calculate charactor error rate (CER). CER compares reference text and """Calculate charactor error rate (CER). CER compares reference text and
hypothesis text in char-level. CER is defined as: hypothesis text in char-level. CER is defined as:
...@@ -113,8 +120,8 @@ def cer(reference, hypothesis, ignore_case=False): ...@@ -113,8 +120,8 @@ def cer(reference, hypothesis, ignore_case=False):
We can use levenshtein distance to calculate CER. Chinese input should be We can use levenshtein distance to calculate CER. Chinese input should be
encoded to unicode. Please draw an attention that the leading and tailing encoded to unicode. Please draw an attention that the leading and tailing
white space characters will be truncated and multiple consecutive white space characters will be truncated and multiple consecutive space
space characters in a sentence will be replaced by one white space character. characters in a sentence will be replaced by one space character.
:param reference: The reference sentence. :param reference: The reference sentence.
:type reference: basestring :type reference: basestring
...@@ -122,6 +129,8 @@ def cer(reference, hypothesis, ignore_case=False): ...@@ -122,6 +129,8 @@ def cer(reference, hypothesis, ignore_case=False):
:type hypothesis: basestring :type hypothesis: basestring
:param ignore_case: Whether case-sensitive or not. :param ignore_case: Whether case-sensitive or not.
:type ignore_case: bool :type ignore_case: bool
:param remove_space: Whether remove internal space characters
:type remove_space: bool
:return: Character error rate. :return: Character error rate.
:rtype: float :rtype: float
:raises ValueError: If the reference length is zero. :raises ValueError: If the reference length is zero.
...@@ -130,8 +139,12 @@ def cer(reference, hypothesis, ignore_case=False): ...@@ -130,8 +139,12 @@ def cer(reference, hypothesis, ignore_case=False):
reference = reference.lower() reference = reference.lower()
hypothesis = hypothesis.lower() hypothesis = hypothesis.lower()
reference = ' '.join(filter(None, reference.split(' '))) join_char = ' '
hypothesis = ' '.join(filter(None, hypothesis.split(' '))) if remove_space == True:
join_char = ''
reference = join_char.join(filter(None, reference.split(' ')))
hypothesis = join_char.join(filter(None, hypothesis.split(' ')))
if len(reference) == 0: if len(reference) == 0:
raise ValueError("Length of reference should be greater than 0.") raise ValueError("Length of reference should be greater than 0.")
......
...@@ -5,20 +5,24 @@ from __future__ import print_function ...@@ -5,20 +5,24 @@ from __future__ import print_function
import distutils.util import distutils.util
import argparse import argparse
import gzip import multiprocessing
import paddle.v2 as paddle import paddle.v2 as paddle
from data_utils.data import DataGenerator from data_utils.data import DataGenerator
from model import deep_speech2 from model import DeepSpeech2Model
from decoder import * from error_rate import wer, cer
from lm.lm_scorer import LmScorer import utils
from error_rate import wer
parser = argparse.ArgumentParser(description=__doc__) parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument( parser.add_argument(
"--batch_size", "--batch_size",
default=100, default=128,
type=int, type=int,
help="Minibatch size for evaluation. (default: %(default)s)") help="Minibatch size for evaluation. (default: %(default)s)")
parser.add_argument(
"--trainer_count",
default=8,
type=int,
help="Trainer number. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--num_conv_layers", "--num_conv_layers",
default=2, default=2,
...@@ -41,12 +45,12 @@ parser.add_argument( ...@@ -41,12 +45,12 @@ parser.add_argument(
help="Use gpu or not. (default: %(default)s)") help="Use gpu or not. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--num_threads_data", "--num_threads_data",
default=multiprocessing.cpu_count(), default=multiprocessing.cpu_count() // 2,
type=int, type=int,
help="Number of cpu threads for preprocessing data. (default: %(default)s)") help="Number of cpu threads for preprocessing data. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--num_processes_beam_search", "--num_processes_beam_search",
default=multiprocessing.cpu_count(), default=multiprocessing.cpu_count() // 2,
type=int, type=int,
help="Number of cpu processes for beam search. (default: %(default)s)") help="Number of cpu processes for beam search. (default: %(default)s)")
parser.add_argument( parser.add_argument(
...@@ -58,21 +62,21 @@ parser.add_argument( ...@@ -58,21 +62,21 @@ parser.add_argument(
"--decode_method", "--decode_method",
default='beam_search', default='beam_search',
type=str, type=str,
help="Method for ctc decoding, best_path or beam_search. (default: %(default)s)" help="Method for ctc decoding, best_path or beam_search. "
) "(default: %(default)s)")
parser.add_argument( parser.add_argument(
"--language_model_path", "--language_model_path",
default="lm/data/1Billion.klm", default="lm/data/common_crawl_00.prune01111.trie.klm",
type=str, type=str,
help="Path for language model. (default: %(default)s)") help="Path for language model. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--alpha", "--alpha",
default=0.26, default=0.36,
type=float, type=float,
help="Parameter associated with language model. (default: %(default)f)") help="Parameter associated with language model. (default: %(default)f)")
parser.add_argument( parser.add_argument(
"--beta", "--beta",
default=0.1, default=0.25,
type=float, type=float,
help="Parameter associated with word count. (default: %(default)f)") help="Parameter associated with word count. (default: %(default)f)")
parser.add_argument( parser.add_argument(
...@@ -86,6 +90,12 @@ parser.add_argument( ...@@ -86,6 +90,12 @@ parser.add_argument(
default=500, default=500,
type=int, type=int,
help="Width for beam search decoding. (default: %(default)d)") help="Width for beam search decoding. (default: %(default)d)")
parser.add_argument(
"--specgram_type",
default='linear',
type=str,
help="Feature type of audio data: 'linear' (power spectrum)"
" or 'mfcc'. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--decode_manifest_path", "--decode_manifest_path",
default='datasets/manifest.test', default='datasets/manifest.test',
...@@ -101,102 +111,68 @@ parser.add_argument( ...@@ -101,102 +111,68 @@ parser.add_argument(
default='datasets/vocab/eng_vocab.txt', default='datasets/vocab/eng_vocab.txt',
type=str, type=str,
help="Vocabulary filepath. (default: %(default)s)") help="Vocabulary filepath. (default: %(default)s)")
parser.add_argument(
"--error_rate_type",
default='wer',
choices=['wer', 'cer'],
type=str,
help="Error rate type for evaluation. 'wer' for word error rate and 'cer' "
"for character error rate. "
"(default: %(default)s)")
args = parser.parse_args() args = parser.parse_args()
def evaluate(): def evaluate():
"""Evaluate on whole test data for DeepSpeech2.""" """Evaluate on whole test data for DeepSpeech2."""
# initialize data generator
data_generator = DataGenerator( data_generator = DataGenerator(
vocab_filepath=args.vocab_filepath, vocab_filepath=args.vocab_filepath,
mean_std_filepath=args.mean_std_filepath, mean_std_filepath=args.mean_std_filepath,
augmentation_config='{}', augmentation_config='{}',
specgram_type=args.specgram_type,
num_threads=args.num_threads_data) num_threads=args.num_threads_data)
# create network config
# paddle.data_type.dense_array is used for variable batch input.
# The size 161 * 161 is only an placeholder value and the real shape
# of input batch data will be induced during training.
audio_data = paddle.layer.data(
name="audio_spectrogram", type=paddle.data_type.dense_array(161 * 161))
text_data = paddle.layer.data(
name="transcript_text",
type=paddle.data_type.integer_value_sequence(data_generator.vocab_size))
output_probs = deep_speech2(
audio_data=audio_data,
text_data=text_data,
dict_size=data_generator.vocab_size,
num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers,
rnn_size=args.rnn_layer_size,
is_inference=True)
# load parameters
parameters = paddle.parameters.Parameters.from_tar(
gzip.open(args.model_filepath))
# prepare infer data
batch_reader = data_generator.batch_reader_creator( batch_reader = data_generator.batch_reader_creator(
manifest_path=args.decode_manifest_path, manifest_path=args.decode_manifest_path,
batch_size=args.batch_size, batch_size=args.batch_size,
min_batch_size=1,
sortagrad=False, sortagrad=False,
shuffle_method=None) shuffle_method=None)
# define inferer ds2_model = DeepSpeech2Model(
inferer = paddle.inference.Inference( vocab_size=data_generator.vocab_size,
output_layer=output_probs, parameters=parameters) num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers,
# initialize external scorer for beam search decoding rnn_layer_size=args.rnn_layer_size,
if args.decode_method == 'beam_search': pretrained_model_path=args.model_filepath)
ext_scorer = LmScorer(args.alpha, args.beta, args.language_model_path)
wer_counter, wer_sum = 0, 0.0 error_rate_func = cer if args.error_rate_type == 'cer' else wer
error_sum, num_ins = 0.0, 0
for infer_data in batch_reader(): for infer_data in batch_reader():
# run inference result_transcripts = ds2_model.infer_batch(
infer_results = inferer.infer(input=infer_data) infer_data=infer_data,
num_steps = len(infer_results) // len(infer_data) decode_method=args.decode_method,
probs_split = [ beam_alpha=args.alpha,
infer_results[i * num_steps:(i + 1) * num_steps] beam_beta=args.beta,
for i in xrange(0, len(infer_data))
]
# target transcription
target_transcription = [
''.join([
data_generator.vocab_list[index] for index in infer_data[i][1]
]) for i, probs in enumerate(probs_split)
]
# decode and print
# best path decode
if args.decode_method == "best_path":
for i, probs in enumerate(probs_split):
output_transcription = ctc_best_path_decoder(
probs_seq=probs, vocabulary=data_generator.vocab_list)
wer_sum += wer(target_transcription[i], output_transcription)
wer_counter += 1
# beam search decode
elif args.decode_method == "beam_search":
# beam search using multiple processes
beam_search_results = ctc_beam_search_decoder_batch(
probs_split=probs_split,
vocabulary=data_generator.vocab_list,
beam_size=args.beam_size, beam_size=args.beam_size,
blank_id=len(data_generator.vocab_list), cutoff_prob=args.cutoff_prob,
num_processes=args.num_processes_beam_search, vocab_list=data_generator.vocab_list,
ext_scoring_func=ext_scorer, language_model_path=args.language_model_path,
cutoff_prob=args.cutoff_prob, ) num_processes=args.num_processes_beam_search)
for i, beam_search_result in enumerate(beam_search_results): target_transcripts = [
wer_sum += wer(target_transcription[i], ''.join([data_generator.vocab_list[token] for token in transcript])
beam_search_result[0][1]) for _, transcript in infer_data
wer_counter += 1 ]
else: for target, result in zip(target_transcripts, result_transcripts):
raise ValueError("Decoding method [%s] is not supported." % error_sum += error_rate_func(target, result)
decode_method) num_ins += 1
print("Error rate [%s] (%d/?) = %f" %
print("Final WER = %f" % (wer_sum / wer_counter)) (args.error_rate_type, num_ins, error_sum / num_ins))
print("Final error rate [%s] (%d/%d) = %f" %
(args.error_rate_type, num_ins, num_ins, error_sum / num_ins))
def main(): def main():
paddle.init(use_gpu=args.use_gpu, trainer_count=1) utils.print_arguments(args)
paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count)
evaluate() evaluate()
......
...@@ -4,15 +4,12 @@ from __future__ import division ...@@ -4,15 +4,12 @@ from __future__ import division
from __future__ import print_function from __future__ import print_function
import argparse import argparse
import gzip
import distutils.util import distutils.util
import multiprocessing import multiprocessing
import paddle.v2 as paddle import paddle.v2 as paddle
from data_utils.data import DataGenerator from data_utils.data import DataGenerator
from model import deep_speech2 from model import DeepSpeech2Model
from decoder import * from error_rate import wer, cer
from lm.lm_scorer import LmScorer
from error_rate import wer
import utils import utils
parser = argparse.ArgumentParser(description=__doc__) parser = argparse.ArgumentParser(description=__doc__)
...@@ -43,14 +40,25 @@ parser.add_argument( ...@@ -43,14 +40,25 @@ parser.add_argument(
help="Use gpu or not. (default: %(default)s)") help="Use gpu or not. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--num_threads_data", "--num_threads_data",
default=multiprocessing.cpu_count(), default=1,
type=int, type=int,
help="Number of cpu threads for preprocessing data. (default: %(default)s)") help="Number of cpu threads for preprocessing data. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--num_processes_beam_search", "--num_processes_beam_search",
default=multiprocessing.cpu_count(), default=multiprocessing.cpu_count() // 2,
type=int, type=int,
help="Number of cpu processes for beam search. (default: %(default)s)") help="Number of cpu processes for beam search. (default: %(default)s)")
parser.add_argument(
"--specgram_type",
default='linear',
type=str,
help="Feature type of audio data: 'linear' (power spectrum)"
" or 'mfcc'. (default: %(default)s)")
parser.add_argument(
"--trainer_count",
default=8,
type=int,
help="Trainer number. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--mean_std_filepath", "--mean_std_filepath",
default='mean_std.npz', default='mean_std.npz',
...@@ -75,31 +83,26 @@ parser.add_argument( ...@@ -75,31 +83,26 @@ parser.add_argument(
"--decode_method", "--decode_method",
default='beam_search', default='beam_search',
type=str, type=str,
help="Method for ctc decoding: best_path or beam_search. (default: %(default)s)" help="Method for ctc decoding: best_path or beam_search. "
) "(default: %(default)s)")
parser.add_argument( parser.add_argument(
"--beam_size", "--beam_size",
default=500, default=500,
type=int, type=int,
help="Width for beam search decoding. (default: %(default)d)") help="Width for beam search decoding. (default: %(default)d)")
parser.add_argument(
"--num_results_per_sample",
default=1,
type=int,
help="Number of output per sample in beam search. (default: %(default)d)")
parser.add_argument( parser.add_argument(
"--language_model_path", "--language_model_path",
default="lm/data/1Billion.klm", default="lm/data/common_crawl_00.prune01111.trie.klm",
type=str, type=str,
help="Path for language model. (default: %(default)s)") help="Path for language model. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--alpha", "--alpha",
default=0.26, default=0.36,
type=float, type=float,
help="Parameter associated with language model. (default: %(default)f)") help="Parameter associated with language model. (default: %(default)f)")
parser.add_argument( parser.add_argument(
"--beta", "--beta",
default=0.1, default=0.25,
type=float, type=float,
help="Parameter associated with word count. (default: %(default)f)") help="Parameter associated with word count. (default: %(default)f)")
parser.add_argument( parser.add_argument(
...@@ -108,41 +111,25 @@ parser.add_argument( ...@@ -108,41 +111,25 @@ parser.add_argument(
type=float, type=float,
help="The cutoff probability of pruning" help="The cutoff probability of pruning"
"in beam search. (default: %(default)f)") "in beam search. (default: %(default)f)")
parser.add_argument(
"--error_rate_type",
default='wer',
choices=['wer', 'cer'],
type=str,
help="Error rate type for evaluation. 'wer' for word error rate and 'cer' "
"for character error rate. "
"(default: %(default)s)")
args = parser.parse_args() args = parser.parse_args()
def infer(): def infer():
"""Inference for DeepSpeech2.""" """Inference for DeepSpeech2."""
# initialize data generator
data_generator = DataGenerator( data_generator = DataGenerator(
vocab_filepath=args.vocab_filepath, vocab_filepath=args.vocab_filepath,
mean_std_filepath=args.mean_std_filepath, mean_std_filepath=args.mean_std_filepath,
augmentation_config='{}', augmentation_config='{}',
specgram_type=args.specgram_type,
num_threads=args.num_threads_data) num_threads=args.num_threads_data)
# create network config
# paddle.data_type.dense_array is used for variable batch input.
# The size 161 * 161 is only an placeholder value and the real shape
# of input batch data will be induced during training.
audio_data = paddle.layer.data(
name="audio_spectrogram", type=paddle.data_type.dense_array(161 * 161))
text_data = paddle.layer.data(
name="transcript_text",
type=paddle.data_type.integer_value_sequence(data_generator.vocab_size))
output_probs = deep_speech2(
audio_data=audio_data,
text_data=text_data,
dict_size=data_generator.vocab_size,
num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers,
rnn_size=args.rnn_layer_size,
is_inference=True)
# load parameters
parameters = paddle.parameters.Parameters.from_tar(
gzip.open(args.model_filepath))
# prepare infer data
batch_reader = data_generator.batch_reader_creator( batch_reader = data_generator.batch_reader_creator(
manifest_path=args.decode_manifest_path, manifest_path=args.decode_manifest_path,
batch_size=args.num_samples, batch_size=args.num_samples,
...@@ -151,66 +138,38 @@ def infer(): ...@@ -151,66 +138,38 @@ def infer():
shuffle_method=None) shuffle_method=None)
infer_data = batch_reader().next() infer_data = batch_reader().next()
# run inference ds2_model = DeepSpeech2Model(
infer_results = paddle.infer( vocab_size=data_generator.vocab_size,
output_layer=output_probs, parameters=parameters, input=infer_data) num_conv_layers=args.num_conv_layers,
num_steps = len(infer_results) // len(infer_data) num_rnn_layers=args.num_rnn_layers,
probs_split = [ rnn_layer_size=args.rnn_layer_size,
infer_results[i * num_steps:(i + 1) * num_steps] pretrained_model_path=args.model_filepath)
for i in xrange(len(infer_data)) result_transcripts = ds2_model.infer_batch(
] infer_data=infer_data,
decode_method=args.decode_method,
beam_alpha=args.alpha,
beam_beta=args.beta,
beam_size=args.beam_size,
cutoff_prob=args.cutoff_prob,
vocab_list=data_generator.vocab_list,
language_model_path=args.language_model_path,
num_processes=args.num_processes_beam_search)
# targe transcription error_rate_func = cer if args.error_rate_type == 'cer' else wer
target_transcription = [ target_transcripts = [
''.join( ''.join([data_generator.vocab_list[token] for token in transcript])
[data_generator.vocab_list[index] for index in infer_data[i][1]]) for _, transcript in infer_data
for i, probs in enumerate(probs_split)
] ]
for target, result in zip(target_transcripts, result_transcripts):
## decode and print
# best path decode
wer_sum, wer_counter = 0, 0
if args.decode_method == "best_path":
for i, probs in enumerate(probs_split):
best_path_transcription = ctc_best_path_decoder(
probs_seq=probs, vocabulary=data_generator.vocab_list)
print("\nTarget Transcription: %s\nOutput Transcription: %s" % print("\nTarget Transcription: %s\nOutput Transcription: %s" %
(target_transcription[i], best_path_transcription)) (target, result))
wer_cur = wer(target_transcription[i], best_path_transcription) print("Current error rate [%s] = %f" %
wer_sum += wer_cur (args.error_rate_type, error_rate_func(target, result)))
wer_counter += 1
print("cur wer = %f, average wer = %f" %
(wer_cur, wer_sum / wer_counter))
# beam search decode
elif args.decode_method == "beam_search":
ext_scorer = LmScorer(args.alpha, args.beta, args.language_model_path)
beam_search_batch_results = ctc_beam_search_decoder_batch(
probs_split=probs_split,
vocabulary=data_generator.vocab_list,
beam_size=args.beam_size,
blank_id=len(data_generator.vocab_list),
num_processes=args.num_processes_beam_search,
cutoff_prob=args.cutoff_prob,
ext_scoring_func=ext_scorer, )
for i, beam_search_result in enumerate(beam_search_batch_results):
print("\nTarget Transcription:\t%s" % target_transcription[i])
for index in xrange(args.num_results_per_sample):
result = beam_search_result[index]
#output: index, log prob, beam result
print("Beam %d: %f \t%s" % (index, result[0], result[1]))
wer_cur = wer(target_transcription[i], beam_search_result[0][1])
wer_sum += wer_cur
wer_counter += 1
print("cur wer = %f , average wer = %f" %
(wer_cur, wer_sum / wer_counter))
else:
raise ValueError("Decoding method [%s] is not supported." %
decode_method)
def main(): def main():
utils.print_arguments(args) utils.print_arguments(args)
paddle.init(use_gpu=args.use_gpu, trainer_count=1) paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count)
infer() infer()
......
"""Contains DeepSpeech2 layers."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import paddle.v2 as paddle
def conv_bn_layer(input, filter_size, num_channels_in, num_channels_out, stride,
padding, act):
"""Convolution layer with batch normalization.
:param input: Input layer.
:type input: LayerOutput
:param filter_size: The x dimension of a filter kernel. Or input a tuple for
two image dimension.
:type filter_size: int|tuple|list
:param num_channels_in: Number of input channels.
:type num_channels_in: int
:type num_channels_out: Number of output channels.
:type num_channels_in: out
:param padding: The x dimension of the padding. Or input a tuple for two
image dimension.
:type padding: int|tuple|list
:param act: Activation type.
:type act: BaseActivation
:return: Batch norm layer after convolution layer.
:rtype: LayerOutput
"""
conv_layer = paddle.layer.img_conv(
input=input,
filter_size=filter_size,
num_channels=num_channels_in,
num_filters=num_channels_out,
stride=stride,
padding=padding,
act=paddle.activation.Linear(),
bias_attr=False)
return paddle.layer.batch_norm(input=conv_layer, act=act)
def bidirectional_simple_rnn_bn_layer(name, input, size, act):
"""Bidirectonal simple rnn layer with sequence-wise batch normalization.
The batch normalization is only performed on input-state weights.
:param name: Name of the layer.
:type name: string
:param input: Input layer.
:type input: LayerOutput
:param size: Number of RNN cells.
:type size: int
:param act: Activation type.
:type act: BaseActivation
:return: Bidirectional simple rnn layer.
:rtype: LayerOutput
"""
# input-hidden weights shared across bi-direcitonal rnn.
input_proj = paddle.layer.fc(
input=input, size=size, act=paddle.activation.Linear(), bias_attr=False)
# batch norm is only performed on input-state projection
input_proj_bn = paddle.layer.batch_norm(
input=input_proj, act=paddle.activation.Linear())
# forward and backward in time
forward_simple_rnn = paddle.layer.recurrent(
input=input_proj_bn, act=act, reverse=False)
backward_simple_rnn = paddle.layer.recurrent(
input=input_proj_bn, act=act, reverse=True)
return paddle.layer.concat(input=[forward_simple_rnn, backward_simple_rnn])
def conv_group(input, num_stacks):
"""Convolution group with stacked convolution layers.
:param input: Input layer.
:type input: LayerOutput
:param num_stacks: Number of stacked convolution layers.
:type num_stacks: int
:return: Output layer of the convolution group.
:rtype: LayerOutput
"""
conv = conv_bn_layer(
input=input,
filter_size=(11, 41),
num_channels_in=1,
num_channels_out=32,
stride=(3, 2),
padding=(5, 20),
act=paddle.activation.BRelu())
for i in xrange(num_stacks - 1):
conv = conv_bn_layer(
input=conv,
filter_size=(11, 21),
num_channels_in=32,
num_channels_out=32,
stride=(1, 2),
padding=(5, 10),
act=paddle.activation.BRelu())
output_num_channels = 32
output_height = 160 // pow(2, num_stacks) + 1
return conv, output_num_channels, output_height
def rnn_group(input, size, num_stacks):
"""RNN group with stacked bidirectional simple RNN layers.
:param input: Input layer.
:type input: LayerOutput
:param size: Number of RNN cells in each layer.
:type size: int
:param num_stacks: Number of stacked rnn layers.
:type num_stacks: int
:return: Output layer of the RNN group.
:rtype: LayerOutput
"""
output = input
for i in xrange(num_stacks):
output = bidirectional_simple_rnn_bn_layer(
name=str(i), input=output, size=size, act=paddle.activation.BRelu())
return output
def deep_speech2(audio_data,
text_data,
dict_size,
num_conv_layers=2,
num_rnn_layers=3,
rnn_size=256):
"""
The whole DeepSpeech2 model structure (a simplified version).
:param audio_data: Audio spectrogram data layer.
:type audio_data: LayerOutput
:param text_data: Transcription text data layer.
:type text_data: LayerOutput
:param dict_size: Dictionary size for tokenized transcription.
:type dict_size: int
:param num_conv_layers: Number of stacking convolution layers.
:type num_conv_layers: int
:param num_rnn_layers: Number of stacking RNN layers.
:type num_rnn_layers: int
:param rnn_size: RNN layer size (number of RNN cells).
:type rnn_size: int
:return: A tuple of an output unnormalized log probability layer (
before softmax) and a ctc cost layer.
:rtype: tuple of LayerOutput
"""
# convolution group
conv_group_output, conv_group_num_channels, conv_group_height = conv_group(
input=audio_data, num_stacks=num_conv_layers)
# convert data form convolution feature map to sequence of vectors
conv2seq = paddle.layer.block_expand(
input=conv_group_output,
num_channels=conv_group_num_channels,
stride_x=1,
stride_y=1,
block_x=1,
block_y=conv_group_height)
# rnn group
rnn_group_output = rnn_group(
input=conv2seq, size=rnn_size, num_stacks=num_rnn_layers)
fc = paddle.layer.fc(
input=rnn_group_output,
size=dict_size + 1,
act=paddle.activation.Linear(),
bias_attr=True)
# probability distribution with softmax
log_probs = paddle.layer.mixed(
input=paddle.layer.identity_projection(input=fc),
act=paddle.activation.Softmax())
# ctc cost
ctc_loss = paddle.layer.warp_ctc(
input=fc,
label=text_data,
size=dict_size + 1,
blank=dict_size,
norm_by_times=True)
return log_probs, ctc_loss
echo "Downloading language model." echo "Downloading language model ..."
mkdir data
LM=common_crawl_00.prune01111.trie.klm
MD5="099a601759d467cd0a8523ff939819c5"
wget -c http://paddlepaddle.bj.bcebos.com/model_zoo/speech/$LM -P ./data
echo "Checking md5sum ..."
md5_tmp=`md5sum ./data/$LM | awk -F[' '] '{print $1}'`
if [ $MD5 != $md5_tmp ]; then
echo "Fail to download the language model!"
exit 1
fi
wget -c ftp://xxx/xxx/en.00.UNKNOWN.klm -P ./data
...@@ -3,141 +3,244 @@ from __future__ import absolute_import ...@@ -3,141 +3,244 @@ from __future__ import absolute_import
from __future__ import division from __future__ import division
from __future__ import print_function from __future__ import print_function
import sys
import os
import time
import gzip
from decoder import *
from lm.lm_scorer import LmScorer
import paddle.v2 as paddle import paddle.v2 as paddle
from layer import *
def conv_bn_layer(input, filter_size, num_channels_in, num_channels_out, stride, class DeepSpeech2Model(object):
padding, act): """DeepSpeech2Model class.
"""
Convolution layer with batch normalization. :param vocab_size: Decoding vocabulary size.
""" :type vocab_size: int
conv_layer = paddle.layer.img_conv(
input=input,
filter_size=filter_size,
num_channels=num_channels_in,
num_filters=num_channels_out,
stride=stride,
padding=padding,
act=paddle.activation.Linear(),
bias_attr=False)
return paddle.layer.batch_norm(input=conv_layer, act=act)
def bidirectional_simple_rnn_bn_layer(name, input, size, act):
"""
Bidirectonal simple rnn layer with sequence-wise batch normalization.
The batch normalization is only performed on input-state weights.
"""
# input-hidden weights shared across bi-direcitonal rnn.
input_proj = paddle.layer.fc(
input=input, size=size, act=paddle.activation.Linear(), bias_attr=False)
# batch norm is only performed on input-state projection
input_proj_bn = paddle.layer.batch_norm(
input=input_proj, act=paddle.activation.Linear())
# forward and backward in time
forward_simple_rnn = paddle.layer.recurrent(
input=input_proj_bn, act=act, reverse=False)
backward_simple_rnn = paddle.layer.recurrent(
input=input_proj_bn, act=act, reverse=True)
return paddle.layer.concat(input=[forward_simple_rnn, backward_simple_rnn])
def conv_group(input, num_stacks):
"""
Convolution group with several stacking convolution layers.
"""
conv = conv_bn_layer(
input=input,
filter_size=(11, 41),
num_channels_in=1,
num_channels_out=32,
stride=(3, 2),
padding=(5, 20),
act=paddle.activation.BRelu())
for i in xrange(num_stacks - 1):
conv = conv_bn_layer(
input=conv,
filter_size=(11, 21),
num_channels_in=32,
num_channels_out=32,
stride=(1, 2),
padding=(5, 10),
act=paddle.activation.BRelu())
output_num_channels = 32
output_height = 160 // pow(2, num_stacks) + 1
return conv, output_num_channels, output_height
def rnn_group(input, size, num_stacks):
"""
RNN group with several stacking RNN layers.
"""
output = input
for i in xrange(num_stacks):
output = bidirectional_simple_rnn_bn_layer(
name=str(i), input=output, size=size, act=paddle.activation.BRelu())
return output
def deep_speech2(audio_data,
text_data,
dict_size,
num_conv_layers=2,
num_rnn_layers=3,
rnn_size=256,
is_inference=False):
"""
The whole DeepSpeech2 model structure (a simplified version).
:param audio_data: Audio spectrogram data layer.
:type audio_data: LayerOutput
:param text_data: Transcription text data layer.
:type text_data: LayerOutput
:param dict_size: Dictionary size for tokenized transcription.
:type dict_size: int
:param num_conv_layers: Number of stacking convolution layers. :param num_conv_layers: Number of stacking convolution layers.
:type num_conv_layers: int :type num_conv_layers: int
:param num_rnn_layers: Number of stacking RNN layers. :param num_rnn_layers: Number of stacking RNN layers.
:type num_rnn_layers: int :type num_rnn_layers: int
:param rnn_size: RNN layer size (number of RNN cells). :param rnn_layer_size: RNN layer size (number of RNN cells).
:type rnn_size: int :type rnn_layer_size: int
:param is_inference: False in the training mode, and True in the :param pretrained_model_path: Pretrained model path. If None, will train
inferene mode. from stratch.
:type is_inference: bool :type pretrained_model_path: basestring|None
:return: If is_inference set False, return a ctc cost layer; """
if is_inference set True, return a sequence layer of output
probability distribution. def __init__(self, vocab_size, num_conv_layers, num_rnn_layers,
:rtype: tuple of LayerOutput rnn_layer_size, pretrained_model_path):
self._create_network(vocab_size, num_conv_layers, num_rnn_layers,
rnn_layer_size)
self._create_parameters(pretrained_model_path)
self._inferer = None
self._loss_inferer = None
self._ext_scorer = None
def train(self,
train_batch_reader,
dev_batch_reader,
feeding_dict,
learning_rate,
gradient_clipping,
num_passes,
output_model_dir,
is_local=True,
num_iterations_print=100):
"""Train the model.
:param train_batch_reader: Train data reader.
:type train_batch_reader: callable
:param dev_batch_reader: Validation data reader.
:type dev_batch_reader: callable
:param feeding_dict: Feeding is a map of field name and tuple index
of the data that reader returns.
:type feeding_dict: dict|list
:param learning_rate: Learning rate for ADAM optimizer.
:type learning_rate: float
:param gradient_clipping: Gradient clipping threshold.
:type gradient_clipping: float
:param num_passes: Number of training epochs.
:type num_passes: int
:param num_iterations_print: Number of training iterations for printing
a training loss.
:type rnn_iteratons_print: int
:param is_local: Set to False if running with pserver with multi-nodes.
:type is_local: bool
:param output_model_dir: Directory for saving the model (every pass).
:type output_model_dir: basestring
"""
# prepare model output directory
if not os.path.exists(output_model_dir):
os.mkdir(output_model_dir)
# prepare optimizer and trainer
optimizer = paddle.optimizer.Adam(
learning_rate=learning_rate,
gradient_clipping_threshold=gradient_clipping)
trainer = paddle.trainer.SGD(
cost=self._loss,
parameters=self._parameters,
update_equation=optimizer,
is_local=is_local)
# create event handler
def event_handler(event):
global start_time, cost_sum, cost_counter
if isinstance(event, paddle.event.EndIteration):
cost_sum += event.cost
cost_counter += 1
if (event.batch_id + 1) % num_iterations_print == 0:
output_model_path = os.path.join(output_model_dir,
"params.latest.tar.gz")
with gzip.open(output_model_path, 'w') as f:
self._parameters.to_tar(f)
print("\nPass: %d, Batch: %d, TrainCost: %f" %
(event.pass_id, event.batch_id + 1,
cost_sum / cost_counter))
cost_sum, cost_counter = 0.0, 0
else:
sys.stdout.write('.')
sys.stdout.flush()
if isinstance(event, paddle.event.BeginPass):
start_time = time.time()
cost_sum, cost_counter = 0.0, 0
if isinstance(event, paddle.event.EndPass):
result = trainer.test(
reader=dev_batch_reader, feeding=feeding_dict)
output_model_path = os.path.join(
output_model_dir, "params.pass-%d.tar.gz" % event.pass_id)
with gzip.open(output_model_path, 'w') as f:
self._parameters.to_tar(f)
print("\n------- Time: %d sec, Pass: %d, ValidationCost: %s" %
(time.time() - start_time, event.pass_id, result.cost))
# run train
trainer.train(
reader=train_batch_reader,
event_handler=event_handler,
num_passes=num_passes,
feeding=feeding_dict)
def infer_loss_batch(self, infer_data):
"""Model inference. Infer the ctc loss for a batch of speech
utterances.
:param infer_data: List of utterances to infer, with each utterance a
tuple of audio features and transcription text (empty
string).
:type infer_data: list
:return: List of ctc loss.
:rtype: List of float
""" """
# convolution group # define inferer
conv_group_output, conv_group_num_channels, conv_group_height = conv_group( if self._loss_inferer == None:
input=audio_data, num_stacks=num_conv_layers) self._loss_inferer = paddle.inference.Inference(
# convert data form convolution feature map to sequence of vectors output_layer=self._loss, parameters=self._parameters)
conv2seq = paddle.layer.block_expand( # run inference
input=conv_group_output, return self._loss_inferer.infer(input=infer_data)
num_channels=conv_group_num_channels,
stride_x=1, def infer_batch(self, infer_data, decode_method, beam_alpha, beam_beta,
stride_y=1, beam_size, cutoff_prob, vocab_list, language_model_path,
block_x=1, num_processes):
block_y=conv_group_height) """Model inference. Infer the transcription for a batch of speech
# rnn group utterances.
rnn_group_output = rnn_group(
input=conv2seq, size=rnn_size, num_stacks=num_rnn_layers) :param infer_data: List of utterances to infer, with each utterance
fc = paddle.layer.fc( consisting of a tuple of audio features and
input=rnn_group_output, transcription text (empty string).
size=dict_size + 1, :type infer_data: list
act=paddle.activation.Linear(), :param decode_method: Decoding method name, 'best_path' or
bias_attr=True) 'beam search'.
if is_inference: :param decode_method: string
# probability distribution with softmax :param beam_alpha: Parameter associated with language model.
return paddle.layer.mixed( :type beam_alpha: float
input=paddle.layer.identity_projection(input=fc), :param beam_beta: Parameter associated with word count.
act=paddle.activation.Softmax()) :type beam_beta: float
:param beam_size: Width for Beam search.
:type beam_size: int
:param cutoff_prob: Cutoff probability in pruning,
default 1.0, no pruning.
:type cutoff_prob: float
:param vocab_list: List of tokens in the vocabulary, for decoding.
:type vocab_list: list
:param language_model_path: Filepath for language model.
:type language_model_path: basestring|None
:param num_processes: Number of processes (CPU) for decoder.
:type num_processes: int
:return: List of transcription texts.
:rtype: List of basestring
"""
# define inferer
if self._inferer == None:
self._inferer = paddle.inference.Inference(
output_layer=self._log_probs, parameters=self._parameters)
# run inference
infer_results = self._inferer.infer(input=infer_data)
num_steps = len(infer_results) // len(infer_data)
probs_split = [
infer_results[i * num_steps:(i + 1) * num_steps]
for i in xrange(0, len(infer_data))
]
# run decoder
results = []
if decode_method == "best_path":
# best path decode
for i, probs in enumerate(probs_split):
output_transcription = ctc_best_path_decoder(
probs_seq=probs, vocabulary=vocab_list)
results.append(output_transcription)
elif decode_method == "beam_search":
# initialize external scorer
if self._ext_scorer == None:
self._ext_scorer = LmScorer(beam_alpha, beam_beta,
language_model_path)
self._loaded_lm_path = language_model_path
else:
self._ext_scorer.reset_params(beam_alpha, beam_beta)
assert self._loaded_lm_path == language_model_path
# beam search decode
beam_search_results = ctc_beam_search_decoder_batch(
probs_split=probs_split,
vocabulary=vocab_list,
beam_size=beam_size,
blank_id=len(vocab_list),
num_processes=num_processes,
ext_scoring_func=self._ext_scorer,
cutoff_prob=cutoff_prob)
results = [result[0][1] for result in beam_search_results]
else: else:
# ctc cost raise ValueError("Decoding method [%s] is not supported." %
return paddle.layer.warp_ctc( decode_method)
input=fc, return results
label=text_data,
size=dict_size + 1, def _create_parameters(self, model_path=None):
blank=dict_size, """Load or create model parameters."""
norm_by_times=True) if model_path is None:
self._parameters = paddle.parameters.create(self._loss)
else:
self._parameters = paddle.parameters.Parameters.from_tar(
gzip.open(model_path))
def _create_network(self, vocab_size, num_conv_layers, num_rnn_layers,
rnn_layer_size):
"""Create data layers and model network."""
# paddle.data_type.dense_array is used for variable batch input.
# The size 161 * 161 is only an placeholder value and the real shape
# of input batch data will be induced during training.
audio_data = paddle.layer.data(
name="audio_spectrogram",
type=paddle.data_type.dense_array(161 * 161))
text_data = paddle.layer.data(
name="transcript_text",
type=paddle.data_type.integer_value_sequence(vocab_size))
self._log_probs, self._loss = deep_speech2(
audio_data=audio_data,
text_data=text_data,
dict_size=vocab_size,
num_conv_layers=num_conv_layers,
num_rnn_layers=num_rnn_layers,
rnn_size=rnn_layer_size)
wget==3.2
scipy==0.13.1 scipy==0.13.1
resampy==0.1.5 resampy==0.1.5
https://github.com/kpu/kenlm/archive/master.zip SoundFile==0.9.0.post1
python_speech_features
https://github.com/luotao1/kenlm/archive/master.zip
...@@ -9,25 +9,21 @@ if [ $? != 0 ]; then ...@@ -9,25 +9,21 @@ if [ $? != 0 ]; then
exit 1 exit 1
fi fi
# install package Soundfile # install package libsndfile
curl -O "http://www.mega-nerd.com/libsndfile/files/libsndfile-1.0.28.tar.gz" python -c "import soundfile"
if [ $? != 0 ]; then if [ $? != 0 ]; then
echo "Install package libsndfile into default system path."
wget "http://www.mega-nerd.com/libsndfile/files/libsndfile-1.0.28.tar.gz"
if [ $? != 0 ]; then
echo "Download libsndfile-1.0.28.tar.gz failed !!!" echo "Download libsndfile-1.0.28.tar.gz failed !!!"
exit 1 exit 1
fi
tar -zxvf libsndfile-1.0.28.tar.gz
cd libsndfile-1.0.28
./configure && make && make install
cd ..
rm -rf libsndfile-1.0.28
rm libsndfile-1.0.28.tar.gz
fi fi
tar -zxvf libsndfile-1.0.28.tar.gz
cd libsndfile-1.0.28
./configure && make && make install
cd -
rm -rf libsndfile-1.0.28
rm libsndfile-1.0.28.tar.gz
pip install SoundFile==0.9.0.post1
if [ $? != 0 ]; then
echo "Install SoundFile failed !!!"
exit 1
fi
# prepare ./checkpoints
mkdir checkpoints
echo "Install all dependencies successfully." echo "Install all dependencies successfully."
...@@ -11,16 +11,54 @@ import error_rate ...@@ -11,16 +11,54 @@ import error_rate
class TestParse(unittest.TestCase): class TestParse(unittest.TestCase):
def test_wer_1(self): def test_wer_1(self):
ref = 'i UM the PHONE IS i LEFT THE portable PHONE UPSTAIRS last night' ref = 'i UM the PHONE IS i LEFT THE portable PHONE UPSTAIRS last night'
hyp = 'i GOT IT TO the FULLEST i LOVE TO portable FROM OF STORES last night' hyp = 'i GOT IT TO the FULLEST i LOVE TO portable FROM OF STORES last '\
'night'
word_error_rate = error_rate.wer(ref, hyp) word_error_rate = error_rate.wer(ref, hyp)
self.assertTrue(abs(word_error_rate - 0.769230769231) < 1e-6) self.assertTrue(abs(word_error_rate - 0.769230769231) < 1e-6)
def test_wer_2(self): def test_wer_2(self):
ref = 'as any in england i would say said gamewell proudly that is '\
'in his day'
hyp = 'as any in england i would say said came well proudly that is '\
'in his day'
word_error_rate = error_rate.wer(ref, hyp)
self.assertTrue(abs(word_error_rate - 0.1333333) < 1e-6)
def test_wer_3(self):
ref = 'the lieutenant governor lilburn w boggs afterward governor '\
'was a pronounced mormon hater and throughout the period of '\
'the troubles he manifested sympathy with the persecutors'
hyp = 'the lieutenant governor little bit how bags afterward '\
'governor was a pronounced warman hater and throughout the '\
'period of th troubles he manifests sympathy with the '\
'persecutors'
word_error_rate = error_rate.wer(ref, hyp)
self.assertTrue(abs(word_error_rate - 0.2692307692) < 1e-6)
def test_wer_4(self):
ref = 'the wood flamed up splendidly under the large brewing copper '\
'and it sighed so deeply'
hyp = 'the wood flame do splendidly under the large brewing copper '\
'and its side so deeply'
word_error_rate = error_rate.wer(ref, hyp)
self.assertTrue(abs(word_error_rate - 0.2666666667) < 1e-6)
def test_wer_5(self):
ref = 'all the morning they trudged up the mountain path and at noon '\
'unc and ojo sat on a fallen tree trunk and ate the last of '\
'the bread which the old munchkin had placed in his pocket'
hyp = 'all the morning they trudged up the mountain path and at noon '\
'unc in ojo sat on a fallen tree trunk and ate the last of '\
'the bread which the old munchkin had placed in his pocket'
word_error_rate = error_rate.wer(ref, hyp)
self.assertTrue(abs(word_error_rate - 0.027027027) < 1e-6)
def test_wer_6(self):
ref = 'i UM the PHONE IS i LEFT THE portable PHONE UPSTAIRS last night' ref = 'i UM the PHONE IS i LEFT THE portable PHONE UPSTAIRS last night'
word_error_rate = error_rate.wer(ref, ref) word_error_rate = error_rate.wer(ref, ref)
self.assertEqual(word_error_rate, 0.0) self.assertEqual(word_error_rate, 0.0)
def test_wer_3(self): def test_wer_7(self):
ref = ' ' ref = ' '
hyp = 'Hypothesis sentence' hyp = 'Hypothesis sentence'
with self.assertRaises(ValueError): with self.assertRaises(ValueError):
...@@ -33,22 +71,40 @@ class TestParse(unittest.TestCase): ...@@ -33,22 +71,40 @@ class TestParse(unittest.TestCase):
self.assertTrue(abs(char_error_rate - 0.25) < 1e-6) self.assertTrue(abs(char_error_rate - 0.25) < 1e-6)
def test_cer_2(self): def test_cer_2(self):
ref = 'werewolf'
hyp = 'weae wolf'
char_error_rate = error_rate.cer(ref, hyp, remove_space=True)
self.assertTrue(abs(char_error_rate - 0.125) < 1e-6)
def test_cer_3(self):
ref = 'were wolf'
hyp = 'were wolf'
char_error_rate = error_rate.cer(ref, hyp)
self.assertTrue(abs(char_error_rate - 0.0) < 1e-6)
def test_cer_4(self):
ref = 'werewolf' ref = 'werewolf'
char_error_rate = error_rate.cer(ref, ref) char_error_rate = error_rate.cer(ref, ref)
self.assertEqual(char_error_rate, 0.0) self.assertEqual(char_error_rate, 0.0)
def test_cer_3(self): def test_cer_5(self):
ref = u'我是中国人' ref = u'我是中国人'
hyp = u'我是 美洲人' hyp = u'我是 美洲人'
char_error_rate = error_rate.cer(ref, hyp) char_error_rate = error_rate.cer(ref, hyp)
self.assertTrue(abs(char_error_rate - 0.6) < 1e-6) self.assertTrue(abs(char_error_rate - 0.6) < 1e-6)
def test_cer_4(self): def test_cer_6(self):
ref = u'我 是 中 国 人'
hyp = u'我 是 美 洲 人'
char_error_rate = error_rate.cer(ref, hyp, remove_space=True)
self.assertTrue(abs(char_error_rate - 0.4) < 1e-6)
def test_cer_7(self):
ref = u'我是中国人' ref = u'我是中国人'
char_error_rate = error_rate.cer(ref, ref) char_error_rate = error_rate.cer(ref, ref)
self.assertFalse(char_error_rate, 0.0) self.assertFalse(char_error_rate, 0.0)
def test_cer_5(self): def test_cer_8(self):
ref = '' ref = ''
hyp = 'Hypothesis' hyp = 'Hypothesis'
with self.assertRaises(ValueError): with self.assertRaises(ValueError):
......
"""Test Setup."""
import unittest
import numpy as np
import os
class TestSetup(unittest.TestCase):
def test_soundfile(self):
import soundfile as sf
# floating point data is typically limited to the interval [-1.0, 1.0],
# but smaller/larger values are supported as well
data = np.array([[1.75, -1.75], [1.0, -1.0], [0.5, -0.5],
[0.25, -0.25]])
file = 'test.wav'
sf.write(file, data, 44100, format='WAV', subtype='FLOAT')
read, fs = sf.read(file)
self.assertTrue(np.all(read == data))
self.assertEqual(fs, 44100)
os.remove(file)
if __name__ == '__main__':
unittest.main()
"""Set up paths for DS2"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os.path
import sys
def add_path(path):
if path not in sys.path:
sys.path.insert(0, path)
this_dir = os.path.dirname(__file__)
# Add project path to PYTHONPATH
proj_path = os.path.join(this_dir, '..')
add_path(proj_path)
"""Build vocabulary from manifest files.
Each item in vocabulary file is a character.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import codecs
import json
from collections import Counter
import os.path
import _init_paths
from data_utils import utils
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--manifest_paths",
type=str,
help="Manifest paths for building vocabulary."
"You can provide multiple manifest files.",
nargs='+',
required=True)
parser.add_argument(
"--count_threshold",
default=0,
type=int,
help="Characters whose counts are below the threshold will be truncated. "
"(default: %(default)i)")
parser.add_argument(
"--vocab_path",
default='datasets/vocab/zh_vocab.txt',
type=str,
help="File path to write the vocabulary. (default: %(default)s)")
args = parser.parse_args()
def count_manifest(counter, manifest_path):
manifest_jsons = utils.read_manifest(manifest_path)
for line_json in manifest_jsons:
for char in line_json['text']:
counter.update(char)
def main():
counter = Counter()
for manifest_path in args.manifest_paths:
count_manifest(counter, manifest_path)
count_sorted = sorted(counter.items(), key=lambda x: x[1], reverse=True)
with codecs.open(args.vocab_path, 'w', 'utf-8') as fout:
for char, count in count_sorted:
if count < args.count_threshold: break
fout.write(char + '\n')
if __name__ == '__main__':
main()
...@@ -4,12 +4,19 @@ from __future__ import division ...@@ -4,12 +4,19 @@ from __future__ import division
from __future__ import print_function from __future__ import print_function
import argparse import argparse
import _init_paths
from data_utils.normalizer import FeatureNormalizer from data_utils.normalizer import FeatureNormalizer
from data_utils.augmentor.augmentation import AugmentationPipeline from data_utils.augmentor.augmentation import AugmentationPipeline
from data_utils.featurizer.audio_featurizer import AudioFeaturizer from data_utils.featurizer.audio_featurizer import AudioFeaturizer
parser = argparse.ArgumentParser( parser = argparse.ArgumentParser(
description='Computing mean and stddev for feature normalizer.') description='Computing mean and stddev for feature normalizer.')
parser.add_argument(
"--specgram_type",
default='linear',
type=str,
help="Feature type of audio data: 'linear' (power spectrum)"
" or 'mfcc'. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--manifest_path", "--manifest_path",
default='datasets/manifest.train', default='datasets/manifest.train',
...@@ -39,7 +46,7 @@ args = parser.parse_args() ...@@ -39,7 +46,7 @@ args = parser.parse_args()
def main(): def main():
augmentation_pipeline = AugmentationPipeline(args.augmentation_config) augmentation_pipeline = AugmentationPipeline(args.augmentation_config)
audio_featurizer = AudioFeaturizer() audio_featurizer = AudioFeaturizer(specgram_type=args.specgram_type)
def augment_and_featurize(audio_segment): def augment_and_featurize(audio_segment):
augmentation_pipeline.transform_audio(audio_segment) augmentation_pipeline.transform_audio(audio_segment)
......
...@@ -3,15 +3,11 @@ from __future__ import absolute_import ...@@ -3,15 +3,11 @@ from __future__ import absolute_import
from __future__ import division from __future__ import division
from __future__ import print_function from __future__ import print_function
import sys
import os
import argparse import argparse
import gzip
import time
import distutils.util import distutils.util
import multiprocessing import multiprocessing
import paddle.v2 as paddle import paddle.v2 as paddle
from model import deep_speech2 from model import DeepSpeech2Model
from data_utils.data import DataGenerator from data_utils.data import DataGenerator
import utils import utils
...@@ -23,6 +19,12 @@ parser.add_argument( ...@@ -23,6 +19,12 @@ parser.add_argument(
default=200, default=200,
type=int, type=int,
help="Training pass number. (default: %(default)s)") help="Training pass number. (default: %(default)s)")
parser.add_argument(
"--num_iterations_print",
default=100,
type=int,
help="Number of iterations for every train cost printing. "
"(default: %(default)s)")
parser.add_argument( parser.add_argument(
"--num_conv_layers", "--num_conv_layers",
default=2, default=2,
...@@ -53,6 +55,12 @@ parser.add_argument( ...@@ -53,6 +55,12 @@ parser.add_argument(
default=True, default=True,
type=distutils.util.strtobool, type=distutils.util.strtobool,
help="Use sortagrad or not. (default: %(default)s)") help="Use sortagrad or not. (default: %(default)s)")
parser.add_argument(
"--specgram_type",
default='linear',
type=str,
help="Feature type of audio data: 'linear' (power spectrum)"
" or 'mfcc'. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--max_duration", "--max_duration",
default=27.0, default=27.0,
...@@ -78,7 +86,7 @@ parser.add_argument( ...@@ -78,7 +86,7 @@ parser.add_argument(
help="Trainer number. (default: %(default)s)") help="Trainer number. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--num_threads_data", "--num_threads_data",
default=multiprocessing.cpu_count(), default=multiprocessing.cpu_count() // 2,
type=int, type=int,
help="Number of cpu threads for preprocessing data. (default: %(default)s)") help="Number of cpu threads for preprocessing data. (default: %(default)s)")
parser.add_argument( parser.add_argument(
...@@ -108,112 +116,71 @@ parser.add_argument( ...@@ -108,112 +116,71 @@ parser.add_argument(
help="If set None, the training will start from scratch. " help="If set None, the training will start from scratch. "
"Otherwise, the training will resume from " "Otherwise, the training will resume from "
"the existing model of this path. (default: %(default)s)") "the existing model of this path. (default: %(default)s)")
parser.add_argument(
"--output_model_dir",
default="./checkpoints",
type=str,
help="Directory for saving models. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--augmentation_config", "--augmentation_config",
default='[{"type": "shift", ' default=open('conf/augmentation.config', 'r').read(),
'"params": {"min_shift_ms": -5, "max_shift_ms": 5},'
'"prob": 1.0}]',
type=str, type=str,
help="Augmentation configuration in json-format. " help="Augmentation configuration in json-format. "
"(default: %(default)s)") "(default: %(default)s)")
parser.add_argument(
"--is_local",
default=True,
type=distutils.util.strtobool,
help="Set to false if running with pserver in paddlecloud. "
"(default: %(default)s)")
args = parser.parse_args() args = parser.parse_args()
def train(): def train():
"""DeepSpeech2 training.""" """DeepSpeech2 training."""
train_generator = DataGenerator(
# initialize data generator
def data_generator():
return DataGenerator(
vocab_filepath=args.vocab_filepath, vocab_filepath=args.vocab_filepath,
mean_std_filepath=args.mean_std_filepath, mean_std_filepath=args.mean_std_filepath,
augmentation_config=args.augmentation_config, augmentation_config=args.augmentation_config,
max_duration=args.max_duration, max_duration=args.max_duration,
min_duration=args.min_duration, min_duration=args.min_duration,
specgram_type=args.specgram_type,
num_threads=args.num_threads_data)
dev_generator = DataGenerator(
vocab_filepath=args.vocab_filepath,
mean_std_filepath=args.mean_std_filepath,
augmentation_config="{}",
specgram_type=args.specgram_type,
num_threads=args.num_threads_data) num_threads=args.num_threads_data)
train_generator = data_generator()
test_generator = data_generator()
# create network config
# paddle.data_type.dense_array is used for variable batch input.
# The size 161 * 161 is only an placeholder value and the real shape
# of input batch data will be induced during training.
audio_data = paddle.layer.data(
name="audio_spectrogram", type=paddle.data_type.dense_array(161 * 161))
text_data = paddle.layer.data(
name="transcript_text",
type=paddle.data_type.integer_value_sequence(
train_generator.vocab_size))
cost = deep_speech2(
audio_data=audio_data,
text_data=text_data,
dict_size=train_generator.vocab_size,
num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers,
rnn_size=args.rnn_layer_size,
is_inference=False)
# create/load parameters and optimizer
if args.init_model_path is None:
parameters = paddle.parameters.create(cost)
else:
if not os.path.isfile(args.init_model_path):
raise IOError("Invalid model!")
parameters = paddle.parameters.Parameters.from_tar(
gzip.open(args.init_model_path))
optimizer = paddle.optimizer.Adam(
learning_rate=args.adam_learning_rate, gradient_clipping_threshold=400)
trainer = paddle.trainer.SGD(
cost=cost, parameters=parameters, update_equation=optimizer)
# prepare data reader
train_batch_reader = train_generator.batch_reader_creator( train_batch_reader = train_generator.batch_reader_creator(
manifest_path=args.train_manifest_path, manifest_path=args.train_manifest_path,
batch_size=args.batch_size, batch_size=args.batch_size,
min_batch_size=args.trainer_count, min_batch_size=args.trainer_count,
sortagrad=args.use_sortagrad if args.init_model_path is None else False, sortagrad=args.use_sortagrad if args.init_model_path is None else False,
shuffle_method=args.shuffle_method) shuffle_method=args.shuffle_method)
test_batch_reader = test_generator.batch_reader_creator( dev_batch_reader = dev_generator.batch_reader_creator(
manifest_path=args.dev_manifest_path, manifest_path=args.dev_manifest_path,
batch_size=args.batch_size, batch_size=args.batch_size,
min_batch_size=1, # must be 1, but will have errors. min_batch_size=1, # must be 1, but will have errors.
sortagrad=False, sortagrad=False,
shuffle_method=None) shuffle_method=None)
# create event handler ds2_model = DeepSpeech2Model(
def event_handler(event): vocab_size=train_generator.vocab_size,
global start_time, cost_sum, cost_counter num_conv_layers=args.num_conv_layers,
if isinstance(event, paddle.event.EndIteration): num_rnn_layers=args.num_rnn_layers,
cost_sum += event.cost rnn_layer_size=args.rnn_layer_size,
cost_counter += 1 pretrained_model_path=args.init_model_path)
if (event.batch_id + 1) % 100 == 0: ds2_model.train(
print("\nPass: %d, Batch: %d, TrainCost: %f" % ( train_batch_reader=train_batch_reader,
event.pass_id, event.batch_id + 1, cost_sum / cost_counter)) dev_batch_reader=dev_batch_reader,
cost_sum, cost_counter = 0.0, 0 feeding_dict=train_generator.feeding,
with gzip.open("checkpoints/params.latest.tar.gz", 'w') as f: learning_rate=args.adam_learning_rate,
parameters.to_tar(f) gradient_clipping=400,
else:
sys.stdout.write('.')
sys.stdout.flush()
if isinstance(event, paddle.event.BeginPass):
start_time = time.time()
cost_sum, cost_counter = 0.0, 0
if isinstance(event, paddle.event.EndPass):
result = trainer.test(
reader=test_batch_reader, feeding=test_generator.feeding)
print("\n------- Time: %d sec, Pass: %d, ValidationCost: %s" %
(time.time() - start_time, event.pass_id, result.cost))
with gzip.open("checkpoints/params.pass-%d.tar.gz" % event.pass_id,
'w') as f:
parameters.to_tar(f)
# run train
trainer.train(
reader=train_batch_reader,
event_handler=event_handler,
num_passes=args.num_passes, num_passes=args.num_passes,
feeding=train_generator.feeding) num_iterations_print=args.num_iterations_print,
output_model_dir=args.output_model_dir,
is_local=args.is_local)
def main(): def main():
......
...@@ -3,14 +3,13 @@ from __future__ import absolute_import ...@@ -3,14 +3,13 @@ from __future__ import absolute_import
from __future__ import division from __future__ import division
from __future__ import print_function from __future__ import print_function
import numpy as np
import distutils.util import distutils.util
import argparse import argparse
import gzip import multiprocessing
import paddle.v2 as paddle import paddle.v2 as paddle
from data_utils.data import DataGenerator from data_utils.data import DataGenerator
from model import deep_speech2 from model import DeepSpeech2Model
from decoder import *
from lm.lm_scorer import LmScorer
from error_rate import wer from error_rate import wer
import utils import utils
...@@ -40,26 +39,37 @@ parser.add_argument( ...@@ -40,26 +39,37 @@ parser.add_argument(
default=True, default=True,
type=distutils.util.strtobool, type=distutils.util.strtobool,
help="Use gpu or not. (default: %(default)s)") help="Use gpu or not. (default: %(default)s)")
parser.add_argument(
"--trainer_count",
default=8,
type=int,
help="Trainer number. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--num_threads_data", "--num_threads_data",
default=multiprocessing.cpu_count(), default=1,
type=int, type=int,
help="Number of cpu threads for preprocessing data. (default: %(default)s)") help="Number of cpu threads for preprocessing data. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--num_processes_beam_search", "--num_processes_beam_search",
default=multiprocessing.cpu_count(), default=multiprocessing.cpu_count() // 2,
type=int, type=int,
help="Number of cpu processes for beam search. (default: %(default)s)") help="Number of cpu processes for beam search. (default: %(default)s)")
parser.add_argument(
"--specgram_type",
default='linear',
type=str,
help="Feature type of audio data: 'linear' (power spectrum)"
" or 'mfcc'. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--mean_std_filepath", "--mean_std_filepath",
default='mean_std.npz', default='mean_std.npz',
type=str, type=str,
help="Manifest path for normalizer. (default: %(default)s)") help="Manifest path for normalizer. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--decode_manifest_path", "--tune_manifest_path",
default='datasets/manifest.test', default='datasets/manifest.dev',
type=str, type=str,
help="Manifest path for decoding. (default: %(default)s)") help="Manifest path for tuning. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--model_filepath", "--model_filepath",
default='checkpoints/params.latest.tar.gz', default='checkpoints/params.latest.tar.gz',
...@@ -77,7 +87,7 @@ parser.add_argument( ...@@ -77,7 +87,7 @@ parser.add_argument(
help="Width for beam search decoding. (default: %(default)d)") help="Width for beam search decoding. (default: %(default)d)")
parser.add_argument( parser.add_argument(
"--language_model_path", "--language_model_path",
default="lm/data/1Billion.klm", default="lm/data/common_crawl_00.prune01111.trie.klm",
type=str, type=str,
help="Path for language model. (default: %(default)s)") help="Path for language model. (default: %(default)s)")
parser.add_argument( parser.add_argument(
...@@ -121,95 +131,64 @@ args = parser.parse_args() ...@@ -121,95 +131,64 @@ args = parser.parse_args()
def tune(): def tune():
"""Tune parameters alpha and beta on one minibatch.""" """Tune parameters alpha and beta on one minibatch."""
if not args.num_alphas >= 0: if not args.num_alphas >= 0:
raise ValueError("num_alphas must be non-negative!") raise ValueError("num_alphas must be non-negative!")
if not args.num_betas >= 0: if not args.num_betas >= 0:
raise ValueError("num_betas must be non-negative!") raise ValueError("num_betas must be non-negative!")
# initialize data generator
data_generator = DataGenerator( data_generator = DataGenerator(
vocab_filepath=args.vocab_filepath, vocab_filepath=args.vocab_filepath,
mean_std_filepath=args.mean_std_filepath, mean_std_filepath=args.mean_std_filepath,
augmentation_config='{}', augmentation_config='{}',
specgram_type=args.specgram_type,
num_threads=args.num_threads_data) num_threads=args.num_threads_data)
# create network config
# paddle.data_type.dense_array is used for variable batch input.
# The size 161 * 161 is only an placeholder value and the real shape
# of input batch data will be induced during training.
audio_data = paddle.layer.data(
name="audio_spectrogram", type=paddle.data_type.dense_array(161 * 161))
text_data = paddle.layer.data(
name="transcript_text",
type=paddle.data_type.integer_value_sequence(data_generator.vocab_size))
output_probs = deep_speech2(
audio_data=audio_data,
text_data=text_data,
dict_size=data_generator.vocab_size,
num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers,
rnn_size=args.rnn_layer_size,
is_inference=True)
# load parameters
parameters = paddle.parameters.Parameters.from_tar(
gzip.open(args.model_filepath))
# prepare infer data
batch_reader = data_generator.batch_reader_creator( batch_reader = data_generator.batch_reader_creator(
manifest_path=args.decode_manifest_path, manifest_path=args.tune_manifest_path,
batch_size=args.num_samples, batch_size=args.num_samples,
sortagrad=False, sortagrad=False,
shuffle_method=None) shuffle_method=None)
# get one batch data for tuning tune_data = batch_reader().next()
infer_data = batch_reader().next() target_transcripts = [
''.join([data_generator.vocab_list[token] for token in transcript])
# run inference for _, transcript in tune_data
infer_results = paddle.infer(
output_layer=output_probs, parameters=parameters, input=infer_data)
num_steps = len(infer_results) // len(infer_data)
probs_split = [
infer_results[i * num_steps:(i + 1) * num_steps]
for i in xrange(0, len(infer_data))
] ]
ds2_model = DeepSpeech2Model(
vocab_size=data_generator.vocab_size,
num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers,
rnn_layer_size=args.rnn_layer_size,
pretrained_model_path=args.model_filepath)
# create grid for search # create grid for search
cand_alphas = np.linspace(args.alpha_from, args.alpha_to, args.num_alphas) cand_alphas = np.linspace(args.alpha_from, args.alpha_to, args.num_alphas)
cand_betas = np.linspace(args.beta_from, args.beta_to, args.num_betas) cand_betas = np.linspace(args.beta_from, args.beta_to, args.num_betas)
params_grid = [(alpha, beta) for alpha in cand_alphas params_grid = [(alpha, beta) for alpha in cand_alphas
for beta in cand_betas] for beta in cand_betas]
ext_scorer = LmScorer(args.alpha_from, args.beta_from,
args.language_model_path)
## tune parameters in loop ## tune parameters in loop
for alpha, beta in params_grid: for alpha, beta in params_grid:
wer_sum, wer_counter = 0, 0 result_transcripts = ds2_model.infer_batch(
# reset scorer infer_data=tune_data,
ext_scorer.reset_params(alpha, beta) decode_method='beam_search',
# beam search using multiple processes beam_alpha=alpha,
beam_search_results = ctc_beam_search_decoder_batch( beam_beta=beta,
probs_split=probs_split,
vocabulary=data_generator.vocab_list,
beam_size=args.beam_size, beam_size=args.beam_size,
cutoff_prob=args.cutoff_prob, cutoff_prob=args.cutoff_prob,
blank_id=len(data_generator.vocab_list), vocab_list=data_generator.vocab_list,
num_processes=args.num_processes_beam_search, language_model_path=args.language_model_path,
ext_scoring_func=ext_scorer, ) num_processes=args.num_processes_beam_search)
for i, beam_search_result in enumerate(beam_search_results): wer_sum, num_ins = 0.0, 0
target_transcription = ''.join([ for target, result in zip(target_transcripts, result_transcripts):
data_generator.vocab_list[index] for index in infer_data[i][1] wer_sum += wer(target, result)
]) num_ins += 1
wer_sum += wer(target_transcription, beam_search_result[0][1])
wer_counter += 1
print("alpha = %f\tbeta = %f\tWER = %f" % print("alpha = %f\tbeta = %f\tWER = %f" %
(alpha, beta, wer_sum / wer_counter)) (alpha, beta, wer_sum / num_ins))
def main(): def main():
paddle.init(use_gpu=args.use_gpu, trainer_count=1) utils.print_arguments(args)
paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count)
tune() tune()
......
# 深度结构化语义模型 (Deep Structured Semantic Models, DSSM)
DSSM使用DNN模型在一个连续的语义空间中学习文本低纬的表示向量,并且建模两个句子间的语义相似度。
本例演示如何使用 PaddlePaddle实现一个通用的DSSM 模型,用于建模两个字符串间的语义相似度,
模型实现支持通用的数据格式,用户替换数据便可以在真实场景中使用该模型。
## 背景介绍
DSSM \[[1](##参考文献)\]是微软研究院13年提出来的经典的语义模型,用于学习两个文本之间的语义距离,
广义上模型也可以推广和适用如下场景:
1. CTR预估模型,衡量用户搜索词(Query)与候选网页集合(Documents)之间的相关联程度。
2. 文本相关性,衡量两个字符串间的语义相关程度。
3. 自动推荐,衡量User与被推荐的Item之间的关联程度。
DSSM 已经发展成了一个框架,可以很自然地建模两个记录之间的距离关系,
例如对于文本相关性问题,可以用余弦相似度 (cosin similarity) 来刻画语义距离;
而对于搜索引擎的结果排序,可以在DSSM上接上Rank损失训练处一个排序模型。
## 模型简介
在原论文\[[1](#参考文献)\]中,DSSM模型用来衡量用户搜索词 Query 和文档集合 Documents 之间隐含的语义关系,模型结构如下
<p align="center">
<img src="./images/dssm.png"/><br/><br/>
图 1. DSSM 原始结构
</p>
其贯彻的思想是, **用DNN将高维特征向量转化为低纬空间的连续向量(图中红色框部分)**
**在上层用cosin similarity来衡量用户搜索词与候选文档间的语义相关性**
在最顶层损失函数的设计上,原始模型使用类似Word2Vec中负例采样的方法,
一个Query会抽取正例 $D+$ 和4个负例 $D-$ 整体上算条件概率用对数似然函数作为损失,
这也就是图 1中类似 $P(D_1|Q)$ 的结构,具体细节请参考原论文。
随着后续优化DSSM模型的结构得以简化\[[3](#参考文献)\],演变为:
<p align="center">
<img src="./images/dssm2.png" width="600"/><br/><br/>
图 2. DSSM通用结构
</p>
图中的空白方框可以用任何模型替代,比如全连接FC,卷积CNN,RNN等都可以,
该模型结构专门用于衡量两个元素(比如字符串)间的语义距离。
在现实使用中,DSSM模型会作为基础的积木,搭配上不同的损失函数来实现具体的功能,比如
- 在排序学习中,将 图 2 中结构添加 pairwise rank损失,变成一个排序模型
- 在CTR预估中,对点击与否做0,1二元分类,添加交叉熵损失变成一个分类模型
- 在需要对一个子串打分时,可以使用余弦相似度来计算相似度,变成一个回归模型
本例将尝试面向应用提供一个比较通用的解决方案,在模型任务类型上支持
- 分类
- [-1, 1] 值域内的回归
- Pairwise-Rank
在生成低纬语义向量的模型结构上,本模型支持以下三种:
- FC, 多层全连接层
- CNN,卷积神经网络
- RNN,递归神经网络
## 模型实现
DSSM模型可以拆成三小块实现,分别是左边和右边的DNN,以及顶层的损失函数。
在复杂任务中,左右两边DNN的结构可以是不同的,比如在原始论文中左右分别学习Query和Document的semantic vector,
两者数据的数据不同,建议对应定制DNN的结构。
本例中为了简便和通用,将左右两个DNN的结构都设为相同的,因此只有三个选项FC,CNN,RNN等。
在损失函数的设计方面,也支持三种,分类, 回归, 排序;
其中,在回归和排序两种损失中,左右两边的匹配程度通过余弦相似度(cossim)来计算;
在分类任务中,类别预测的分布通过softmax计算。
在其它教程中,对上述很多内容都有过详细的介绍,例如:
- 如何CNN, FC 做文本信息提取可以参考 [text classification](https://github.com/PaddlePaddle/models/blob/develop/text_classification/README.md#模型详解)
- RNN/GRU 的内容可以参考 [Machine Translation](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.md#gated-recurrent-unit-gru)
- Pairwise Rank即排序学习可参考 [learn to rank](https://github.com/PaddlePaddle/models/blob/develop/ltr/README.md)
相关原理在此不再赘述,本文接下来的篇幅主要集中介绍使用PaddlePaddle实现这些结构上。
如图3,回归和分类模型的结构很相似
<p align="center">
<img src="./images/dssm3.jpg"/><br/><br/>
图 3. DSSM for REGRESSION or CLASSIFICATION
</p>
最重要的组成部分包括词向量,图中`(1)`,`(2)`两个低纬向量的学习器(可以用RNN/CNN/FC中的任意一种实现),
最上层对应的损失函数。
而Pairwise Rank的结构会复杂一些,类似两个 图 4. 中的结构,增加了对应的损失函数:
- 模型总体思想是,用同一个source(源)为左右两个target(目标)分别打分——`(a),(b)`,学习目标是(a),(b)间的大小关系
- `(a)``(b)`类似图3中结构,用于给source和target的pair打分
- `(1)``(2)`的结构其实是共用的,都表示同一个source,图中为了表达效果展开成两个
<p align="center">
<img src="./images/dssm2.jpg"/><br/><br/>
图 4. DSSM for Pairwise Rank
</p>
下面是各个部分具体的实现方法,所有的代码均包含在 `./network_conf.py` 中。
### 创建文本的词向量表
```python
def create_embedding(self, input, prefix=''):
'''
Create an embedding table whose name has a `prefix`.
'''
logger.info("create embedding table [%s] which dimention is %d" %
(prefix, self.dnn_dims[0]))
emb = paddle.layer.embedding(
input=input,
size=self.dnn_dims[0],
param_attr=ParamAttr(name='%s_emb.w' % prefix))
return emb
```
由于输入给词向量表(embedding table)的是一个句子对应的词的ID的列表 ,因此词向量表输出的是词向量的序列。
### CNN 结构实现
```python
def create_cnn(self, emb, prefix=''):
'''
A multi-layer CNN.
@emb: paddle.layer
output of the embedding layer
@prefix: str
prefix of layers' names, used to share parameters between more than one `cnn` parts.
'''
def create_conv(context_len, hidden_size, prefix):
key = "%s_%d_%d" % (prefix, context_len, hidden_size)
conv = paddle.networks.sequence_conv_pool(
input=emb,
context_len=context_len,
hidden_size=hidden_size,
# set parameter attr for parameter sharing
context_proj_param_attr=ParamAttr(name=key + 'contex_proj.w'),
fc_param_attr=ParamAttr(name=key + '_fc.w'),
fc_bias_attr=ParamAttr(name=key + '_fc.b'),
pool_bias_attr=ParamAttr(name=key + '_pool.b'))
return conv
logger.info('create a sequence_conv_pool which context width is 3')
conv_3 = create_conv(3, self.dnn_dims[1], "cnn")
logger.info('create a sequence_conv_pool which context width is 4')
conv_4 = create_conv(4, self.dnn_dims[1], "cnn")
return conv_3, conv_4
```
CNN 接受 embedding table输出的词向量序列,通过卷积和池化操作捕捉到原始句子的关键信息,
最终输出一个语义向量(可以认为是句子向量)。
本例的实现中,分别使用了窗口长度为3和4的CNN学到的句子向量按元素求和得到最终的句子向量。
### RNN 结构实现
RNN很适合学习变长序列的信息,使用RNN来学习句子的信息几乎是自然语言处理任务的标配。
```python
def create_rnn(self, emb, prefix=''):
'''
A GRU sentence vector learner.
'''
gru = paddle.layer.gru_memory(input=emb,)
sent_vec = paddle.layer.last_seq(gru)
return sent_vec
```
### FC 结构实现
```python
def create_fc(self, emb, prefix=''):
'''
A multi-layer fully connected neural networks.
@emb: paddle.layer
output of the embedding layer
@prefix: str
prefix of layers' names, used to share parameters between more than one `fc` parts.
'''
_input_layer = paddle.layer.pooling(
input=emb, pooling_type=paddle.pooling.Max())
fc = paddle.layer.fc(input=_input_layer, size=self.dnn_dims[1])
return fc
```
在构建FC时需要首先使用`paddle.layer.pooling` 对词向量序列进行最大池化操作,将边长序列转化为一个固定维度向量,
作为整个句子的语义表达,使用最大池化能够降低句子长度对句向量表达的影响。
### 多层DNN实现
在 CNN/DNN/FC提取出 semantic vector后,在上层可继续接多层FC来实现深层DNN结构。
```python
def create_dnn(self, sent_vec, prefix):
# if more than three layers exists, a fc layer will be added.
if len(self.dnn_dims) > 1:
_input_layer = sent_vec
for id, dim in enumerate(self.dnn_dims[1:]):
name = "%s_fc_%d_%d" % (prefix, id, dim)
logger.info("create fc layer [%s] which dimention is %d" %
(name, dim))
fc = paddle.layer.fc(
input=_input_layer,
size=dim,
name=name,
act=paddle.activation.Tanh(),
param_attr=ParamAttr(name='%s.w' % name),
bias_attr=ParamAttr(name='%s.b' % name),
)
_input_layer = fc
return _input_layer
```
### 分类或回归实现
分类和回归的结构比较相似,因此可以用一个函数创建出来
```python
def _build_classification_or_regression_model(self, is_classification):
'''
Build a classification/regression model, and the cost is returned.
A Classification has 3 inputs:
- source sentence
- target sentence
- classification label
'''
# prepare inputs.
assert self.class_num
source = paddle.layer.data(
name='source_input',
type=paddle.data_type.integer_value_sequence(self.vocab_sizes[0]))
target = paddle.layer.data(
name='target_input',
type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1]))
label = paddle.layer.data(
name='label_input',
type=paddle.data_type.integer_value(self.class_num)
if is_classification else paddle.data_type.dense_input)
prefixs = '_ _'.split(
) if self.share_semantic_generator else 'left right'.split()
embed_prefixs = '_ _'.split(
) if self.share_embed else 'left right'.split()
word_vecs = []
for id, input in enumerate([source, target]):
x = self.create_embedding(input, prefix=embed_prefixs[id])
word_vecs.append(x)
semantics = []
for id, input in enumerate(word_vecs):
x = self.model_arch_creater(input, prefix=prefixs[id])
semantics.append(x)
concated_vector = paddle.layer.concat(semantics)
prediction = paddle.layer.fc(
input=concated_vector,
size=self.class_num,
act=paddle.activation.Softmax())
cost = paddle.layer.classification_cost(
input=prediction,
label=label) if is_classification else paddle.layer.mse_cost(
prediction, label)
return cost, prediction, label
```
### Pairwise Rank实现
Pairwise Rank复用上面的DNN结构,同一个source对两个target求相似度打分,
如果左边的target打分高,预测为1,否则预测为 0。
```python
def _build_rank_model(self):
'''
Build a pairwise rank model, and the cost is returned.
A pairwise rank model has 3 inputs:
- source sentence
- left_target sentence
- right_target sentence
- label, 1 if left_target should be sorted in front of right_target, otherwise 0.
'''
source = paddle.layer.data(
name='source_input',
type=paddle.data_type.integer_value_sequence(self.vocab_sizes[0]))
left_target = paddle.layer.data(
name='left_target_input',
type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1]))
right_target = paddle.layer.data(
name='right_target_input',
type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1]))
label = paddle.layer.data(
name='label_input', type=paddle.data_type.integer_value(1))
prefixs = '_ _ _'.split(
) if self.share_semantic_generator else 'source left right'.split()
embed_prefixs = '_ _'.split(
) if self.share_embed else 'source target target'.split()
word_vecs = []
for id, input in enumerate([source, left_target, right_target]):
x = self.create_embedding(input, prefix=embed_prefixs[id])
word_vecs.append(x)
semantics = []
for id, input in enumerate(word_vecs):
x = self.model_arch_creater(input, prefix=prefixs[id])
semantics.append(x)
# cossim score of source and left_target
left_score = paddle.layer.cos_sim(semantics[0], semantics[1])
# cossim score of source and right target
right_score = paddle.layer.cos_sim(semantics[0], semantics[2])
# rank cost
cost = paddle.layer.rank_cost(left_score, right_score, label=label)
# prediction = left_score - right_score
# but this operator is not supported currently.
# so AUC will not used.
return cost, None, None
```
## 数据格式
`./data` 中有简单的示例数据
### 回归的数据格式
```
# 3 fields each line:
# - source's word ids
# - target's word ids
# - target
<ids> \t <ids> \t <float>
```
比如:
```
3 6 10 \t 6 8 33 \t 0.7
6 0 \t 6 9 330 \t 0.03
```
### 分类的数据格式
```
# 3 fields each line:
# - source's word ids
# - target's word ids
# - target
<ids> \t <ids> \t <label>
```
比如:
```
3 6 10 \t 6 8 33 \t 0
6 10 \t 8 3 1 \t 1
```
### 排序的数据格式
```
# 4 fields each line:
# - source's word ids
# - target1's word ids
# - target2's word ids
# - label
<ids> \t <ids> \t <ids> \t <label>
```
比如:
```
7 2 4 \t 2 10 12 \t 9 2 7 10 23 \t 0
7 2 4 \t 10 12 \t 9 2 21 23 \t 1
```
## 执行训练
可以直接执行 `python train.py -y 0 --model_arch 0` 使用 `./data/classification` 目录里简单的数据来训练一个分类的FC模型。
其他模型结构也可以通过命令行实现定制,详细命令行参数如下
```
usage: train.py [-h] [-i TRAIN_DATA_PATH] [-t TEST_DATA_PATH]
[-s SOURCE_DIC_PATH] [--target_dic_path TARGET_DIC_PATH]
[-b BATCH_SIZE] [-p NUM_PASSES] -y MODEL_TYPE -a MODEL_ARCH
[--share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET]
[--share_embed SHARE_EMBED] [--dnn_dims DNN_DIMS]
[--num_workers NUM_WORKERS] [--use_gpu USE_GPU] [-c CLASS_NUM]
[--model_output_prefix MODEL_OUTPUT_PREFIX]
[-g NUM_BATCHES_TO_LOG] [-e NUM_BATCHES_TO_TEST]
[-z NUM_BATCHES_TO_SAVE_MODEL]
PaddlePaddle DSSM example
optional arguments:
-h, --help show this help message and exit
-i TRAIN_DATA_PATH, --train_data_path TRAIN_DATA_PATH
path of training dataset
-t TEST_DATA_PATH, --test_data_path TEST_DATA_PATH
path of testing dataset
-s SOURCE_DIC_PATH, --source_dic_path SOURCE_DIC_PATH
path of the source's word dic
--target_dic_path TARGET_DIC_PATH
path of the target's word dic, if not set, the
`source_dic_path` will be used
-b BATCH_SIZE, --batch_size BATCH_SIZE
size of mini-batch (default:10)
-p NUM_PASSES, --num_passes NUM_PASSES
number of passes to run(default:10)
-y MODEL_TYPE, --model_type MODEL_TYPE
model type, 0 for classification, 1 for pairwise rank,
2 for regression (default: classification)
-a MODEL_ARCH, --model_arch MODEL_ARCH
model architecture, 1 for CNN, 0 for FC, 2 for RNN
--share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET
whether to share network parameters between source and
target
--share_embed SHARE_EMBED
whether to share word embedding between source and
target
--dnn_dims DNN_DIMS dimentions of dnn layers, default is '256,128,64,32',
which means create a 4-layer dnn, demention of each
layer is 256, 128, 64 and 32
--num_workers NUM_WORKERS
num worker threads, default 1
--use_gpu USE_GPU whether to use GPU devices (default: False)
-c CLASS_NUM, --class_num CLASS_NUM
number of categories for classification task.
--model_output_prefix MODEL_OUTPUT_PREFIX
prefix of the path for model to store, (default: ./)
-g NUM_BATCHES_TO_LOG, --num_batches_to_log NUM_BATCHES_TO_LOG
number of batches to output train log, (default: 100)
-e NUM_BATCHES_TO_TEST, --num_batches_to_test NUM_BATCHES_TO_TEST
number of batches to test, (default: 200)
-z NUM_BATCHES_TO_SAVE_MODEL, --num_batches_to_save_model NUM_BATCHES_TO_SAVE_MODEL
number of batches to output model, (default: 400)
```
重要的参数描述如下
- `train_data_path` 训练数据路径
- `test_data_path` 测试数据路局,可以不设置
- `source_dic_path` 源字典字典路径
- `target_dic_path` 目标字典路径
- `model_type` 模型的损失函数的类型,分类0,排序1,回归2
- `model_arch` 模型结构,FC 0, CNN 1, RNN 2
- `dnn_dims` 模型各层的维度设置,默认为 `256,128,64,32`,即模型有4层,各层维度如上设置
## 用训练好的模型预测
```
usage: infer.py [-h] --model_path MODEL_PATH -i DATA_PATH -o
PREDICTION_OUTPUT_PATH -y MODEL_TYPE [-s SOURCE_DIC_PATH]
[--target_dic_path TARGET_DIC_PATH] -a MODEL_ARCH
[--share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET]
[--share_embed SHARE_EMBED] [--dnn_dims DNN_DIMS]
[-c CLASS_NUM]
PaddlePaddle DSSM infer
optional arguments:
-h, --help show this help message and exit
--model_path MODEL_PATH
path of model parameters file
-i DATA_PATH, --data_path DATA_PATH
path of the dataset to infer
-o PREDICTION_OUTPUT_PATH, --prediction_output_path PREDICTION_OUTPUT_PATH
path to output the prediction
-y MODEL_TYPE, --model_type MODEL_TYPE
model type, 0 for classification, 1 for pairwise rank,
2 for regression (default: classification)
-s SOURCE_DIC_PATH, --source_dic_path SOURCE_DIC_PATH
path of the source's word dic
--target_dic_path TARGET_DIC_PATH
path of the target's word dic, if not set, the
`source_dic_path` will be used
-a MODEL_ARCH, --model_arch MODEL_ARCH
model architecture, 1 for CNN, 0 for FC, 2 for RNN
--share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET
whether to share network parameters between source and
target
--share_embed SHARE_EMBED
whether to share word embedding between source and
target
--dnn_dims DNN_DIMS dimentions of dnn layers, default is '256,128,64,32',
which means create a 4-layer dnn, demention of each
layer is 256, 128, 64 and 32
-c CLASS_NUM, --class_num CLASS_NUM
number of categories for classification task.
```
部分参数可以参考 `train.py`,重要参数解释如下
- `data_path` 需要预测的数据路径
- `prediction_output_path` 预测的输出路径
## 参考文献
1. Huang P S, He X, Gao J, et al. Learning deep structured semantic models for web search using clickthrough data[C]//Proceedings of the 22nd ACM international conference on Conference on information & knowledge management. ACM, 2013: 2333-2338.
2. [Microsoft Learning to Rank Datasets](https://www.microsoft.com/en-us/research/project/mslr/)
3. [Gao J, He X, Deng L. Deep Learning for Web Search and Natural Language Processing[J]. Microsoft Research Technical Report, 2015.](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/wsdm2015.v3.pdf)
苹果 苹果 6s 0
汽车 驾驶 驾校 培训 1
苹果 六 袋 苹果 6s 0
新手 汽车 驾驶 驾校 培训 1
苹果 六 袋 苹果 6s 新手 汽车 驾驶 1
新手 汽车 驾驶 驾校 培训 苹果 6s 0
苹果 六 袋 苹果 6s 新手 汽车 驾驶 1
新手 汽车 驾驶 驾校 培训 苹果 6s 1
UNK
苹果
6s
新手
汽车
驾驶
驾校
培训
\ No newline at end of file
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import argparse
import itertools
import reader
import paddle.v2 as paddle
from network_conf import DSSM
from utils import logger, ModelType, ModelArch, load_dic
parser = argparse.ArgumentParser(description="PaddlePaddle DSSM infer")
parser.add_argument(
'--model_path',
type=str,
required=True,
help="path of model parameters file")
parser.add_argument(
'-i',
'--data_path',
type=str,
required=True,
help="path of the dataset to infer")
parser.add_argument(
'-o',
'--prediction_output_path',
type=str,
required=True,
help="path to output the prediction")
parser.add_argument(
'-y',
'--model_type',
type=int,
required=True,
default=ModelType.CLASSIFICATION_MODE,
help="model type, %d for classification, %d for pairwise rank, %d for regression (default: classification)"
% (ModelType.CLASSIFICATION_MODE, ModelType.RANK_MODE,
ModelType.REGRESSION_MODE))
parser.add_argument(
'-s',
'--source_dic_path',
type=str,
required=False,
help="path of the source's word dic")
parser.add_argument(
'--target_dic_path',
type=str,
required=False,
help="path of the target's word dic, if not set, the `source_dic_path` will be used"
)
parser.add_argument(
'-a',
'--model_arch',
type=int,
required=True,
default=ModelArch.CNN_MODE,
help="model architecture, %d for CNN, %d for FC, %d for RNN" %
(ModelArch.CNN_MODE, ModelArch.FC_MODE, ModelArch.RNN_MODE))
parser.add_argument(
'--share_network_between_source_target',
type=bool,
default=False,
help="whether to share network parameters between source and target")
parser.add_argument(
'--share_embed',
type=bool,
default=False,
help="whether to share word embedding between source and target")
parser.add_argument(
'--dnn_dims',
type=str,
default='256,128,64,32',
help="dimentions of dnn layers, default is '256,128,64,32', which means create a 4-layer dnn, demention of each layer is 256, 128, 64 and 32"
)
parser.add_argument(
'-c',
'--class_num',
type=int,
default=0,
help="number of categories for classification task.")
args = parser.parse_args()
args.model_type = ModelType(args.model_type)
args.model_arch = ModelArch(args.model_arch)
if args.model_type.is_classification():
assert args.class_num > 1, "--class_num should be set in classification task."
layer_dims = map(int, args.dnn_dims.split(','))
args.target_dic_path = args.source_dic_path if not args.target_dic_path else args.target_dic_path
paddle.init(use_gpu=False, trainer_count=1)
class Inferer(object):
def __init__(self, param_path):
logger.info("create DSSM model")
prediction = DSSM(
dnn_dims=layer_dims,
vocab_sizes=[
len(load_dic(path))
for path in [args.source_dic_path, args.target_dic_path]
],
model_type=args.model_type,
model_arch=args.model_arch,
share_semantic_generator=args.share_network_between_source_target,
class_num=args.class_num,
share_embed=args.share_embed,
is_infer=True)()
# load parameter
logger.info("load model parameters from %s" % param_path)
self.parameters = paddle.parameters.Parameters.from_tar(
open(param_path, 'r'))
self.inferer = paddle.inference.Inference(
output_layer=prediction, parameters=self.parameters)
def infer(self, data_path):
logger.info("infer data...")
dataset = reader.Dataset(
train_path=data_path,
test_path=None,
source_dic_path=args.source_dic_path,
target_dic_path=args.target_dic_path,
model_type=args.model_type, )
infer_reader = paddle.batch(dataset.infer, batch_size=1000)
logger.warning('write predictions to %s' % args.prediction_output_path)
output_f = open(args.prediction_output_path, 'w')
for id, batch in enumerate(infer_reader()):
res = self.inferer.infer(input=batch)
predictions = [' '.join(map(str, x)) for x in res]
assert len(batch) == len(
predictions), "predict error, %d inputs, but %d predictions" % (
len(batch), len(predictions))
output_f.write('\n'.join(map(str, predictions)) + '\n')
if __name__ == '__main__':
inferer = Inferer(args.model_path)
inferer.infer(args.data_path)
from paddle import v2 as paddle
from paddle.v2.attr import ParamAttr
from utils import TaskType, logger, ModelType, ModelArch
class DSSM(object):
def __init__(self,
dnn_dims=[],
vocab_sizes=[],
model_type=ModelType.create_classification(),
model_arch=ModelArch.create_cnn(),
share_semantic_generator=False,
class_num=None,
share_embed=False,
is_infer=False):
'''
@dnn_dims: list of int
dimentions of each layer in semantic vector generator.
@vocab_sizes: 2-d tuple
size of both left and right items.
@model_type: int
type of task, should be 'rank: 0', 'regression: 1' or 'classification: 2'
@model_arch: int
model architecture
@share_semantic_generator: bool
whether to share the semantic vector generator for both left and right.
@share_embed: bool
whether to share the embeddings between left and right.
@class_num: int
number of categories.
'''
assert len(
vocab_sizes
) == 2, "vocab_sizes specify the sizes left and right inputs, and dim should be 2."
assert len(dnn_dims) > 1, "more than two layers is needed."
self.dnn_dims = dnn_dims
self.vocab_sizes = vocab_sizes
self.share_semantic_generator = share_semantic_generator
self.share_embed = share_embed
self.model_type = ModelType(model_type)
self.model_arch = ModelArch(model_arch)
self.class_num = class_num
self.is_infer = is_infer
logger.warning("build DSSM model with config of %s, %s" %
(self.model_type, self.model_arch))
logger.info("vocabulary sizes: %s" % str(self.vocab_sizes))
# bind model architecture
_model_arch = {
'cnn': self.create_cnn,
'fc': self.create_fc,
'rnn': self.create_rnn,
}
def _model_arch_creater(emb, prefix=''):
sent_vec = _model_arch.get(str(model_arch))(emb, prefix)
dnn = self.create_dnn(sent_vec, prefix)
return dnn
self.model_arch_creater = _model_arch_creater
# build model type
_model_type = {
'classification': self._build_classification_model,
'rank': self._build_rank_model,
'regression': self._build_regression_model,
}
print 'model type: ', str(self.model_type)
self.model_type_creater = _model_type[str(self.model_type)]
def __call__(self):
return self.model_type_creater()
def create_embedding(self, input, prefix=''):
'''
Create an embedding table whose name has a `prefix`.
'''
logger.info("create embedding table [%s] which dimention is %d" %
(prefix, self.dnn_dims[0]))
emb = paddle.layer.embedding(
input=input,
size=self.dnn_dims[0],
param_attr=ParamAttr(name='%s_emb.w' % prefix))
return emb
def create_fc(self, emb, prefix=''):
'''
A multi-layer fully connected neural networks.
@emb: paddle.layer
output of the embedding layer
@prefix: str
prefix of layers' names, used to share parameters between more than one `fc` parts.
'''
_input_layer = paddle.layer.pooling(
input=emb, pooling_type=paddle.pooling.Max())
fc = paddle.layer.fc(input=_input_layer, size=self.dnn_dims[1])
return fc
def create_rnn(self, emb, prefix=''):
'''
A GRU sentence vector learner.
'''
gru = paddle.networks.simple_gru(input=emb, size=256)
sent_vec = paddle.layer.last_seq(gru)
return sent_vec
def create_cnn(self, emb, prefix=''):
'''
A multi-layer CNN.
@emb: paddle.layer
output of the embedding layer
@prefix: str
prefix of layers' names, used to share parameters between more than one `cnn` parts.
'''
def create_conv(context_len, hidden_size, prefix):
key = "%s_%d_%d" % (prefix, context_len, hidden_size)
conv = paddle.networks.sequence_conv_pool(
input=emb,
context_len=context_len,
hidden_size=hidden_size,
# set parameter attr for parameter sharing
context_proj_param_attr=ParamAttr(name=key + 'contex_proj.w'),
fc_param_attr=ParamAttr(name=key + '_fc.w'),
fc_bias_attr=ParamAttr(name=key + '_fc.b'),
pool_bias_attr=ParamAttr(name=key + '_pool.b'))
return conv
logger.info('create a sequence_conv_pool which context width is 3')
conv_3 = create_conv(3, self.dnn_dims[1], "cnn")
logger.info('create a sequence_conv_pool which context width is 4')
conv_4 = create_conv(4, self.dnn_dims[1], "cnn")
return conv_3, conv_4
def create_dnn(self, sent_vec, prefix):
# if more than three layers, than a fc layer will be added.
if len(self.dnn_dims) > 1:
_input_layer = sent_vec
for id, dim in enumerate(self.dnn_dims[1:]):
name = "%s_fc_%d_%d" % (prefix, id, dim)
logger.info("create fc layer [%s] which dimention is %d" %
(name, dim))
fc = paddle.layer.fc(
name=name,
input=_input_layer,
size=dim,
act=paddle.activation.Tanh(),
param_attr=ParamAttr(name='%s.w' % name),
bias_attr=ParamAttr(name='%s.b' % name))
_input_layer = fc
return _input_layer
def _build_classification_model(self):
logger.info("build classification model")
assert self.model_type.is_classification()
return self._build_classification_or_regression_model(
is_classification=True)
def _build_regression_model(self):
logger.info("build regression model")
assert self.model_type.is_regression()
return self._build_classification_or_regression_model(
is_classification=False)
def _build_rank_model(self):
'''
Build a pairwise rank model, and the cost is returned.
A pairwise rank model has 3 inputs:
- source sentence
- left_target sentence
- right_target sentence
- label, 1 if left_target should be sorted in front of right_target, otherwise 0.
'''
logger.info("build rank model")
assert self.model_type.is_rank()
source = paddle.layer.data(
name='source_input',
type=paddle.data_type.integer_value_sequence(self.vocab_sizes[0]))
left_target = paddle.layer.data(
name='left_target_input',
type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1]))
right_target = paddle.layer.data(
name='right_target_input',
type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1]))
if not self.is_infer:
label = paddle.layer.data(
name='label_input', type=paddle.data_type.integer_value(1))
prefixs = '_ _ _'.split(
) if self.share_semantic_generator else 'source left right'.split()
embed_prefixs = '_ _'.split(
) if self.share_embed else 'source target target'.split()
word_vecs = []
for id, input in enumerate([source, left_target, right_target]):
x = self.create_embedding(input, prefix=embed_prefixs[id])
word_vecs.append(x)
semantics = []
for id, input in enumerate(word_vecs):
x = self.model_arch_creater(input, prefix=prefixs[id])
semantics.append(x)
# cossim score of source and left_target
left_score = paddle.layer.cos_sim(semantics[0], semantics[1])
# cossim score of source and right target
right_score = paddle.layer.cos_sim(semantics[0], semantics[2])
if not self.is_infer:
# rank cost
cost = paddle.layer.rank_cost(left_score, right_score, label=label)
# prediction = left_score - right_score
# but this operator is not supported currently.
# so AUC will not used.
return cost, None, label
return right_score
def _build_classification_or_regression_model(self, is_classification):
'''
Build a classification/regression model, and the cost is returned.
A Classification has 3 inputs:
- source sentence
- target sentence
- classification label
'''
if is_classification:
# prepare inputs.
assert self.class_num
source = paddle.layer.data(
name='source_input',
type=paddle.data_type.integer_value_sequence(self.vocab_sizes[0]))
target = paddle.layer.data(
name='target_input',
type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1]))
label = paddle.layer.data(
name='label_input',
type=paddle.data_type.integer_value(self.class_num)
if is_classification else paddle.data_type.dense_vector(1))
prefixs = '_ _'.split(
) if self.share_semantic_generator else 'left right'.split()
embed_prefixs = '_ _'.split(
) if self.share_embed else 'left right'.split()
word_vecs = []
for id, input in enumerate([source, target]):
x = self.create_embedding(input, prefix=embed_prefixs[id])
word_vecs.append(x)
semantics = []
for id, input in enumerate(word_vecs):
x = self.model_arch_creater(input, prefix=prefixs[id])
semantics.append(x)
if is_classification:
concated_vector = paddle.layer.concat(semantics)
prediction = paddle.layer.fc(
input=concated_vector,
size=self.class_num,
act=paddle.activation.Softmax())
cost = paddle.layer.classification_cost(
input=prediction, label=label)
else:
prediction = paddle.layer.cos_sim(*semantics)
cost = paddle.layer.mse_cost(prediction, label)
if not self.is_infer:
return cost, prediction, label
return prediction
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from utils import UNK, ModelType, TaskType, load_dic, sent2ids, logger, ModelType
class Dataset(object):
def __init__(self, train_path, test_path, source_dic_path, target_dic_path,
model_type):
self.train_path = train_path
self.test_path = test_path
self.source_dic_path = source_dic_path
self.target_dic_path = target_dic_path
self.model_type = ModelType(model_type)
self.source_dic = load_dic(self.source_dic_path)
self.target_dic = load_dic(self.target_dic_path)
_record_reader = {
ModelType.CLASSIFICATION_MODE: self._read_classification_record,
ModelType.REGRESSION_MODE: self._read_regression_record,
ModelType.RANK_MODE: self._read_rank_record,
}
assert isinstance(model_type, ModelType)
self.record_reader = _record_reader[model_type.mode]
self.is_infer = False
def train(self):
'''
Load trainset.
'''
logger.info("[reader] load trainset from %s" % self.train_path)
with open(self.train_path) as f:
for line_id, line in enumerate(f):
yield self.record_reader(line)
def test(self):
'''
Load testset.
'''
# logger.info("[reader] load testset from %s" % self.test_path)
with open(self.test_path) as f:
for line_id, line in enumerate(f):
yield self.record_reader(line)
def infer(self):
self.is_infer = True
with open(self.train_path) as f:
for line in f:
yield self.record_reader(line)
def _read_classification_record(self, line):
'''
data format:
<source words> [TAB] <target words> [TAB] <label>
@line: str
a string line which represent a record.
'''
fs = line.strip().split('\t')
assert len(fs) == 3, "wrong format for classification\n" + \
"the format shoud be " +\
"<source words> [TAB] <target words> [TAB] <label>'"
source = sent2ids(fs[0], self.source_dic)
target = sent2ids(fs[1], self.target_dic)
if not self.is_infer:
label = int(fs[2])
return (source, target, label, )
return source, target
def _read_regression_record(self, line):
'''
data format:
<source words> [TAB] <target words> [TAB] <label>
@line: str
a string line which represent a record.
'''
fs = line.strip().split('\t')
assert len(fs) == 3, "wrong format for regression\n" + \
"the format shoud be " +\
"<source words> [TAB] <target words> [TAB] <label>'"
source = sent2ids(fs[0], self.source_dic)
target = sent2ids(fs[1], self.target_dic)
if not self.is_infer:
label = float(fs[2])
return (source, target, [label], )
return source, target
def _read_rank_record(self, line):
'''
data format:
<source words> [TAB] <left_target words> [TAB] <right_target words> [TAB] <label>
'''
fs = line.strip().split('\t')
assert len(fs) == 4, "wrong format for rank\n" + \
"the format should be " +\
"<source words> [TAB] <left_target words> [TAB] <right_target words> [TAB] <label>"
source = sent2ids(fs[0], self.source_dic)
left_target = sent2ids(fs[1], self.target_dic)
right_target = sent2ids(fs[2], self.target_dic)
if not self.is_infer:
label = int(fs[3])
return (source, left_target, right_target, label)
return source, left_target, right_target
if __name__ == '__main__':
path = './data/classification/train.txt'
test_path = './data/classification/test.txt'
source_dic = './data/vocab.txt'
dataset = Dataset(path, test_path, source_dic, source_dic,
ModelType.CLASSIFICATION)
for rcd in dataset.train():
print rcd
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import argparse
import paddle.v2 as paddle
from network_conf import DSSM
import reader
from utils import TaskType, load_dic, logger, ModelType, ModelArch, display_args
parser = argparse.ArgumentParser(description="PaddlePaddle DSSM example")
parser.add_argument(
'-i',
'--train_data_path',
type=str,
required=False,
help="path of training dataset")
parser.add_argument(
'-t',
'--test_data_path',
type=str,
required=False,
help="path of testing dataset")
parser.add_argument(
'-s',
'--source_dic_path',
type=str,
required=False,
help="path of the source's word dic")
parser.add_argument(
'--target_dic_path',
type=str,
required=False,
help="path of the target's word dic, if not set, the `source_dic_path` will be used"
)
parser.add_argument(
'-b',
'--batch_size',
type=int,
default=10,
help="size of mini-batch (default:10)")
parser.add_argument(
'-p',
'--num_passes',
type=int,
default=10,
help="number of passes to run(default:10)")
parser.add_argument(
'-y',
'--model_type',
type=int,
required=True,
default=ModelType.CLASSIFICATION_MODE,
help="model type, %d for classification, %d for pairwise rank, %d for regression (default: classification)"
% (ModelType.CLASSIFICATION_MODE, ModelType.RANK_MODE,
ModelType.REGRESSION_MODE))
parser.add_argument(
'-a',
'--model_arch',
type=int,
required=True,
default=ModelArch.CNN_MODE,
help="model architecture, %d for CNN, %d for FC, %d for RNN" %
(ModelArch.CNN_MODE, ModelArch.FC_MODE, ModelArch.RNN_MODE))
parser.add_argument(
'--share_network_between_source_target',
type=bool,
default=False,
help="whether to share network parameters between source and target")
parser.add_argument(
'--share_embed',
type=bool,
default=False,
help="whether to share word embedding between source and target")
parser.add_argument(
'--dnn_dims',
type=str,
default='256,128,64,32',
help="dimentions of dnn layers, default is '256,128,64,32', which means create a 4-layer dnn, demention of each layer is 256, 128, 64 and 32"
)
parser.add_argument(
'--num_workers', type=int, default=1, help="num worker threads, default 1")
parser.add_argument(
'--use_gpu',
type=bool,
default=False,
help="whether to use GPU devices (default: False)")
parser.add_argument(
'-c',
'--class_num',
type=int,
default=0,
help="number of categories for classification task.")
parser.add_argument(
'--model_output_prefix',
type=str,
default="./",
help="prefix of the path for model to store, (default: ./)")
parser.add_argument(
'-g',
'--num_batches_to_log',
type=int,
default=100,
help="number of batches to output train log, (default: 100)")
parser.add_argument(
'-e',
'--num_batches_to_test',
type=int,
default=200,
help="number of batches to test, (default: 200)")
parser.add_argument(
'-z',
'--num_batches_to_save_model',
type=int,
default=400,
help="number of batches to output model, (default: 400)")
# arguments check.
args = parser.parse_args()
args.model_type = ModelType(args.model_type)
args.model_arch = ModelArch(args.model_arch)
if args.model_type.is_classification():
assert args.class_num > 1, "--class_num should be set in classification task."
layer_dims = [int(i) for i in args.dnn_dims.split(',')]
args.target_dic_path = args.source_dic_path if not args.target_dic_path else args.target_dic_path
def train(train_data_path=None,
test_data_path=None,
source_dic_path=None,
target_dic_path=None,
model_type=ModelType.create_classification(),
model_arch=ModelArch.create_cnn(),
batch_size=10,
num_passes=10,
share_semantic_generator=False,
share_embed=False,
class_num=None,
num_workers=1,
use_gpu=False):
'''
Train the DSSM.
'''
default_train_path = './data/rank/train.txt'
default_test_path = './data/rank/test.txt'
default_dic_path = './data/vocab.txt'
if not model_type.is_rank():
default_train_path = './data/classification/train.txt'
default_test_path = './data/classification/test.txt'
use_default_data = not train_data_path
if use_default_data:
train_data_path = default_train_path
test_data_path = default_test_path
source_dic_path = default_dic_path
target_dic_path = default_dic_path
dataset = reader.Dataset(
train_path=train_data_path,
test_path=test_data_path,
source_dic_path=source_dic_path,
target_dic_path=target_dic_path,
model_type=model_type, )
train_reader = paddle.batch(
paddle.reader.shuffle(dataset.train, buf_size=1000),
batch_size=batch_size)
test_reader = paddle.batch(
paddle.reader.shuffle(dataset.test, buf_size=1000),
batch_size=batch_size)
paddle.init(use_gpu=use_gpu, trainer_count=num_workers)
cost, prediction, label = DSSM(
dnn_dims=layer_dims,
vocab_sizes=[
len(load_dic(path)) for path in [source_dic_path, target_dic_path]
],
model_type=model_type,
model_arch=model_arch,
share_semantic_generator=share_semantic_generator,
class_num=class_num,
share_embed=share_embed)()
parameters = paddle.parameters.create(cost)
adam_optimizer = paddle.optimizer.Adam(
learning_rate=1e-3,
regularization=paddle.optimizer.L2Regularization(rate=1e-3),
model_average=paddle.optimizer.ModelAverage(average_window=0.5))
trainer = paddle.trainer.SGD(
cost=cost,
extra_layers=paddle.evaluator.auc(input=prediction, label=label)
if not model_type.is_rank() else None,
parameters=parameters,
update_equation=adam_optimizer)
feeding = {}
if model_type.is_classification() or model_type.is_regression():
feeding = {'source_input': 0, 'target_input': 1, 'label_input': 2}
else:
feeding = {
'source_input': 0,
'left_target_input': 1,
'right_target_input': 2,
'label_input': 3
}
def _event_handler(event):
'''
Define batch handler
'''
if isinstance(event, paddle.event.EndIteration):
# output train log
if event.batch_id % args.num_batches_to_log == 0:
logger.info("Pass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics))
# test model
if event.batch_id > 0 and event.batch_id % args.num_batches_to_test == 0:
if test_reader is not None:
if model_type.is_classification():
result = trainer.test(
reader=test_reader, feeding=feeding)
logger.info("Test at Pass %d, %s" % (event.pass_id,
result.metrics))
else:
result = None
# save model
if event.batch_id > 0 and event.batch_id % args.num_batches_to_save_model == 0:
model_desc = "{type}_{arch}".format(
type=str(args.model_type), arch=str(args.model_arch))
with open("%sdssm_%s_pass_%05d.tar" %
(args.model_output_prefix, model_desc,
event.pass_id), "w") as f:
parameters.to_tar(f)
trainer.train(
reader=train_reader,
event_handler=_event_handler,
feeding=feeding,
num_passes=num_passes)
logger.info("Training has finished.")
if __name__ == '__main__':
display_args(args)
train(
train_data_path=args.train_data_path,
test_data_path=args.test_data_path,
source_dic_path=args.source_dic_path,
target_dic_path=args.target_dic_path,
model_type=ModelType(args.model_type),
model_arch=ModelArch(args.model_arch),
batch_size=args.batch_size,
num_passes=args.num_passes,
share_semantic_generator=args.share_network_between_source_target,
share_embed=args.share_embed,
class_num=args.class_num,
num_workers=args.num_workers,
use_gpu=args.use_gpu)
import logging
import paddle
UNK = 0
logger = logging.getLogger("paddle")
logger.setLevel(logging.INFO)
def mode_attr_name(mode):
return mode.upper() + '_MODE'
def create_attrs(cls):
for id, mode in enumerate(cls.modes):
setattr(cls, mode_attr_name(mode), id)
def make_check_method(cls):
'''
create methods for classes.
'''
def method(mode):
def _method(self):
return self.mode == getattr(cls, mode_attr_name(mode))
return _method
for id, mode in enumerate(cls.modes):
setattr(cls, 'is_' + mode, method(mode))
def make_create_method(cls):
def method(mode):
@staticmethod
def _method():
key = getattr(cls, mode_attr_name(mode))
return cls(key)
return _method
for id, mode in enumerate(cls.modes):
setattr(cls, 'create_' + mode, method(mode))
def make_str_method(cls, type_name='unk'):
def _str_(self):
for mode in cls.modes:
if self.mode == getattr(cls, mode_attr_name(mode)):
return mode
def _hash_(self):
return self.mode
setattr(cls, '__str__', _str_)
setattr(cls, '__repr__', _str_)
setattr(cls, '__hash__', _hash_)
cls.__name__ = type_name
def _init_(self, mode, cls):
if isinstance(mode, int):
self.mode = mode
elif isinstance(mode, cls):
self.mode = mode.mode
else:
raise Exception("wrong mode type, get type: %s, value: %s" %
(type(mode), mode))
def build_mode_class(cls):
create_attrs(cls)
make_str_method(cls)
make_check_method(cls)
make_create_method(cls)
class TaskType(object):
modes = 'train test infer'.split()
def __init__(self, mode):
_init_(self, mode, TaskType)
class ModelType:
modes = 'classification rank regression'.split()
def __init__(self, mode):
_init_(self, mode, ModelType)
class ModelArch:
modes = 'fc cnn rnn'.split()
def __init__(self, mode):
_init_(self, mode, ModelArch)
build_mode_class(TaskType)
build_mode_class(ModelType)
build_mode_class(ModelArch)
def sent2ids(sent, vocab):
'''
transform a sentence to a list of ids.
@sent: str
a sentence.
@vocab: dict
a word dic
'''
return [vocab.get(w, UNK) for w in sent.split()]
def load_dic(path):
'''
word dic format:
each line is a word
'''
dic = {}
with open(path) as f:
for id, line in enumerate(f):
w = line.strip()
dic[w] = id
return dic
def display_args(args):
logger.info("arguments passed by command line:")
for k, v in sorted(v for v in vars(args).items()):
logger.info("{}:\t{}".format(k, v))
if __name__ == '__main__':
t = TaskType(1)
t = TaskType.create_train()
print t
print 'is', t.is_train()
...@@ -26,7 +26,7 @@ num_passes = 20 # how many passes to train the model ...@@ -26,7 +26,7 @@ num_passes = 20 # how many passes to train the model
log_period = 50 log_period = 50
save_period_by_batches = 50 save_period_by_batches = 50
use_gpu = True # to use gpu or not use_gpu = False # to use gpu or not
trainer_count = 1 # number of trainer trainer_count = 1 # number of trainer
################## for model configuration ################## ################## for model configuration ##################
......
...@@ -35,8 +35,7 @@ def train(topology, ...@@ -35,8 +35,7 @@ def train(topology,
os.mkdir(model_save_dir) os.mkdir(model_save_dir)
# initialize PaddlePaddle # initialize PaddlePaddle
paddle.init( paddle.init(use_gpu=conf.use_gpu, trainer_count=conf.trainer_count)
use_gpu=conf.use_gpu, gpu_id=3, trainer_count=conf.trainer_count)
# create optimizer # create optimizer
adam_optimizer = paddle.optimizer.Adam( adam_optimizer = paddle.optimizer.Adam(
......
...@@ -4,14 +4,14 @@ ...@@ -4,14 +4,14 @@
RankNet模型在命令行输入: RankNet模型在命令行输入:
```python ```bash
python ranknet.py bash ./run_ranknet.sh
``` ```
LambdaRank模型在命令行输入: LambdaRank模型在命令行输入:
```python ```bash
python lambda_rank.py bash ./run_lambdarank.sh
``` ```
用户只需要使用以上命令就完成排序模型的训练和预测,程序会自动下载内置数据集,无需手动下载。 用户只需要使用以上命令就完成排序模型的训练和预测,程序会自动下载内置数据集,无需手动下载。
...@@ -54,7 +54,7 @@ python lambda_rank.py ...@@ -54,7 +54,7 @@ python lambda_rank.py
例如调用接口 例如调用接口
```python ```bash
pairwise_train_dataset = functools.partial(paddle.dataset.mq2007.train, format="pairwise") pairwise_train_dataset = functools.partial(paddle.dataset.mq2007.train, format="pairwise")
for label, left_doc, right_doc in pairwise_train_dataset(): for label, left_doc, right_doc in pairwise_train_dataset():
... ...
...@@ -104,7 +104,7 @@ $$\lambda _{i,j}=\frac{\partial C}{\partial s_{i}} = \frac{1}{2}(1-S_{i,j})-\fra ...@@ -104,7 +104,7 @@ $$\lambda _{i,j}=\frac{\partial C}{\partial s_{i}} = \frac{1}{2}(1-S_{i,j})-\fra
由于Pairwise中的网络结构是左右对称,可定义一半网络结构,另一半共享网络参数。在PaddlePaddle中允许网络结构中共享连接,具有相同名字的参数将会共享参数。使用PaddlePaddle实现RankNet排序模型,定义网络结构的示例代码如下: 由于Pairwise中的网络结构是左右对称,可定义一半网络结构,另一半共享网络参数。在PaddlePaddle中允许网络结构中共享连接,具有相同名字的参数将会共享参数。使用PaddlePaddle实现RankNet排序模型,定义网络结构的示例代码如下:
```python ```bash
import paddle.v2 as paddle import paddle.v2 as paddle
def half_ranknet(name_prefix, input_dim): def half_ranknet(name_prefix, input_dim):
...@@ -150,7 +150,7 @@ def ranknet(input_dim): ...@@ -150,7 +150,7 @@ def ranknet(input_dim):
RankNet的训练只需要运行命令: RankNet的训练只需要运行命令:
```python ```python
python ranknet.py run ./run_ranknet.sh
``` ```
将会自动下载数据,训练RankNet模型,并将每个轮次的模型参数存储下来。 将会自动下载数据,训练RankNet模型,并将每个轮次的模型参数存储下来。
...@@ -277,7 +277,7 @@ def lambda_rank(input_dim): ...@@ -277,7 +277,7 @@ def lambda_rank(input_dim):
训练LambdaRank模型只需要运行命令: 训练LambdaRank模型只需要运行命令:
```python ```python
python lambda_rank.py bash ./run_lambdarank.sh
``` ```
脚本会自动下载数据,训练LambdaRank模型,并将每个轮次的模型存储下来。 脚本会自动下载数据,训练LambdaRank模型,并将每个轮次的模型存储下来。
...@@ -303,7 +303,7 @@ query_id : 2, relevance_score:2, feature_vector 0:0.1, 1:0.4, 2:0.1 #doc1 ...@@ -303,7 +303,7 @@ query_id : 2, relevance_score:2, feature_vector 0:0.1, 1:0.4, 2:0.1 #doc1
需要转换为Listwise格式,例如 需要转换为Listwise格式,例如
<query_id><relevance_score> <feature_vector> `<query_id><relevance_score> <feature_vector>`
```tex ```tex
1 1 0.1,0.2,0.4 1 1 0.1,0.2,0.4
......
...@@ -46,14 +46,14 @@ ...@@ -46,14 +46,14 @@
RankNet模型在命令行输入: RankNet模型在命令行输入:
```python ```bash
python ranknet.py bash ./run_ranknet.sh
``` ```
LambdaRank模型在命令行输入: LambdaRank模型在命令行输入:
```python ```bash
python lambda_rank.py bash ./run_lambdarank.sh
``` ```
用户只需要使用以上命令就完成排序模型的训练和预测,程序会自动下载内置数据集,无需手动下载。 用户只需要使用以上命令就完成排序模型的训练和预测,程序会自动下载内置数据集,无需手动下载。
...@@ -96,7 +96,7 @@ python lambda_rank.py ...@@ -96,7 +96,7 @@ python lambda_rank.py
例如调用接口 例如调用接口
```python ```bash
pairwise_train_dataset = functools.partial(paddle.dataset.mq2007.train, format="pairwise") pairwise_train_dataset = functools.partial(paddle.dataset.mq2007.train, format="pairwise")
for label, left_doc, right_doc in pairwise_train_dataset(): for label, left_doc, right_doc in pairwise_train_dataset():
... ...
...@@ -146,7 +146,7 @@ $$\lambda _{i,j}=\frac{\partial C}{\partial s_{i}} = \frac{1}{2}(1-S_{i,j})-\fra ...@@ -146,7 +146,7 @@ $$\lambda _{i,j}=\frac{\partial C}{\partial s_{i}} = \frac{1}{2}(1-S_{i,j})-\fra
由于Pairwise中的网络结构是左右对称,可定义一半网络结构,另一半共享网络参数。在PaddlePaddle中允许网络结构中共享连接,具有相同名字的参数将会共享参数。使用PaddlePaddle实现RankNet排序模型,定义网络结构的示例代码如下: 由于Pairwise中的网络结构是左右对称,可定义一半网络结构,另一半共享网络参数。在PaddlePaddle中允许网络结构中共享连接,具有相同名字的参数将会共享参数。使用PaddlePaddle实现RankNet排序模型,定义网络结构的示例代码如下:
```python ```bash
import paddle.v2 as paddle import paddle.v2 as paddle
def half_ranknet(name_prefix, input_dim): def half_ranknet(name_prefix, input_dim):
...@@ -192,7 +192,7 @@ def ranknet(input_dim): ...@@ -192,7 +192,7 @@ def ranknet(input_dim):
RankNet的训练只需要运行命令: RankNet的训练只需要运行命令:
```python ```python
python ranknet.py run ./run_ranknet.sh
``` ```
将会自动下载数据,训练RankNet模型,并将每个轮次的模型参数存储下来。 将会自动下载数据,训练RankNet模型,并将每个轮次的模型参数存储下来。
...@@ -319,7 +319,7 @@ def lambda_rank(input_dim): ...@@ -319,7 +319,7 @@ def lambda_rank(input_dim):
训练LambdaRank模型只需要运行命令: 训练LambdaRank模型只需要运行命令:
```python ```python
python lambda_rank.py bash ./run_lambdarank.sh
``` ```
脚本会自动下载数据,训练LambdaRank模型,并将每个轮次的模型存储下来。 脚本会自动下载数据,训练LambdaRank模型,并将每个轮次的模型存储下来。
...@@ -345,7 +345,7 @@ query_id : 2, relevance_score:2, feature_vector 0:0.1, 1:0.4, 2:0.1 #doc1 ...@@ -345,7 +345,7 @@ query_id : 2, relevance_score:2, feature_vector 0:0.1, 1:0.4, 2:0.1 #doc1
需要转换为Listwise格式,例如 需要转换为Listwise格式,例如
<query_id><relevance_score> <feature_vector> `<query_id><relevance_score> <feature_vector>`
```tex ```tex
1 1 0.1,0.2,0.4 1 1 0.1,0.2,0.4
......
...@@ -3,6 +3,7 @@ import gzip ...@@ -3,6 +3,7 @@ import gzip
import paddle.v2 as paddle import paddle.v2 as paddle
import numpy as np import numpy as np
import functools import functools
import argparse
def lambda_rank(input_dim): def lambda_rank(input_dim):
...@@ -117,6 +118,15 @@ def lambda_rank_infer(pass_id): ...@@ -117,6 +118,15 @@ def lambda_rank_infer(pass_id):
if __name__ == '__main__': if __name__ == '__main__':
parser = argparse.ArgumentParser(description='LambdaRank demo')
parser.add_argument("--run_type", type=str, help="run type is train|infer")
parser.add_argument(
"--num_passes",
type=int,
help="num of passes in train| infer pass number of model")
args = parser.parse_args()
paddle.init(use_gpu=False, trainer_count=1) paddle.init(use_gpu=False, trainer_count=1)
train_lambda_rank(2) if args.run_type == "train":
lambda_rank_infer(pass_id=1) train_lambda_rank(args.num_passes)
elif args.run_type == "infer":
lambda_rank_infer(pass_id=args.num_passes - 1)
...@@ -19,7 +19,7 @@ def ndcg(score_list): ...@@ -19,7 +19,7 @@ def ndcg(score_list):
n = len(score_list) n = len(score_list)
cost = .0 cost = .0
for i in range(n): for i in range(n):
cost += float(score_list[i]) / np.log((i + 1) + 1) cost += float(np.power(2, score_list[i])) / np.log((i + 1) + 1)
return cost return cost
dcg_cost = dcg(score_list) dcg_cost = dcg(score_list)
...@@ -28,14 +28,11 @@ def ndcg(score_list): ...@@ -28,14 +28,11 @@ def ndcg(score_list):
return dcg_cost / ideal_cost return dcg_cost / ideal_cost
class NdcgTest(unittest.TestCase): class TestNDCG(unittest.TestCase):
def __init__(self): def test_array(self):
pass
def runcase(self):
a = [3, 2, 3, 0, 1, 2] a = [3, 2, 3, 0, 1, 2]
value = ndcg(a) value = ndcg(a)
self.assertAlmostEqual(0.961, value, places=3) self.assertAlmostEqual(0.9583, value, places=3)
if __name__ == '__main__': if __name__ == '__main__':
......
...@@ -5,6 +5,7 @@ import functools ...@@ -5,6 +5,7 @@ import functools
import paddle.v2 as paddle import paddle.v2 as paddle
import numpy as np import numpy as np
from metrics import ndcg from metrics import ndcg
import argparse
# ranknet is the classic pairwise learning to rank algorithm # ranknet is the classic pairwise learning to rank algorithm
# http://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf # http://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf
...@@ -102,9 +103,9 @@ def ranknet_infer(pass_id): ...@@ -102,9 +103,9 @@ def ranknet_infer(pass_id):
feature_dim = 46 feature_dim = 46
# we just need half_ranknet to predict a rank score, which can be used in sort documents # we just need half_ranknet to predict a rank score, which can be used in sort documents
output = half_ranknet("left", feature_dim) output = half_ranknet("infer", feature_dim)
parameters = paddle.parameters.Parameters.from_tar( parameters = paddle.parameters.Parameters.from_tar(
gzip.open("ranknet_params_%d.tar.gz" % (pass_id - 1))) gzip.open("ranknet_params_%d.tar.gz" % (pass_id)))
# load data of same query and relevance documents, need ranknet to rank these candidates # load data of same query and relevance documents, need ranknet to rank these candidates
infer_query_id = [] infer_query_id = []
...@@ -118,18 +119,27 @@ def ranknet_infer(pass_id): ...@@ -118,18 +119,27 @@ def ranknet_infer(pass_id):
for query_id, relevance_score, feature_vector in plain_txt_test(): for query_id, relevance_score, feature_vector in plain_txt_test():
infer_query_id.append(query_id) infer_query_id.append(query_id)
infer_data.append(feature_vector) infer_data.append([feature_vector])
# predict score of infer_data document. Re-sort the document base on predict score # predict score of infer_data document. Re-sort the document base on predict score
# in descending order. then we build the ranking documents # in descending order. then we build the ranking documents
scores = paddle.infer( scores = paddle.infer(
output_layer=output, parameters=parameters, input=infer_data) output_layer=output, parameters=parameters, input=infer_data)
print scores
for query_id, score in zip(infer_query_id, scores): for query_id, score in zip(infer_query_id, scores):
print "query_id : ", query_id, " ranknet rank document order : ", score print "query_id : ", query_id, " ranknet rank document order : ", score
if __name__ == '__main__': if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Ranknet demo')
parser.add_argument("--run_type", type=str, help="run type is train|infer")
parser.add_argument(
"--num_passes",
type=int,
help="num of passes in train| infer pass number of model")
args = parser.parse_args()
paddle.init(use_gpu=False, trainer_count=4) paddle.init(use_gpu=False, trainer_count=4)
pass_num = 2 if args.run_type == "train":
train_ranknet(pass_num) train_ranknet(args.num_passes)
ranknet_infer(pass_id=pass_num - 1) elif args.run_type == "infer":
ranknet_infer(pass_id=args.pass_num - 1)
#!/bin/sh
python lambda_rank.py \
--run_type="train" \
--num_passes=10 \
2>&1 | tee lambdarank_train.log
python lambda_rank.py \
--run_type="infer" \
--num_passes=10 \
2>&1 | tee lambdarank_infer.log
#!/bin/sh
python ranknet.py \
--run_type="train" \
--num_passes=10 \
2>&1 | tee rankenet_train.log
python ranknet.py \
--run_type="infer" \
--num_passes=10 \
2>&1 | tee ranknet_infer.log
...@@ -43,7 +43,10 @@ def train(model_save_dir): ...@@ -43,7 +43,10 @@ def train(model_save_dir):
parameters.to_tar(f) parameters.to_tar(f)
trainer.train( trainer.train(
paddle.batch(paddle.dataset.imikolov.train(word_dict, 5), 64), paddle.batch(
paddle.reader.shuffle(
lambda: paddle.dataset.imikolov.train(word_dict, 5)(),
buf_size=1000), 64),
num_passes=1000, num_passes=1000,
event_handler=event_handler) event_handler=event_handler)
......
...@@ -63,4 +63,4 @@ def train(save_dir_path, source_dict_dim, target_dict_dim): ...@@ -63,4 +63,4 @@ def train(save_dir_path, source_dict_dim, target_dict_dim):
if __name__ == '__main__': if __name__ == '__main__':
train(save_dir_path="models", source_dict_dim=3000, target_dict_dim=3000) train(save_dir_path="models", source_dict_dim=30000, target_dict_dim=30000)
...@@ -23,10 +23,10 @@ ...@@ -23,10 +23,10 @@
序列标注可以分为Sequence Classification、Segment Classification和Temporal Classification三类[[1](#参考文献)],本例只考虑Segment Classification,即对输入序列中的每个元素在输出序列中给出对应的标签。对于NER任务,由于需要标识边界,一般采用[BIO标注方法](http://book.paddlepaddle.org/07.label_semantic_roles/)定义的标签集,如下是一个NER的标注结果示例: 序列标注可以分为Sequence Classification、Segment Classification和Temporal Classification三类[[1](#参考文献)],本例只考虑Segment Classification,即对输入序列中的每个元素在输出序列中给出对应的标签。对于NER任务,由于需要标识边界,一般采用[BIO标注方法](http://book.paddlepaddle.org/07.label_semantic_roles/)定义的标签集,如下是一个NER的标注结果示例:
<div align="center"> <p align="center">
<img src="images/ner_label_ins.png" width = "80%" align=center /><br> <img src="images/ner_label_ins.png" width="80%" align="center"/><br/>
图1. BIO标注方法示例 图1. BIO标注方法示例
</div> </p>
根据序列标注结果可以直接得到实体边界和实体类别。类似的,分词、词性标注、语块识别、[语义角色标注](http://book.paddlepaddle.org/07.label_semantic_roles/index.cn.html)等任务都可通过序列标注来解决。使用神经网络模型解决问题的思路通常是:前层网络学习输入的特征表示,网络的最后一层在特征基础上完成最终的任务;对于序列标注问题,通常:使用基于RNN的网络结构学习特征,将学习到的特征接入CRF完成序列标注。实际上是将传统CRF中的线性模型换成了非线性神经网络。沿用CRF的出发点是:CRF使用句子级别的似然概率,能够更好的解决标记偏置问题[[2](#参考文献)]。本例也将基于此思路建立模型。虽然,这里以NER任务作为示例,但所给出的模型可以应用到其他各种序列标注任务中。 根据序列标注结果可以直接得到实体边界和实体类别。类似的,分词、词性标注、语块识别、[语义角色标注](http://book.paddlepaddle.org/07.label_semantic_roles/index.cn.html)等任务都可通过序列标注来解决。使用神经网络模型解决问题的思路通常是:前层网络学习输入的特征表示,网络的最后一层在特征基础上完成最终的任务;对于序列标注问题,通常:使用基于RNN的网络结构学习特征,将学习到的特征接入CRF完成序列标注。实际上是将传统CRF中的线性模型换成了非线性神经网络。沿用CRF的出发点是:CRF使用句子级别的似然概率,能够更好的解决标记偏置问题[[2](#参考文献)]。本例也将基于此思路建立模型。虽然,这里以NER任务作为示例,但所给出的模型可以应用到其他各种序列标注任务中。
...@@ -43,10 +43,10 @@ NER任务的输入是"一句话",目标是识别句子中的实体边界及类 ...@@ -43,10 +43,10 @@ NER任务的输入是"一句话",目标是识别句子中的实体边界及类
3. 将步骤2中的2个词向量序列作为双向RNN的输入,学习输入序列的特征表示,得到新的特性表示序列; 3. 将步骤2中的2个词向量序列作为双向RNN的输入,学习输入序列的特征表示,得到新的特性表示序列;
4. CRF以步骤3中模型学习到的特征为输入,以标记序列为监督信号,实现序列标注。 4. CRF以步骤3中模型学习到的特征为输入,以标记序列为监督信号,实现序列标注。
<div align="center"> <p align="center">
<img src="images/ner_network.png" width = "40%" align=center /><br> <img src="images/ner_network.png" width="40%" align="center"/><br/>
图2. NER 模型网络结构图 图2. NER 模型网络结构图
</div> </p>
## 数据说明 ## 数据说明
......
wget http://cs224d.stanford.edu/assignment2/assignment2.zip if [ -f assignment2.zip ]; then
echo "data exist"
else
wget http://cs224d.stanford.edu/assignment2/assignment2.zip
fi
if [ $? -eq 0 ];then if [ $? -eq 0 ];then
unzip assignment2.zip unzip assignment2.zip
......
...@@ -65,10 +65,10 @@ ...@@ -65,10 +65,10 @@
序列标注可以分为Sequence Classification、Segment Classification和Temporal Classification三类[[1](#参考文献)],本例只考虑Segment Classification,即对输入序列中的每个元素在输出序列中给出对应的标签。对于NER任务,由于需要标识边界,一般采用[BIO标注方法](http://book.paddlepaddle.org/07.label_semantic_roles/)定义的标签集,如下是一个NER的标注结果示例: 序列标注可以分为Sequence Classification、Segment Classification和Temporal Classification三类[[1](#参考文献)],本例只考虑Segment Classification,即对输入序列中的每个元素在输出序列中给出对应的标签。对于NER任务,由于需要标识边界,一般采用[BIO标注方法](http://book.paddlepaddle.org/07.label_semantic_roles/)定义的标签集,如下是一个NER的标注结果示例:
<div align="center"> <p align="center">
<img src="images/ner_label_ins.png" width = "80%" align=center /><br> <img src="images/ner_label_ins.png" width="80%" align="center"/><br/>
图1. BIO标注方法示例 图1. BIO标注方法示例
</div> </p>
根据序列标注结果可以直接得到实体边界和实体类别。类似的,分词、词性标注、语块识别、[语义角色标注](http://book.paddlepaddle.org/07.label_semantic_roles/index.cn.html)等任务都可通过序列标注来解决。使用神经网络模型解决问题的思路通常是:前层网络学习输入的特征表示,网络的最后一层在特征基础上完成最终的任务;对于序列标注问题,通常:使用基于RNN的网络结构学习特征,将学习到的特征接入CRF完成序列标注。实际上是将传统CRF中的线性模型换成了非线性神经网络。沿用CRF的出发点是:CRF使用句子级别的似然概率,能够更好的解决标记偏置问题[[2](#参考文献)]。本例也将基于此思路建立模型。虽然,这里以NER任务作为示例,但所给出的模型可以应用到其他各种序列标注任务中。 根据序列标注结果可以直接得到实体边界和实体类别。类似的,分词、词性标注、语块识别、[语义角色标注](http://book.paddlepaddle.org/07.label_semantic_roles/index.cn.html)等任务都可通过序列标注来解决。使用神经网络模型解决问题的思路通常是:前层网络学习输入的特征表示,网络的最后一层在特征基础上完成最终的任务;对于序列标注问题,通常:使用基于RNN的网络结构学习特征,将学习到的特征接入CRF完成序列标注。实际上是将传统CRF中的线性模型换成了非线性神经网络。沿用CRF的出发点是:CRF使用句子级别的似然概率,能够更好的解决标记偏置问题[[2](#参考文献)]。本例也将基于此思路建立模型。虽然,这里以NER任务作为示例,但所给出的模型可以应用到其他各种序列标注任务中。
...@@ -85,10 +85,10 @@ NER任务的输入是"一句话",目标是识别句子中的实体边界及类 ...@@ -85,10 +85,10 @@ NER任务的输入是"一句话",目标是识别句子中的实体边界及类
3. 将步骤2中的2个词向量序列作为双向RNN的输入,学习输入序列的特征表示,得到新的特性表示序列; 3. 将步骤2中的2个词向量序列作为双向RNN的输入,学习输入序列的特征表示,得到新的特性表示序列;
4. CRF以步骤3中模型学习到的特征为输入,以标记序列为监督信号,实现序列标注。 4. CRF以步骤3中模型学习到的特征为输入,以标记序列为监督信号,实现序列标注。
<div align="center"> <p align="center">
<img src="images/ner_network.png" width = "40%" align=center /><br> <img src="images/ner_network.png" width="40%" align="center"/><br/>
图2. NER 模型网络结构图 图2. NER 模型网络结构图
</div> </p>
## 数据说明 ## 数据说明
......
...@@ -21,7 +21,7 @@ def canonicalize_word(word, wordset=None, digits=True): ...@@ -21,7 +21,7 @@ def canonicalize_word(word, wordset=None, digits=True):
if (wordset != None) and (word in wordset): return word if (wordset != None) and (word in wordset): return word
word = canonicalize_digits(word) # try to canonicalize numbers word = canonicalize_digits(word) # try to canonicalize numbers
if (wordset == None) or (word in wordset): return word if (wordset == None) or (word in wordset): return word
else: return "<UNK>" # unknown token else: return "UUUNKKK" # unknown token
def data_reader(data_file, word_dict, label_dict): def data_reader(data_file, word_dict, label_dict):
...@@ -35,7 +35,7 @@ def data_reader(data_file, word_dict, label_dict): ...@@ -35,7 +35,7 @@ def data_reader(data_file, word_dict, label_dict):
""" """
def reader(): def reader():
UNK_IDX = word_dict["<UNK>"] UNK_IDX = word_dict["UUUNKKK"]
sentence = [] sentence = []
labels = [] labels = []
......
...@@ -106,4 +106,5 @@ if __name__ == "__main__": ...@@ -106,4 +106,5 @@ if __name__ == "__main__":
test_data_file="data/test", test_data_file="data/test",
vocab_file="data/vocab.txt", vocab_file="data/vocab.txt",
target_file="data/target.txt", target_file="data/target.txt",
emb_file="data/wordVectors.txt") emb_file="data/wordVectors.txt",
model_save_dir="model/")
# SSD目标检测
## 概述
SSD全称为Single Shot MultiBox Detector,是目标检测领域较新且效果较好的检测算法之一,具体参见论文\[[1](#引用)\]。SSD算法主要特点是检测速度快且检测精度高。PaddlePaddle已集成SSD算法,本示例旨在介绍如何使用PaddlePaddle中的SSD模型进行目标检测。下文展开顺序为:首先简要介绍SSD原理,然后介绍示例包含文件及作用,接着介绍如何在PASCAL VOC数据集上训练、评估及检测,最后简要介绍如何在自有数据集上使用SSD。
## SSD原理
SSD使用一个卷积神经网络实现“端到端”的检测,所谓“端到端”指输入为原始图像,输出为检测结果,无需借助外部工具或流程进行特征提取、候选框生成等。论文中SSD的基础模型为VGG16\[[2](#引用)\],不同于原始VGG16网络模型,SSD做了一些改变:
1. 将最后的fc6、fc7全连接层变为卷积层,卷积层参数通过对原始fc6、fc7参数采样得到。
2. 将pool5层的参数由2x2-s2(kernel大小为2x2,stride size为2)更改为3x3-s1-p1(kernel大小为3x3,stride size为1,padding size为1)。
3. 在conv4\_3、conv7、conv8\_2、conv9\_2、conv10\_2及pool11层后面接了priorbox层,priorbox层的主要目的是根据输入的特征图(feature map)生成一系列的矩形候选框。关于SSD的更详细的介绍可以参考论文\[[1](#引用)\]
下图为模型(300x300)的总体结构:
<p align="center">
<img src="images/ssd_network.png" width="900" height="250" hspace='10'/> <br/>
图1. SSD网络结构
</p>
图中每个矩形盒子代表一个卷积层,最后的两个矩形框分别表示汇总各卷积层输出结果和后处理阶段。具体地,在预测阶段网络会输出一组候选矩形框,每个矩形包含两类信息:位置和类别得分,图中倒数第二个矩形框即表示网络的检测结果的汇总处理,由于候选矩形框数量较多且很多矩形框重叠严重,这时需要经过后处理来筛选出质量较高的少数矩形框,这里的后处理主要指非极大值抑制(Non-maximum Suppression)。
从SSD的网络结构可以看出,候选矩形框在多个特征图(feature map上)生成,不同的feature map具有的感受野不同,这样可以在不同尺度扫描图像,相对于其他检测方法可以生成更丰富的候选框,从而提高检测精度;另一方面SSD对VGG16的扩展部分以较小的代价实现对候选框的位置和类别得分的计算,整个过程只需要一个卷积神经网络完成,所以速度较快。
## 示例总览
本示例共包含如下文件:
<table>
<caption>表1. 示例文件</caption>
<tr><th>文件</th><th>用途</th></tr>
<tr><td>train.py</td><td>训练脚本</td></tr>
<tr><td>eval.py</td><td>评估脚本,用于评估训好模型</td></tr>
<tr><td>infer.py</td><td>检测脚本,给定图片及模型,实施检测</td></tr>
<tr><td>visual.py</td><td>检测结果可视化</td></tr>
<tr><td>image_util.py</td><td>图像预处理所需公共函数</td></tr>
<tr><td>data_provider.py</td><td>数据处理脚本,生成训练、评估或检测所需数据</td></tr>
<tr><td>config/pascal_voc_conf.py</td><td>神经网络超参数配置文件</td></tr>
<tr><td>data/label_list</td><td>类别列表</td></tr>
<tr><td>data/prepare_voc_data.py</td><td>准备训练PASCAL VOC数据列表</td></tr>
</table>
训练阶段需要对数据做预处理,包括裁剪、采样等,这部分操作在```image_util.py``````data_provider.py```中完成。值得注意的是,```config/vgg_config.py```为参数配置文件,包括训练参数、神经网络参数等,本配置文件包含参数是针对PASCAL VOC数据配置的,当训练自有数据时,需要仿照该文件配置新的参数。```data/prepare_voc_data.py```脚本用来生成文件列表,包括切分训练集和测试集,使用时需要用户事先下载并解压数据,默认采用VOC2007和VOC2012。
## PASCAL VOC数据集
### 数据准备
首先需要下载数据集,VOC2007\[[3](#引用)\]和VOC2012\[[4](#引用)\],VOC2007包含训练集和测试集,VOC2012只包含训练集,将下载好的数据解压,目录结构为```data/VOCdevkit/VOC2007``````data/VOCdevkit/VOC2012```。进入```data```目录,运行```python prepare_voc_data.py```即可生成```trainval.txt``````test.txt```。核心函数为:
```python
def prepare_filelist(devkit_dir, years, output_dir):
trainval_list = []
test_list = []
for year in years:
trainval, test = walk_dir(devkit_dir, year)
trainval_list.extend(trainval)
test_list.extend(test)
random.shuffle(trainval_list)
with open(osp.join(output_dir, 'trainval.txt'), 'w') as ftrainval:
for item in trainval_list:
ftrainval.write(item[0] + ' ' + item[1] + '\n')
with open(osp.join(output_dir, 'test.txt'), 'w') as ftest:
for item in test_list:
ftest.write(item[0] + ' ' + item[1] + '\n')
```
该函数首先对每一年(year)的数据进行处理,然后将训练图像的文件路径列表进行随机打乱,最后保存训练文件列表和测试文件列表。默认```prepare_voc_data.py``````VOCdevkit```在相同目录下,且生成的文件列表也在该目录。需注意```trainval.txt```既包含VOC2007的训练数据,也包含VOC2012的训练数据,```test.txt```只包含VOC2007的测试数据。我们这里提供```trainval.txt```前几行输入作为样例:
```
VOCdevkit/VOC2007/JPEGImages/000005.jpg VOCdevkit/VOC2007/Annotations/000005.xml
VOCdevkit/VOC2007/JPEGImages/000007.jpg VOCdevkit/VOC2007/Annotations/000007.xml
VOCdevkit/VOC2007/JPEGImages/000009.jpg VOCdevkit/VOC2007/Annotations/000009.xml
```
文件共两个字段,第一个字段为图像文件的相对路径,第二个字段为对应标注文件的相对路径。
### 预训练模型准备
下载预训练的VGG-16模型,我们提供了一个转换好的模型,具体下载地址为:http://paddlepaddle.bj.bcebos.com/model_zoo/detection/ssd_model/vgg_model.tar.gz ,下载好模型后,放置路径为```vgg/vgg_model.tar.gz```
### 模型训练
直接执行```python train.py```即可进行训练。需要注意本示例仅支持CUDA GPU环境,无法在CPU上训练,主要因为使用CPU训练速度很慢,实践中一般使用GPU来处理图像任务,这里实现采用硬编码方式使用cuDNN,不提供CPU版本。```train.py```的一些关键执行逻辑:
```python
paddle.init(use_gpu=True, trainer_count=4)
data_args = data_provider.Settings(
data_dir='./data',
label_file='label_list',
resize_h=cfg.IMG_HEIGHT,
resize_w=cfg.IMG_WIDTH,
mean_value=[104,117,124])
train(train_file_list='./data/trainval.txt',
dev_file_list='./data/test.txt',
data_args=data_args,
init_model_path='./vgg/vgg_model.tar.gz')
```
主要包括:
1. 调用```paddle.init```指定使用4卡GPU训练。
2. 调用```data_provider.Settings```配置数据预处理所需参数,其中```cfg.IMG_HEIGHT``````cfg.IMG_WIDTH```在配置文件```config/vgg_config.py```中设置,这里均为300,300x300是一个典型配置,兼顾效率和检测精度,也可以通过修改配置文件扩展到512x512。
3. 调用```train```执行训练,其中```train_file_list```指定训练数据列表,```dev_file_list```指定评估数据列表,```init_model_path```指定预训练模型位置。
4. 训练过程中会打印一些日志信息,每训练1个batch会输出当前的轮数、当前batch的cost及mAP(mean Average Precision,平均精度均值),每训练一个pass,会保存一次模型,默认保存在```checkpoints```目录下(注:需事先创建)。
下面给出SDD300x300在VOC数据集(train包括07+12,test为07)上的mAP曲线,迭代140轮mAP可达到71.52%。
<p align="center">
<img src="images/SSD300x300_map.png" hspace='10'/> <br/>
图2. SSD300x300 mAP收敛曲线
</p>
### 模型评估
执行```python eval.py```即可对模型进行评估,```eval.py```的关键执行逻辑如下:
```python
paddle.init(use_gpu=True, trainer_count=4) # use 4 gpus
data_args = data_provider.Settings(
data_dir='./data',
label_file='label_list',
resize_h=cfg.IMG_HEIGHT,
resize_w=cfg.IMG_WIDTH,
mean_value=[104, 117, 124])
eval(
eval_file_list='./data/test.txt',
batch_size=4,
data_args=data_args,
model_path='models/pass-00000.tar.gz')
```
调用```paddle.init```指定使用4卡GPU评估;```data_provider.Settings```参见训练阶段的配置;调用```eval```执行评估,其中```eval_file_list```指定评估数据列表,```batch_size```指定评估时batch size的大小,```model_path ```指定模型位置。评估结束会输出```loss```信息和```mAP```信息。
### 图像检测
执行```python infer.py```即可使用训练好的模型对图片实施检测,```infer.py```关键逻辑如下:
```python
infer(
eval_file_list='./data/infer.txt',
save_path='infer.res',
data_args=data_args,
batch_size=4,
model_path='models/pass-00000.tar.gz',
threshold=0.3)
```
其中```eval_file_list```指定图像路径列表;```save_path```指定预测结果保存路径;```data_args```如上;```batch_size```为每多少样本预测一次;```model_path```指模型的位置;```threshold```为置信度阈值,只有得分大于或等于该值的才会输出。下面给出```infer.res```的一些输出样例:
```
VOCdevkit/VOC2007/JPEGImages/006936.jpg 12 0.997844 131.255611777 162.271582842 396.475315094 334.0
VOCdevkit/VOC2007/JPEGImages/006936.jpg 14 0.998557 229.160234332 49.5991278887 314.098775387 312.913876176
VOCdevkit/VOC2007/JPEGImages/006936.jpg 14 0.372522 187.543615699 133.727034628 345.647156239 327.448492289
...
```
一共包含4个字段,以tab分割,第一个字段是检测图像路径,第二字段为检测矩形框内类别,第三个字段是置信度,第四个字段是4个坐标值(以空格分割)。
示例还提供了一个可视化脚本,直接运行```python visual.py```即可,须指定输出检测结果路径及输出目录,默认可视化后图像保存在```./visual_res```,下面是用训练好的模型infer部分图像并可视化的效果:
<p align="center">
<img src="images/vis_1.jpg" height=150 width=200 hspace='10'/>
<img src="images/vis_2.jpg" height=150 width=200 hspace='10'/>
<img src="images/vis_3.jpg" height=150 width=100 hspace='10'/>
<img src="images/vis_4.jpg" height=150 width=200 hspace='10'/> <br />
图3. SSD300x300 检测可视化示例
</p>
## 自有数据集
在自有数据上训练PaddlePaddle SSD需要完成两个关键准备,首先需要适配网络可以接受的输入格式,这里提供一个推荐的结构,以```train.txt```为例
```
image00000_file_path image00000_annotation_file_path
image00001_file_path image00001_annotation_file_path
image00002_file_path image00002_annotation_file_path
...
```
文件共两列,以空白符分割,第一列为图像文件的路径,第二列为对应标注数据的文件路径。对图像文件的读取比较直接,略微复杂的是对标注数据的解析,本示例中标注数据使用xml文件存储,所以需要在```data_provider.py```中对xml解析,核心逻辑如下:
```python
bbox_labels = []
root = xml.etree.ElementTree.parse(label_path).getroot()
for object in root.findall('object'):
bbox_sample = []
# start from 1
bbox_sample.append(float(settings.label_list.index(
object.find('name').text)))
bbox = object.find('bndbox')
difficult = float(object.find('difficult').text)
bbox_sample.append(float(bbox.find('xmin').text)/img_width)
bbox_sample.append(float(bbox.find('ymin').text)/img_height)
bbox_sample.append(float(bbox.find('xmax').text)/img_width)
bbox_sample.append(float(bbox.find('ymax').text)/img_height)
bbox_sample.append(difficult)
bbox_labels.append(bbox_sample)
```
这里一条标注数据包括:label、xmin、ymin、xmax、ymax和is\_difficult,is\_difficult表示该object是否为难例,实际中如果不需要,只需把该字段置零即可。自有数据也需要提供对应的解析逻辑,假设标注数据(比如image00000\_annotation\_file\_path)存储格式如下:
```
label1 xmin1 ymin1 xmax1 ymax1
label2 xmin2 ymin2 xmax2 ymax2
...
```
每行对应一个物体,共5个字段,第一个为label(注背景为0,需从1编号),剩余4个为坐标,对应的解析逻辑可更改为如下:
```
bbox_labels = []
with open(label_path) as flabel:
for line in flabel:
bbox_sample = []
bbox = [float(i) for i in line.strip().split()]
label = bbox[0]
bbox_sample.append(label)
bbox_sample.append(bbox[1]/float(img_width))
bbox_sample.append(bbox[2]/float(img_height))
bbox_sample.append(bbox[3]/float(img_width))
bbox_sample.append(bbox[4]/float(img_height))
bbox_sample.append(0.0)
bbox_labels.append(bbox_sample)
```
另一个重要的事情就是根据图像大小及检测物体的大小等更改网络结构的配置,主要是仿照```config/vgg_config.py```创建自己的配置文件,参数设置经验请参照论文\[[1](#引用)\]
## 引用
1. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. [SSD: Single shot multibox detector](https://arxiv.org/abs/1512.02325). European conference on computer vision. Springer, Cham, 2016.
2. Simonyan, Karen, and Andrew Zisserman. [Very deep convolutional networks for large-scale image recognition](https://arxiv.org/abs/1409.1556). arXiv preprint arXiv:1409.1556 (2014).
3. [The PASCAL Visual Object Classes Challenge 2007](http://host.robots.ox.ac.uk/pascal/VOC/voc2007/index.html)
4. [Visual Object Classes Challenge 2012 (VOC2012)](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html)
from easydict import EasyDict as edict
import numpy as np
__C = edict()
cfg = __C
__C.TRAIN = edict()
__C.IMG_WIDTH = 300
__C.IMG_HEIGHT = 300
__C.IMG_CHANNEL = 3
__C.CLASS_NUM = 21
__C.BACKGROUND_ID = 0
# training settings
__C.TRAIN.LEARNING_RATE = 0.001 / 4
__C.TRAIN.MOMENTUM = 0.9
__C.TRAIN.BATCH_SIZE = 32
__C.TRAIN.NUM_PASS = 200
__C.TRAIN.L2REGULARIZATION = 0.0005 * 4
__C.TRAIN.LEARNING_RATE_DECAY_A = 0.1
__C.TRAIN.LEARNING_RATE_DECAY_B = 16551 * 80
__C.TRAIN.LEARNING_RATE_SCHEDULE = 'discexp'
__C.NET = edict()
# configuration for multibox_loss_layer
__C.NET.MBLOSS = edict()
__C.NET.MBLOSS.OVERLAP_THRESHOLD = 0.5
__C.NET.MBLOSS.NEG_POS_RATIO = 3.0
__C.NET.MBLOSS.NEG_OVERLAP = 0.5
# configuration for detection_map
__C.NET.DETMAP = edict()
__C.NET.DETMAP.OVERLAP_THRESHOLD = 0.5
__C.NET.DETMAP.EVAL_DIFFICULT = False
__C.NET.DETMAP.AP_TYPE = "11point"
# configuration for detection_output_layer
__C.NET.DETOUT = edict()
__C.NET.DETOUT.CONFIDENCE_THRESHOLD = 0.01
__C.NET.DETOUT.NMS_THRESHOLD = 0.45
__C.NET.DETOUT.NMS_TOP_K = 400
__C.NET.DETOUT.KEEP_TOP_K = 200
# configuration for priorbox_layer from conv4_3
__C.NET.CONV4 = edict()
__C.NET.CONV4.PB = edict()
__C.NET.CONV4.PB.MIN_SIZE = [30]
__C.NET.CONV4.PB.ASPECT_RATIO = [2.]
__C.NET.CONV4.PB.VARIANCE = [0.1, 0.1, 0.2, 0.2]
# configuration for priorbox_layer from fc7
__C.NET.FC7 = edict()
__C.NET.FC7.PB = edict()
__C.NET.FC7.PB.MIN_SIZE = [60]
__C.NET.FC7.PB.MAX_SIZE = [114]
__C.NET.FC7.PB.ASPECT_RATIO = [2., 3.]
__C.NET.FC7.PB.VARIANCE = [0.1, 0.1, 0.2, 0.2]
# configuration for priorbox_layer from conv6_2
__C.NET.CONV6 = edict()
__C.NET.CONV6.PB = edict()
__C.NET.CONV6.PB.MIN_SIZE = [114]
__C.NET.CONV6.PB.MAX_SIZE = [168]
__C.NET.CONV6.PB.ASPECT_RATIO = [2., 3.]
__C.NET.CONV6.PB.VARIANCE = [0.1, 0.1, 0.2, 0.2]
# configuration for priorbox_layer from conv7_2
__C.NET.CONV7 = edict()
__C.NET.CONV7.PB = edict()
__C.NET.CONV7.PB.MIN_SIZE = [168]
__C.NET.CONV7.PB.MAX_SIZE = [222]
__C.NET.CONV7.PB.ASPECT_RATIO = [2., 3.]
__C.NET.CONV7.PB.VARIANCE = [0.1, 0.1, 0.2, 0.2]
# configuration for priorbox_layer from conv8_2
__C.NET.CONV8 = edict()
__C.NET.CONV8.PB = edict()
__C.NET.CONV8.PB.MIN_SIZE = [222]
__C.NET.CONV8.PB.MAX_SIZE = [276]
__C.NET.CONV8.PB.ASPECT_RATIO = [2., 3.]
__C.NET.CONV8.PB.VARIANCE = [0.1, 0.1, 0.2, 0.2]
# configuration for priorbox_layer from pool6
__C.NET.POOL6 = edict()
__C.NET.POOL6.PB = edict()
__C.NET.POOL6.PB.MIN_SIZE = [276]
__C.NET.POOL6.PB.MAX_SIZE = [330]
__C.NET.POOL6.PB.ASPECT_RATIO = [2., 3.]
__C.NET.POOL6.PB.VARIANCE = [0.1, 0.1, 0.2, 0.2]
background
aeroplane
bicycle
bird
boat
bottle
bus
car
cat
chair
cow
diningtable
dog
horse
motorbike
person
pottedplant
sheep
sofa
train
tvmonitor
import os
import os.path as osp
import re
import random
devkit_dir = './VOCdevkit'
years = ['2007', '2012']
def get_dir(devkit_dir, year, type):
return osp.join(devkit_dir, 'VOC' + year, type)
def walk_dir(devkit_dir, year):
filelist_dir = get_dir(devkit_dir, year, 'ImageSets/Main')
annotation_dir = get_dir(devkit_dir, year, 'Annotations')
img_dir = get_dir(devkit_dir, year, 'JPEGImages')
trainval_list = []
test_list = []
added = set()
for _, _, files in os.walk(filelist_dir):
for fname in files:
img_ann_list = []
if re.match('[a-z]+_trainval\.txt', fname):
img_ann_list = trainval_list
elif re.match('[a-z]+_test\.txt', fname):
img_ann_list = test_list
else:
continue
fpath = osp.join(filelist_dir, fname)
for line in open(fpath):
name_prefix = line.strip().split()[0]
if name_prefix in added:
continue
added.add(name_prefix)
ann_path = osp.join(annotation_dir, name_prefix + '.xml')
img_path = osp.join(img_dir, name_prefix + '.jpg')
assert os.path.isfile(ann_path), 'file %s not found.' % ann_path
assert os.path.isfile(img_path), 'file %s not found.' % img_path
img_ann_list.append((img_path, ann_path))
return trainval_list, test_list
def prepare_filelist(devkit_dir, years, output_dir):
trainval_list = []
test_list = []
for year in years:
trainval, test = walk_dir(devkit_dir, year)
trainval_list.extend(trainval)
test_list.extend(test)
random.shuffle(trainval_list)
with open(osp.join(output_dir, 'trainval.txt'), 'w') as ftrainval:
for item in trainval_list:
ftrainval.write(item[0] + ' ' + item[1] + '\n')
with open(osp.join(output_dir, 'test.txt'), 'w') as ftest:
for item in test_list:
ftest.write(item[0] + ' ' + item[1] + '\n')
prepare_filelist(devkit_dir, years, '.')
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import image_util
from paddle.utils.image_util import *
import random
from PIL import Image
import numpy as np
import xml.etree.ElementTree
import os
class Settings(object):
def __init__(self, data_dir, label_file, resize_h, resize_w, mean_value):
self._data_dir = data_dir
self._label_list = []
label_fpath = os.path.join(data_dir, label_file)
for line in open(label_fpath):
self._label_list.append(line.strip())
self._resize_height = resize_h
self._resize_width = resize_w
self._img_mean = np.array(mean_value)[:, np.newaxis, np.newaxis].astype(
'float32')
@property
def data_dir(self):
return self._data_dir
@property
def label_list(self):
return self._label_list
@property
def resize_h(self):
return self._resize_height
@property
def resize_w(self):
return self._resize_width
@property
def img_mean(self):
return self._img_mean
def _reader_creator(settings, file_list, mode, shuffle):
def reader():
with open(file_list) as flist:
lines = [line.strip() for line in flist]
if shuffle:
random.shuffle(lines)
for line in lines:
if mode == 'train' or mode == 'test':
img_path, label_path = line.split()
img_path = os.path.join(settings.data_dir, img_path)
label_path = os.path.join(settings.data_dir, label_path)
elif mode == 'infer':
img_path = os.path.join(settings.data_dir, line)
img = Image.open(img_path)
img_width, img_height = img.size
img = np.array(img)
# layout: label | xmin | ymin | xmax | ymax | difficult
if mode == 'train' or mode == 'test':
bbox_labels = []
root = xml.etree.ElementTree.parse(label_path).getroot()
for object in root.findall('object'):
bbox_sample = []
# start from 1
bbox_sample.append(
float(
settings.label_list.index(
object.find('name').text)))
bbox = object.find('bndbox')
difficult = float(object.find('difficult').text)
bbox_sample.append(
float(bbox.find('xmin').text) / img_width)
bbox_sample.append(
float(bbox.find('ymin').text) / img_height)
bbox_sample.append(
float(bbox.find('xmax').text) / img_width)
bbox_sample.append(
float(bbox.find('ymax').text) / img_height)
bbox_sample.append(difficult)
bbox_labels.append(bbox_sample)
sample_labels = bbox_labels
if mode == 'train':
batch_sampler = []
# hard-code here
batch_sampler.append(
image_util.sampler(1, 1, 1.0, 1.0, 1.0, 1.0, 0.0,
0.0))
batch_sampler.append(
image_util.sampler(1, 50, 0.3, 1.0, 0.5, 2.0, 0.1,
0.0))
batch_sampler.append(
image_util.sampler(1, 50, 0.3, 1.0, 0.5, 2.0, 0.3,
0.0))
batch_sampler.append(
image_util.sampler(1, 50, 0.3, 1.0, 0.5, 2.0, 0.5,
0.0))
batch_sampler.append(
image_util.sampler(1, 50, 0.3, 1.0, 0.5, 2.0, 0.7,
0.0))
batch_sampler.append(
image_util.sampler(1, 50, 0.3, 1.0, 0.5, 2.0, 0.9,
0.0))
batch_sampler.append(
image_util.sampler(1, 50, 0.3, 1.0, 0.5, 2.0, 0.0,
1.0))
""" random crop """
sampled_bbox = image_util.generate_batch_samples(
batch_sampler, bbox_labels, img_width, img_height)
if len(sampled_bbox) > 0:
idx = int(random.uniform(0, len(sampled_bbox)))
img, sample_labels = image_util.crop_image(
img, bbox_labels, sampled_bbox[idx], img_width,
img_height)
img = Image.fromarray(img)
img = img.resize((settings.resize_w, settings.resize_h),
Image.ANTIALIAS)
img = np.array(img)
if mode == 'train':
mirror = int(random.uniform(0, 2))
if mirror == 1:
img = img[:, ::-1, :]
for i in xrange(len(sample_labels)):
tmp = sample_labels[i][1]
sample_labels[i][1] = 1 - sample_labels[i][3]
sample_labels[i][3] = 1 - tmp
if len(img.shape) == 3:
img = np.swapaxes(img, 1, 2)
img = np.swapaxes(img, 1, 0)
img = img.astype('float32')
img -= settings.img_mean
img = img.flatten()
if mode == 'train' or mode == 'test':
if mode == 'train' and len(sample_labels) == 0: continue
yield img.astype('float32'), sample_labels
elif mode == 'infer':
yield img.astype('float32')
return reader
def train(settings, file_list, shuffle=True):
return _reader_creator(settings, file_list, 'train', shuffle)
def test(settings, file_list):
return _reader_creator(settings, file_list, 'test', False)
def infer(settings, file_list):
return _reader_creator(settings, file_list, 'infer', False)
import paddle.v2 as paddle
import data_provider
import vgg_ssd_net
import os, sys
import gzip
from config.pascal_voc_conf import cfg
def eval(eval_file_list, batch_size, data_args, model_path):
cost, detect_out = vgg_ssd_net.net_conf(mode='eval')
assert os.path.isfile(model_path), 'Invalid model.'
parameters = paddle.parameters.Parameters.from_tar(gzip.open(model_path))
optimizer = paddle.optimizer.Momentum()
trainer = paddle.trainer.SGD(
cost=cost,
parameters=parameters,
extra_layers=[detect_out],
update_equation=optimizer)
feeding = {'image': 0, 'bbox': 1}
reader = paddle.batch(
data_provider.test(data_args, eval_file_list), batch_size=batch_size)
result = trainer.test(reader=reader, feeding=feeding)
print "TestCost: %f, Detection mAP=%g" % \
(result.cost, result.metrics['detection_evaluator'])
if __name__ == "__main__":
paddle.init(use_gpu=True, trainer_count=4) # use 4 gpus
data_args = data_provider.Settings(
data_dir='./data',
label_file='label_list',
resize_h=cfg.IMG_HEIGHT,
resize_w=cfg.IMG_WIDTH,
mean_value=[104, 117, 124])
eval(
eval_file_list='./data/test.txt',
batch_size=4,
data_args=data_args,
model_path='models/pass-00000.tar.gz')
from PIL import Image
import numpy as np
import random
import math
class sampler():
def __init__(self, max_sample, max_trial, min_scale, max_scale,
min_aspect_ratio, max_aspect_ratio, min_jaccard_overlap,
max_jaccard_overlap):
self.max_sample = max_sample
self.max_trial = max_trial
self.min_scale = min_scale
self.max_scale = max_scale
self.min_aspect_ratio = min_aspect_ratio
self.max_aspect_ratio = max_aspect_ratio
self.min_jaccard_overlap = min_jaccard_overlap
self.max_jaccard_overlap = max_jaccard_overlap
class bbox():
def __init__(self, xmin, ymin, xmax, ymax):
self.xmin = xmin
self.ymin = ymin
self.xmax = xmax
self.ymax = ymax
def bbox_area(src_bbox):
width = src_bbox.xmax - src_bbox.xmin
height = src_bbox.ymax - src_bbox.ymin
return width * height
def generate_sample(sampler):
scale = random.uniform(sampler.min_scale, sampler.max_scale)
min_aspect_ratio = max(sampler.min_aspect_ratio, (scale**2.0))
max_aspect_ratio = min(sampler.max_aspect_ratio, 1 / (scale**2.0))
aspect_ratio = random.uniform(min_aspect_ratio, max_aspect_ratio)
bbox_width = scale * (aspect_ratio**0.5)
bbox_height = scale / (aspect_ratio**0.5)
xmin_bound = 1 - bbox_width
ymin_bound = 1 - bbox_height
xmin = random.uniform(0, xmin_bound)
ymin = random.uniform(0, ymin_bound)
xmax = xmin + bbox_width
ymax = ymin + bbox_height
sampled_bbox = bbox(xmin, ymin, xmax, ymax)
return sampled_bbox
def jaccard_overlap(sample_bbox, object_bbox):
if sample_bbox.xmin >= object_bbox.xmax or \
sample_bbox.xmax <= object_bbox.xmin or \
sample_bbox.ymin >= object_bbox.ymax or \
sample_bbox.ymax <= object_bbox.ymin:
return 0
intersect_xmin = max(sample_bbox.xmin, object_bbox.xmin)
intersect_ymin = max(sample_bbox.ymin, object_bbox.ymin)
intersect_xmax = min(sample_bbox.xmax, object_bbox.xmax)
intersect_ymax = min(sample_bbox.ymax, object_bbox.ymax)
intersect_size = (intersect_xmax - intersect_xmin) * (
intersect_ymax - intersect_ymin)
sample_bbox_size = bbox_area(sample_bbox)
object_bbox_size = bbox_area(object_bbox)
overlap = intersect_size / (
sample_bbox_size + object_bbox_size - intersect_size)
return overlap
def satisfy_sample_constraint(sampler, sample_bbox, bbox_labels):
if sampler.min_jaccard_overlap == 0 and sampler.max_jaccard_overlap == 0:
return True
for i in range(len(bbox_labels)):
object_bbox = bbox(bbox_labels[i][1], bbox_labels[i][2],
bbox_labels[i][3], bbox_labels[i][4])
overlap = jaccard_overlap(sample_bbox, object_bbox)
if sampler.min_jaccard_overlap != 0 and \
overlap < sampler.min_jaccard_overlap:
continue
if sampler.max_jaccard_overlap != 0 and \
overlap > sampler.max_jaccard_overlap:
continue
return True
return False
def generate_batch_samples(batch_sampler, bbox_labels, image_width,
image_height):
sampled_bbox = []
index = []
c = 0
for sampler in batch_sampler:
found = 0
for i in range(sampler.max_trial):
if found >= sampler.max_sample:
break
sample_bbox = generate_sample(sampler)
if satisfy_sample_constraint(sampler, sample_bbox, bbox_labels):
sampled_bbox.append(sample_bbox)
found = found + 1
index.append(c)
c = c + 1
return sampled_bbox
def clip_bbox(src_bbox):
src_bbox.xmin = max(min(src_bbox.xmin, 1.0), 0.0)
src_bbox.ymin = max(min(src_bbox.ymin, 1.0), 0.0)
src_bbox.xmax = max(min(src_bbox.xmax, 1.0), 0.0)
src_bbox.ymax = max(min(src_bbox.ymax, 1.0), 0.0)
return src_bbox
def meet_emit_constraint(src_bbox, sample_bbox):
center_x = (src_bbox.xmax + src_bbox.xmin) / 2
center_y = (src_bbox.ymax + src_bbox.ymin) / 2
if center_x >= sample_bbox.xmin and \
center_x <= sample_bbox.xmax and \
center_y >= sample_bbox.ymin and \
center_y <= sample_bbox.ymax:
return True
return False
def transform_labels(bbox_labels, sample_bbox):
proj_bbox = bbox(0, 0, 0, 0)
sample_labels = []
for i in range(len(bbox_labels)):
sample_label = []
object_bbox = bbox(bbox_labels[i][1], bbox_labels[i][2],
bbox_labels[i][3], bbox_labels[i][4])
if not meet_emit_constraint(object_bbox, sample_bbox):
continue
sample_width = sample_bbox.xmax - sample_bbox.xmin
sample_height = sample_bbox.ymax - sample_bbox.ymin
proj_bbox.xmin = (object_bbox.xmin - sample_bbox.xmin) / sample_width
proj_bbox.ymin = (object_bbox.ymin - sample_bbox.ymin) / sample_height
proj_bbox.xmax = (object_bbox.xmax - sample_bbox.xmin) / sample_width
proj_bbox.ymax = (object_bbox.ymax - sample_bbox.ymin) / sample_height
proj_bbox = clip_bbox(proj_bbox)
if bbox_area(proj_bbox) > 0:
sample_label.append(bbox_labels[i][0])
sample_label.append(float(proj_bbox.xmin))
sample_label.append(float(proj_bbox.ymin))
sample_label.append(float(proj_bbox.xmax))
sample_label.append(float(proj_bbox.ymax))
sample_label.append(bbox_labels[i][5])
sample_labels.append(sample_label)
return sample_labels
def crop_image(img, bbox_labels, sample_bbox, image_width, image_height):
sample_bbox = clip_bbox(sample_bbox)
xmin = int(sample_bbox.xmin * image_width)
xmax = int(sample_bbox.xmax * image_width)
ymin = int(sample_bbox.ymin * image_height)
ymax = int(sample_bbox.ymax * image_height)
sample_img = img[ymin:ymax, xmin:xmax]
sample_labels = transform_labels(bbox_labels, sample_bbox)
return sample_img, sample_labels
<html>
<head>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'] ],
processEscapes: true
},
"HTML-CSS": { availableFonts: ["TeX"] }
});
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js" async></script>
<script type="text/javascript" src="../.tools/theme/marked.js">
</script>
<link href="http://cdn.bootcss.com/highlight.js/9.9.0/styles/darcula.min.css" rel="stylesheet">
<script src="http://cdn.bootcss.com/highlight.js/9.9.0/highlight.min.js"></script>
<link href="http://cdn.bootcss.com/bootstrap/4.0.0-alpha.6/css/bootstrap.min.css" rel="stylesheet">
<link href="https://cdn.jsdelivr.net/perfect-scrollbar/0.6.14/css/perfect-scrollbar.min.css" rel="stylesheet">
<link href="../.tools/theme/github-markdown.css" rel='stylesheet'>
</head>
<style type="text/css" >
.markdown-body {
box-sizing: border-box;
min-width: 200px;
max-width: 980px;
margin: 0 auto;
padding: 45px;
}
</style>
<body>
<div id="context" class="container-fluid markdown-body">
</div>
<!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
<div id="markdown" style='display:none'>
# SSD目标检测
## 概述
SSD全称为Single Shot MultiBox Detector,是目标检测领域较新且效果较好的检测算法之一,具体参见论文\[[1](#引用)\]。SSD算法主要特点是检测速度快且检测精度高。PaddlePaddle已集成SSD算法,本示例旨在介绍如何使用PaddlePaddle中的SSD模型进行目标检测。下文展开顺序为:首先简要介绍SSD原理,然后介绍示例包含文件及作用,接着介绍如何在PASCAL VOC数据集上训练、评估及检测,最后简要介绍如何在自有数据集上使用SSD。
## SSD原理
SSD使用一个卷积神经网络实现“端到端”的检测,所谓“端到端”指输入为原始图像,输出为检测结果,无需借助外部工具或流程进行特征提取、候选框生成等。论文中SSD的基础模型为VGG16\[[2](#引用)\],不同于原始VGG16网络模型,SSD做了一些改变:
1. 将最后的fc6、fc7全连接层变为卷积层,卷积层参数通过对原始fc6、fc7参数采样得到。
2. 将pool5层的参数由2x2-s2(kernel大小为2x2,stride size为2)更改为3x3-s1-p1(kernel大小为3x3,stride size为1,padding size为1)。
3. 在conv4\_3、conv7、conv8\_2、conv9\_2、conv10\_2及pool11层后面接了priorbox层,priorbox层的主要目的是根据输入的特征图(feature map)生成一系列的矩形候选框。关于SSD的更详细的介绍可以参考论文\[[1](#引用)\]。
下图为模型(300x300)的总体结构:
<p align="center">
<img src="images/ssd_network.png" width="900" height="250" hspace='10'/> <br/>
图1. SSD网络结构
</p>
图中每个矩形盒子代表一个卷积层,最后的两个矩形框分别表示汇总各卷积层输出结果和后处理阶段。具体地,在预测阶段网络会输出一组候选矩形框,每个矩形包含两类信息:位置和类别得分,图中倒数第二个矩形框即表示网络的检测结果的汇总处理,由于候选矩形框数量较多且很多矩形框重叠严重,这时需要经过后处理来筛选出质量较高的少数矩形框,这里的后处理主要指非极大值抑制(Non-maximum Suppression)。
从SSD的网络结构可以看出,候选矩形框在多个特征图(feature map上)生成,不同的feature map具有的感受野不同,这样可以在不同尺度扫描图像,相对于其他检测方法可以生成更丰富的候选框,从而提高检测精度;另一方面SSD对VGG16的扩展部分以较小的代价实现对候选框的位置和类别得分的计算,整个过程只需要一个卷积神经网络完成,所以速度较快。
## 示例总览
本示例共包含如下文件:
<table>
<caption>表1. 示例文件</caption>
<tr><th>文件</th><th>用途</th></tr>
<tr><td>train.py</td><td>训练脚本</td></tr>
<tr><td>eval.py</td><td>评估脚本,用于评估训好模型</td></tr>
<tr><td>infer.py</td><td>检测脚本,给定图片及模型,实施检测</td></tr>
<tr><td>visual.py</td><td>检测结果可视化</td></tr>
<tr><td>image_util.py</td><td>图像预处理所需公共函数</td></tr>
<tr><td>data_provider.py</td><td>数据处理脚本,生成训练、评估或检测所需数据</td></tr>
<tr><td>config/pascal_voc_conf.py</td><td>神经网络超参数配置文件</td></tr>
<tr><td>data/label_list</td><td>类别列表</td></tr>
<tr><td>data/prepare_voc_data.py</td><td>准备训练PASCAL VOC数据列表</td></tr>
</table>
训练阶段需要对数据做预处理,包括裁剪、采样等,这部分操作在```image_util.py```和```data_provider.py```中完成。值得注意的是,```config/vgg_config.py```为参数配置文件,包括训练参数、神经网络参数等,本配置文件包含参数是针对PASCAL VOC数据配置的,当训练自有数据时,需要仿照该文件配置新的参数。```data/prepare_voc_data.py```脚本用来生成文件列表,包括切分训练集和测试集,使用时需要用户事先下载并解压数据,默认采用VOC2007和VOC2012。
## PASCAL VOC数据集
### 数据准备
首先需要下载数据集,VOC2007\[[3](#引用)\]和VOC2012\[[4](#引用)\],VOC2007包含训练集和测试集,VOC2012只包含训练集,将下载好的数据解压,目录结构为```data/VOCdevkit/VOC2007```和```data/VOCdevkit/VOC2012```。进入```data```目录,运行```python prepare_voc_data.py```即可生成```trainval.txt```和```test.txt```。核心函数为:
```python
def prepare_filelist(devkit_dir, years, output_dir):
trainval_list = []
test_list = []
for year in years:
trainval, test = walk_dir(devkit_dir, year)
trainval_list.extend(trainval)
test_list.extend(test)
random.shuffle(trainval_list)
with open(osp.join(output_dir, 'trainval.txt'), 'w') as ftrainval:
for item in trainval_list:
ftrainval.write(item[0] + ' ' + item[1] + '\n')
with open(osp.join(output_dir, 'test.txt'), 'w') as ftest:
for item in test_list:
ftest.write(item[0] + ' ' + item[1] + '\n')
```
该函数首先对每一年(year)的数据进行处理,然后将训练图像的文件路径列表进行随机打乱,最后保存训练文件列表和测试文件列表。默认```prepare_voc_data.py```和```VOCdevkit```在相同目录下,且生成的文件列表也在该目录。需注意```trainval.txt```既包含VOC2007的训练数据,也包含VOC2012的训练数据,```test.txt```只包含VOC2007的测试数据。我们这里提供```trainval.txt```前几行输入作为样例:
```
VOCdevkit/VOC2007/JPEGImages/000005.jpg VOCdevkit/VOC2007/Annotations/000005.xml
VOCdevkit/VOC2007/JPEGImages/000007.jpg VOCdevkit/VOC2007/Annotations/000007.xml
VOCdevkit/VOC2007/JPEGImages/000009.jpg VOCdevkit/VOC2007/Annotations/000009.xml
```
文件共两个字段,第一个字段为图像文件的相对路径,第二个字段为对应标注文件的相对路径。
### 预训练模型准备
下载预训练的VGG-16模型,我们提供了一个转换好的模型,具体下载地址为:http://paddlepaddle.bj.bcebos.com/model_zoo/detection/ssd_model/vgg_model.tar.gz ,下载好模型后,放置路径为```vgg/vgg_model.tar.gz```。
### 模型训练
直接执行```python train.py```即可进行训练。需要注意本示例仅支持CUDA GPU环境,无法在CPU上训练,主要因为使用CPU训练速度很慢,实践中一般使用GPU来处理图像任务,这里实现采用硬编码方式使用cuDNN,不提供CPU版本。```train.py```的一些关键执行逻辑:
```python
paddle.init(use_gpu=True, trainer_count=4)
data_args = data_provider.Settings(
data_dir='./data',
label_file='label_list',
resize_h=cfg.IMG_HEIGHT,
resize_w=cfg.IMG_WIDTH,
mean_value=[104,117,124])
train(train_file_list='./data/trainval.txt',
dev_file_list='./data/test.txt',
data_args=data_args,
init_model_path='./vgg/vgg_model.tar.gz')
```
主要包括:
1. 调用```paddle.init```指定使用4卡GPU训练。
2. 调用```data_provider.Settings```配置数据预处理所需参数,其中```cfg.IMG_HEIGHT```和```cfg.IMG_WIDTH```在配置文件```config/vgg_config.py```中设置,这里均为300,300x300是一个典型配置,兼顾效率和检测精度,也可以通过修改配置文件扩展到512x512。
3. 调用```train```执行训练,其中```train_file_list```指定训练数据列表,```dev_file_list```指定评估数据列表,```init_model_path```指定预训练模型位置。
4. 训练过程中会打印一些日志信息,每训练1个batch会输出当前的轮数、当前batch的cost及mAP(mean Average Precision,平均精度均值),每训练一个pass,会保存一次模型,默认保存在```checkpoints```目录下(注:需事先创建)。
下面给出SDD300x300在VOC数据集(train包括07+12,test为07)上的mAP曲线,迭代140轮mAP可达到71.52%。
<p align="center">
<img src="images/SSD300x300_map.png" hspace='10'/> <br/>
图2. SSD300x300 mAP收敛曲线
</p>
### 模型评估
执行```python eval.py```即可对模型进行评估,```eval.py```的关键执行逻辑如下:
```python
paddle.init(use_gpu=True, trainer_count=4) # use 4 gpus
data_args = data_provider.Settings(
data_dir='./data',
label_file='label_list',
resize_h=cfg.IMG_HEIGHT,
resize_w=cfg.IMG_WIDTH,
mean_value=[104, 117, 124])
eval(
eval_file_list='./data/test.txt',
batch_size=4,
data_args=data_args,
model_path='models/pass-00000.tar.gz')
```
调用```paddle.init```指定使用4卡GPU评估;```data_provider.Settings```参见训练阶段的配置;调用```eval```执行评估,其中```eval_file_list```指定评估数据列表,```batch_size```指定评估时batch size的大小,```model_path ```指定模型位置。评估结束会输出```loss```信息和```mAP```信息。
### 图像检测
执行```python infer.py```即可使用训练好的模型对图片实施检测,```infer.py```关键逻辑如下:
```python
infer(
eval_file_list='./data/infer.txt',
save_path='infer.res',
data_args=data_args,
batch_size=4,
model_path='models/pass-00000.tar.gz',
threshold=0.3)
```
其中```eval_file_list```指定图像路径列表;```save_path```指定预测结果保存路径;```data_args```如上;```batch_size```为每多少样本预测一次;```model_path```指模型的位置;```threshold```为置信度阈值,只有得分大于或等于该值的才会输出。下面给出```infer.res```的一些输出样例:
```
VOCdevkit/VOC2007/JPEGImages/006936.jpg 12 0.997844 131.255611777 162.271582842 396.475315094 334.0
VOCdevkit/VOC2007/JPEGImages/006936.jpg 14 0.998557 229.160234332 49.5991278887 314.098775387 312.913876176
VOCdevkit/VOC2007/JPEGImages/006936.jpg 14 0.372522 187.543615699 133.727034628 345.647156239 327.448492289
...
```
一共包含4个字段,以tab分割,第一个字段是检测图像路径,第二字段为检测矩形框内类别,第三个字段是置信度,第四个字段是4个坐标值(以空格分割)。
示例还提供了一个可视化脚本,直接运行```python visual.py```即可,须指定输出检测结果路径及输出目录,默认可视化后图像保存在```./visual_res```,下面是用训练好的模型infer部分图像并可视化的效果:
<p align="center">
<img src="images/vis_1.jpg" height=150 width=200 hspace='10'/>
<img src="images/vis_2.jpg" height=150 width=200 hspace='10'/>
<img src="images/vis_3.jpg" height=150 width=100 hspace='10'/>
<img src="images/vis_4.jpg" height=150 width=200 hspace='10'/> <br />
图3. SSD300x300 检测可视化示例
</p>
## 自有数据集
在自有数据上训练PaddlePaddle SSD需要完成两个关键准备,首先需要适配网络可以接受的输入格式,这里提供一个推荐的结构,以```train.txt```为例
```
image00000_file_path image00000_annotation_file_path
image00001_file_path image00001_annotation_file_path
image00002_file_path image00002_annotation_file_path
...
```
文件共两列,以空白符分割,第一列为图像文件的路径,第二列为对应标注数据的文件路径。对图像文件的读取比较直接,略微复杂的是对标注数据的解析,本示例中标注数据使用xml文件存储,所以需要在```data_provider.py```中对xml解析,核心逻辑如下:
```python
bbox_labels = []
root = xml.etree.ElementTree.parse(label_path).getroot()
for object in root.findall('object'):
bbox_sample = []
# start from 1
bbox_sample.append(float(settings.label_list.index(
object.find('name').text)))
bbox = object.find('bndbox')
difficult = float(object.find('difficult').text)
bbox_sample.append(float(bbox.find('xmin').text)/img_width)
bbox_sample.append(float(bbox.find('ymin').text)/img_height)
bbox_sample.append(float(bbox.find('xmax').text)/img_width)
bbox_sample.append(float(bbox.find('ymax').text)/img_height)
bbox_sample.append(difficult)
bbox_labels.append(bbox_sample)
```
这里一条标注数据包括:label、xmin、ymin、xmax、ymax和is\_difficult,is\_difficult表示该object是否为难例,实际中如果不需要,只需把该字段置零即可。自有数据也需要提供对应的解析逻辑,假设标注数据(比如image00000\_annotation\_file\_path)存储格式如下:
```
label1 xmin1 ymin1 xmax1 ymax1
label2 xmin2 ymin2 xmax2 ymax2
...
```
每行对应一个物体,共5个字段,第一个为label(注背景为0,需从1编号),剩余4个为坐标,对应的解析逻辑可更改为如下:
```
bbox_labels = []
with open(label_path) as flabel:
for line in flabel:
bbox_sample = []
bbox = [float(i) for i in line.strip().split()]
label = bbox[0]
bbox_sample.append(label)
bbox_sample.append(bbox[1]/float(img_width))
bbox_sample.append(bbox[2]/float(img_height))
bbox_sample.append(bbox[3]/float(img_width))
bbox_sample.append(bbox[4]/float(img_height))
bbox_sample.append(0.0)
bbox_labels.append(bbox_sample)
```
另一个重要的事情就是根据图像大小及检测物体的大小等更改网络结构的配置,主要是仿照```config/vgg_config.py```创建自己的配置文件,参数设置经验请参照论文\[[1](#引用)\]。
## 引用
1. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. [SSD: Single shot multibox detector](https://arxiv.org/abs/1512.02325). European conference on computer vision. Springer, Cham, 2016.
2. Simonyan, Karen, and Andrew Zisserman. [Very deep convolutional networks for large-scale image recognition](https://arxiv.org/abs/1409.1556). arXiv preprint arXiv:1409.1556 (2014).
3. [The PASCAL Visual Object Classes Challenge 2007](http://host.robots.ox.ac.uk/pascal/VOC/voc2007/index.html)
4. [Visual Object Classes Challenge 2012 (VOC2012)](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html)
</div>
<!-- You can change the lines below now. -->
<script type="text/javascript">
marked.setOptions({
renderer: new marked.Renderer(),
gfm: true,
breaks: false,
smartypants: true,
highlight: function(code, lang) {
code = code.replace(/&amp;/g, "&")
code = code.replace(/&gt;/g, ">")
code = code.replace(/&lt;/g, "<")
code = code.replace(/&nbsp;/g, " ")
return hljs.highlightAuto(code, [lang]).value;
}
});
document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML)
</script>
</body>
import paddle.v2 as paddle
import data_provider
import vgg_ssd_net
import os, sys
import numpy as np
import gzip
from PIL import Image
from config.pascal_voc_conf import cfg
def _infer(inferer, infer_data, threshold):
ret = []
infer_res = inferer.infer(input=infer_data)
keep_inds = np.where(infer_res[:, 2] >= threshold)[0]
for idx in keep_inds:
ret.append([
infer_res[idx][0], infer_res[idx][1] - 1, infer_res[idx][2],
infer_res[idx][3], infer_res[idx][4], infer_res[idx][5],
infer_res[idx][6]
])
return ret
def save_batch_res(ret_res, img_w, img_h, fname_list, fout):
for det_res in ret_res:
img_idx = int(det_res[0])
label = int(det_res[1])
conf_score = det_res[2]
xmin = det_res[3] * img_w[img_idx]
ymin = det_res[4] * img_h[img_idx]
xmax = det_res[5] * img_w[img_idx]
ymax = det_res[6] * img_h[img_idx]
fout.write(fname_list[img_idx] + '\t' + str(label) + '\t' + str(
conf_score) + '\t' + str(xmin) + ' ' + str(ymin) + ' ' + str(xmax) +
' ' + str(ymax))
fout.write('\n')
def infer(eval_file_list, save_path, data_args, batch_size, model_path,
threshold):
detect_out = vgg_ssd_net.net_conf(mode='infer')
assert os.path.isfile(model_path), 'Invalid model.'
parameters = paddle.parameters.Parameters.from_tar(gzip.open(model_path))
inferer = paddle.inference.Inference(
output_layer=detect_out, parameters=parameters)
reader = data_provider.infer(data_args, eval_file_list)
all_fname_list = [line.strip() for line in open(eval_file_list).readlines()]
test_data = []
fname_list = []
img_w = []
img_h = []
idx = 0
"""Do inference batch by batch,
coords of bbox will be scaled based on image size
"""
with open(save_path, 'w') as fout:
for img in reader():
test_data.append([img])
fname_list.append(all_fname_list[idx])
w, h = Image.open(os.path.join('./data', fname_list[-1])).size
img_w.append(w)
img_h.append(h)
if len(test_data) == batch_size:
ret_res = _infer(inferer, test_data, threshold)
save_batch_res(ret_res, img_w, img_h, fname_list, fout)
test_data = []
fname_list = []
img_w = []
img_h = []
idx += 1
if len(test_data) > 0:
ret_res = _infer(inferer, test_data, threshold)
save_batch_res(ret_res, img_w, img_h, fname_list, fout)
if __name__ == "__main__":
paddle.init(use_gpu=True, trainer_count=1)
data_args = data_provider.Settings(
data_dir='./data',
label_file='label_list',
resize_h=cfg.IMG_HEIGHT,
resize_w=cfg.IMG_WIDTH,
mean_value=[104, 117, 124])
infer(
eval_file_list='./data/infer.txt',
save_path='infer.res',
data_args=data_args,
batch_size=4,
model_path='models/pass-00000.tar.gz',
threshold=0.3)
import paddle.v2 as paddle
import data_provider
import vgg_ssd_net
import os, sys
import gzip
import tarfile
from config.pascal_voc_conf import cfg
def train(train_file_list, dev_file_list, data_args, init_model_path):
optimizer = paddle.optimizer.Momentum(
momentum=cfg.TRAIN.MOMENTUM,
learning_rate=cfg.TRAIN.LEARNING_RATE,
regularization=paddle.optimizer.L2Regularization(
rate=cfg.TRAIN.L2REGULARIZATION),
learning_rate_decay_a=cfg.TRAIN.LEARNING_RATE_DECAY_A,
learning_rate_decay_b=cfg.TRAIN.LEARNING_RATE_DECAY_B,
learning_rate_schedule=cfg.TRAIN.LEARNING_RATE_SCHEDULE)
cost, detect_out = vgg_ssd_net.net_conf('train')
parameters = paddle.parameters.create(cost)
if not (init_model_path is None):
assert os.path.isfile(init_model_path), 'Invalid model.'
parameters.init_from_tar(gzip.open(init_model_path))
trainer = paddle.trainer.SGD(
cost=cost,
parameters=parameters,
extra_layers=[detect_out],
update_equation=optimizer)
feeding = {'image': 0, 'bbox': 1}
train_reader = paddle.batch(
data_provider.train(data_args, train_file_list),
batch_size=cfg.TRAIN.BATCH_SIZE) # generate a batch image each time
dev_reader = paddle.batch(
data_provider.test(data_args, dev_file_list),
batch_size=cfg.TRAIN.BATCH_SIZE)
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 1 == 0:
print "\nPass %d, Batch %d, TrainCost %f, Detection mAP=%f" % \
(event.pass_id,
event.batch_id,
event.cost,
event.metrics['detection_evaluator'])
else:
sys.stdout.write('.')
sys.stdout.flush()
if isinstance(event, paddle.event.EndPass):
with gzip.open('checkpoints/params_pass_%05d.tar.gz' % \
event.pass_id, 'w') as f:
parameters.to_tar(f)
result = trainer.test(reader=dev_reader, feeding=feeding)
print "\nTest with Pass %d, TestCost: %f, Detection mAP=%g" % \
(event.pass_id,
result.cost,
result.metrics['detection_evaluator'])
trainer.train(
reader=train_reader,
event_handler=event_handler,
num_passes=cfg.TRAIN.NUM_PASS,
feeding=feeding)
if __name__ == "__main__":
paddle.init(use_gpu=True, trainer_count=4)
data_args = data_provider.Settings(
data_dir='./data',
label_file='label_list',
resize_h=cfg.IMG_HEIGHT,
resize_w=cfg.IMG_WIDTH,
mean_value=[104, 117, 124])
train(
train_file_list='./data/trainval.txt',
dev_file_list='./data/test.txt',
data_args=data_args,
init_model_path='./vgg/vgg_model.tar.gz')
import paddle.v2 as paddle
from config.pascal_voc_conf import cfg
def net_conf(mode):
"""Network configuration. Total three modes included 'train' 'eval'
and 'infer'. Loss and mAP evaluation layer will return if using 'train'
and 'eval'. In 'infer' mode, only detection output layer will be returned.
"""
default_l2regularization = cfg.TRAIN.L2REGULARIZATION
default_bias_attr = paddle.attr.ParamAttr(l2_rate=0.0, learning_rate=2.0)
default_static_bias_attr = paddle.attr.ParamAttr(is_static=True)
def get_param_attr(local_lr, regularization):
is_static = False
if local_lr == 0.0:
is_static = True
return paddle.attr.ParamAttr(
learning_rate=local_lr, l2_rate=regularization, is_static=is_static)
def conv_group(stack_num, name_list, input, filter_size_list, num_channels,
num_filters_list, stride_list, padding_list,
common_bias_attr, common_param_attr, common_act):
conv = input
in_channels = num_channels
for i in xrange(stack_num):
conv = paddle.layer.img_conv(
name=name_list[i],
input=conv,
filter_size=filter_size_list[i],
num_channels=in_channels,
num_filters=num_filters_list[i],
stride=stride_list[i],
padding=padding_list[i],
bias_attr=common_bias_attr,
param_attr=common_param_attr,
act=common_act)
in_channels = num_filters_list[i]
return conv
def vgg_block(idx_str, input, num_channels, num_filters, pool_size,
pool_stride, pool_pad):
layer_name = "conv%s_" % idx_str
stack_num = 3
name_list = [layer_name + str(i + 1) for i in xrange(3)]
conv = conv_group(stack_num, name_list, input, [3] * stack_num,
num_channels, [num_filters] * stack_num,
[1] * stack_num, [1] * stack_num, default_bias_attr,
get_param_attr(1, default_l2regularization),
paddle.activation.Relu())
pool = paddle.layer.img_pool(
input=conv,
pool_size=pool_size,
num_channels=num_filters,
pool_type=paddle.pooling.CudnnMax(),
stride=pool_stride,
padding=pool_pad)
return conv, pool
def mbox_block(layer_idx, input, num_channels, filter_size, loc_filters,
conf_filters):
mbox_loc_name = layer_idx + "_mbox_loc"
mbox_loc = paddle.layer.img_conv(
name=mbox_loc_name,
input=input,
filter_size=filter_size,
num_channels=num_channels,
num_filters=loc_filters,
stride=1,
padding=1,
bias_attr=default_bias_attr,
param_attr=get_param_attr(1, default_l2regularization),
act=paddle.activation.Identity())
mbox_conf_name = layer_idx + "_mbox_conf"
mbox_conf = paddle.layer.img_conv(
name=mbox_conf_name,
input=input,
filter_size=filter_size,
num_channels=num_channels,
num_filters=conf_filters,
stride=1,
padding=1,
bias_attr=default_bias_attr,
param_attr=get_param_attr(1, default_l2regularization),
act=paddle.activation.Identity())
return mbox_loc, mbox_conf
def ssd_block(layer_idx, input, img_shape, num_channels, num_filters1,
num_filters2, aspect_ratio, variance, min_size, max_size):
layer_name = "conv" + layer_idx + "_"
stack_num = 2
conv1_name = layer_name + "1"
conv2_name = layer_name + "2"
conv2 = conv_group(stack_num, [conv1_name, conv2_name], input, [1, 3],
num_channels, [num_filters1, num_filters2], [1, 2],
[0, 1], default_bias_attr,
get_param_attr(1, default_l2regularization),
paddle.activation.Relu())
loc_filters = (len(aspect_ratio) * 2 + 1 + len(max_size)) * 4
conf_filters = (
len(aspect_ratio) * 2 + 1 + len(max_size)) * cfg.CLASS_NUM
mbox_loc, mbox_conf = mbox_block(conv2_name, conv2, num_filters2, 3,
loc_filters, conf_filters)
mbox_priorbox = paddle.layer.priorbox(
input=conv2,
image=img_shape,
min_size=min_size,
max_size=max_size,
aspect_ratio=aspect_ratio,
variance=variance)
return conv2, mbox_loc, mbox_conf, mbox_priorbox
img = paddle.layer.data(
name='image',
type=paddle.data_type.dense_vector(cfg.IMG_CHANNEL * cfg.IMG_HEIGHT *
cfg.IMG_WIDTH),
height=cfg.IMG_HEIGHT,
width=cfg.IMG_WIDTH)
stack_num = 2
conv1_2 = conv_group(stack_num, ['conv1_1', 'conv1_2'], img,
[3] * stack_num, 3, [64] * stack_num, [1] * stack_num,
[1] * stack_num, default_static_bias_attr,
get_param_attr(0, 0), paddle.activation.Relu())
pool1 = paddle.layer.img_pool(
name="pool1",
input=conv1_2,
pool_type=paddle.pooling.CudnnMax(),
pool_size=2,
num_channels=64,
stride=2)
stack_num = 2
conv2_2 = conv_group(stack_num, ['conv2_1', 'conv2_2'], pool1, [3] *
stack_num, 64, [128] * stack_num, [1] * stack_num,
[1] * stack_num, default_static_bias_attr,
get_param_attr(0, 0), paddle.activation.Relu())
pool2 = paddle.layer.img_pool(
name="pool2",
input=conv2_2,
pool_type=paddle.pooling.CudnnMax(),
pool_size=2,
num_channels=128,
stride=2)
conv3_3, pool3 = vgg_block("3", pool2, 128, 256, 2, 2, 0)
conv4_3, pool4 = vgg_block("4", pool3, 256, 512, 2, 2, 0)
conv4_3_mbox_priorbox = paddle.layer.priorbox(
input=conv4_3,
image=img,
min_size=cfg.NET.CONV4.PB.MIN_SIZE,
aspect_ratio=cfg.NET.CONV4.PB.ASPECT_RATIO,
variance=cfg.NET.CONV4.PB.VARIANCE)
conv4_3_norm = paddle.layer.cross_channel_norm(
name="conv4_3_norm",
input=conv4_3,
param_attr=paddle.attr.ParamAttr(
initial_mean=20, initial_std=0, is_static=False, learning_rate=1))
conv4_3_norm_mbox_loc, conv4_3_norm_mbox_conf = \
mbox_block("conv4_3_norm", conv4_3_norm, 512, 3, 12, 63)
conv5_3, pool5 = vgg_block("5", pool4, 512, 512, 3, 1, 1)
stack_num = 2
fc7 = conv_group(stack_num, ['fc6', 'fc7'], pool5, [3, 1], 512, [1024] *
stack_num, [1] * stack_num, [1, 0], default_bias_attr,
get_param_attr(1, default_l2regularization),
paddle.activation.Relu())
fc7_mbox_loc, fc7_mbox_conf = mbox_block("fc7", fc7, 1024, 3, 24, 126)
fc7_mbox_priorbox = paddle.layer.priorbox(
input=fc7,
image=img,
min_size=cfg.NET.FC7.PB.MIN_SIZE,
max_size=cfg.NET.FC7.PB.MAX_SIZE,
aspect_ratio=cfg.NET.FC7.PB.ASPECT_RATIO,
variance=cfg.NET.FC7.PB.VARIANCE)
conv6_2, conv6_2_mbox_loc, conv6_2_mbox_conf, conv6_2_mbox_priorbox = \
ssd_block("6", fc7, img, 1024, 256, 512,
cfg.NET.CONV6.PB.ASPECT_RATIO,
cfg.NET.CONV6.PB.VARIANCE,
cfg.NET.CONV6.PB.MIN_SIZE,
cfg.NET.CONV6.PB.MAX_SIZE)
conv7_2, conv7_2_mbox_loc, conv7_2_mbox_conf, conv7_2_mbox_priorbox = \
ssd_block("7", conv6_2, img, 512, 128, 256,
cfg.NET.CONV7.PB.ASPECT_RATIO,
cfg.NET.CONV7.PB.VARIANCE,
cfg.NET.CONV7.PB.MIN_SIZE,
cfg.NET.CONV7.PB.MAX_SIZE)
conv8_2, conv8_2_mbox_loc, conv8_2_mbox_conf, conv8_2_mbox_priorbox = \
ssd_block("8", conv7_2, img, 256, 128, 256,
cfg.NET.CONV8.PB.ASPECT_RATIO,
cfg.NET.CONV8.PB.VARIANCE,
cfg.NET.CONV8.PB.MIN_SIZE,
cfg.NET.CONV8.PB.MAX_SIZE)
pool6 = paddle.layer.img_pool(
name="pool6",
input=conv8_2,
pool_size=3,
num_channels=256,
stride=1,
pool_type=paddle.pooling.Avg())
pool6_mbox_loc, pool6_mbox_conf = mbox_block("pool6", pool6, 256, 3, 24,
126)
pool6_mbox_priorbox = paddle.layer.priorbox(
input=pool6,
image=img,
min_size=cfg.NET.POOL6.PB.MIN_SIZE,
max_size=cfg.NET.POOL6.PB.MAX_SIZE,
aspect_ratio=cfg.NET.POOL6.PB.ASPECT_RATIO,
variance=cfg.NET.POOL6.PB.VARIANCE)
mbox_priorbox = paddle.layer.concat(
name="mbox_priorbox",
input=[
conv4_3_mbox_priorbox, fc7_mbox_priorbox, conv6_2_mbox_priorbox,
conv7_2_mbox_priorbox, conv8_2_mbox_priorbox, pool6_mbox_priorbox
])
loc_loss_input = [
conv4_3_norm_mbox_loc, fc7_mbox_loc, conv6_2_mbox_loc, conv7_2_mbox_loc,
conv8_2_mbox_loc, pool6_mbox_loc
]
conf_loss_input = [
conv4_3_norm_mbox_conf, fc7_mbox_conf, conv6_2_mbox_conf,
conv7_2_mbox_conf, conv8_2_mbox_conf, pool6_mbox_conf
]
detection_out = paddle.layer.detection_output(
input_loc=loc_loss_input,
input_conf=conf_loss_input,
priorbox=mbox_priorbox,
confidence_threshold=cfg.NET.DETOUT.CONFIDENCE_THRESHOLD,
nms_threshold=cfg.NET.DETOUT.NMS_THRESHOLD,
num_classes=cfg.CLASS_NUM,
nms_top_k=cfg.NET.DETOUT.NMS_TOP_K,
keep_top_k=cfg.NET.DETOUT.KEEP_TOP_K,
background_id=cfg.BACKGROUND_ID,
name="detection_output")
if mode == 'train' or mode == 'eval':
bbox = paddle.layer.data(
name='bbox', type=paddle.data_type.dense_vector_sequence(6))
loss = paddle.layer.multibox_loss(
input_loc=loc_loss_input,
input_conf=conf_loss_input,
priorbox=mbox_priorbox,
label=bbox,
num_classes=cfg.CLASS_NUM,
overlap_threshold=cfg.NET.MBLOSS.OVERLAP_THRESHOLD,
neg_pos_ratio=cfg.NET.MBLOSS.NEG_POS_RATIO,
neg_overlap=cfg.NET.MBLOSS.NEG_OVERLAP,
background_id=cfg.BACKGROUND_ID,
name="multibox_loss")
paddle.evaluator.detection_map(
input=detection_out,
label=bbox,
overlap_threshold=cfg.NET.DETMAP.OVERLAP_THRESHOLD,
background_id=cfg.BACKGROUND_ID,
evaluate_difficult=cfg.NET.DETMAP.EVAL_DIFFICULT,
ap_type=cfg.NET.DETMAP.AP_TYPE,
name="detection_evaluator")
return loss, detection_out
elif mode == 'infer':
return detection_out
import cv2
import os
data_dir = './data'
infer_file = './infer.res'
out_dir = './visual_res'
path_to_im = dict()
for line in open(infer_file):
img_path, _, _, _ = line.strip().split('\t')
if img_path not in path_to_im:
im = cv2.imread(os.path.join(data_dir, img_path))
path_to_im[img_path] = im
for line in open(infer_file):
img_path, label, conf, bbox = line.strip().split('\t')
xmin, ymin, xmax, ymax = map(float, bbox.split(' '))
xmin = int(round(xmin))
ymin = int(round(ymin))
xmax = int(round(xmax))
ymax = int(round(ymax))
img = path_to_im[img_path]
cv2.rectangle(img, (xmin, ymin), (xmax, ymax),
(0, (1 - xmin) * 255, xmin * 255), 2)
for img_path in path_to_im:
im = path_to_im[img_path]
out_path = os.path.join(out_dir, os.path.basename(img_path))
cv2.imwrite(out_path, im)
print 'Done.'
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册