Commit 7a2d0cae, authored by Yibing Liu

Merge branch 'develop' into mfcc_feat_dev

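# Click-Through Rate Prediction
The files in this example directory and their descriptions are as follows:
```
├── README.md               # this tutorial's markdown document
├── dataset.md              # dataset processing tutorial
├── images                  # images used in this tutorial
│   ├── lr_vs_dnn.jpg
│   └── wide_deep.png
├── infer.py                # prediction script
├── network_conf.py         # model network configuration
├── reader.py               # data reader
├── train.py                # training script
├── utils.py                # helper functions
└── avazu_data_processer.py # demo data preprocessing script
```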
## Background
CTR (Click-Through Rate) prediction\[[1](https://en.wikipedia.org/wiki/Click-through_rate)\]
estimates the probability that a user clicks a particular link, and it is an important step in ad serving. Accurate CTR prediction matters greatly for maximizing the revenue of an online advertising system.
When there are multiple ad slots, the predicted CTR is generally used as the basis for ranking. For example, in a search engine's advertising system, when a user enters a commercially valuable search term (query), the system roughly performs the following steps to display ads:
1. Retrieve the set of ads related to the user's query
2. Filter by business rules and relevance
3. Rank by the auction mechanism and CTR
4. Display the ads
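
The ranking step can be sketched as follows. This is only an illustration under a stated assumption: ads are scored by `bid * predicted_ctr`, a common simplification of auction-based ranking, not something this tutorial specifies. The `Ad` tuple, `rank_ads`, and the sample numbers are hypothetical.

```python
from collections import namedtuple

# Hypothetical ad record: name, advertiser bid, and model-predicted CTR.
Ad = namedtuple("Ad", ["name", "bid", "predicted_ctr"])


def rank_ads(ads, num_slots):
    """Sort ads by bid * predicted CTR (an assumed scoring rule), keep the top slots."""
    scored = sorted(ads, key=lambda ad: ad.bid * ad.predicted_ctr, reverse=True)
    return scored[:num_slots]


candidates = [
    Ad("a", bid=1.0, predicted_ctr=0.02),   # score 0.020
    Ad("b", bid=0.5, predicted_ctr=0.08),   # score 0.040
    Ad("c", bid=2.0, predicted_ctr=0.005),  # score 0.010
]
top = rank_ads(candidates, num_slots=2)
print([ad.name for ad in top])
```

Under this rule an ad with a lower bid can outrank a higher bidder when its predicted CTR is high enough, which is why the accuracy of the CTR estimate affects revenue.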
...@@ -36,13 +51,11 @@ Figure 1. Comparison of LR and DNN model structures
</p>
The blue-arrow part of LR maps directly onto the corresponding structure in the DNN. LR and DNN have some things in common (for example, the weighted sum),
but for the same input dimension the former's model complexity may be much lower than the latter's (in a sense, the more complex a model is, the more potential it has to learn complex information);
for LR to match the learning capacity of a DNN, its input dimension must be increased, that is, more features must be added,
which is why LR is necessarily tied to large-scale feature engineering.
LR's advantage over the DNN model is its capacity to accommodate large-scale sparse features; in terms of memory and computation, industry has very mature optimization methods for it.
The DNN model, in contrast, can learn new features on its own, which improves the efficiency of feature use to some extent
and makes the DNN more likely to achieve better results given features of the same scale.
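
The shared "weighted sum" structure can be illustrated with a small pure-Python sketch. It is not the network defined in `network_conf.py`; the function names, the single hidden layer, and the weights are assumptions chosen only to show that a DNN output unit is an LR-style unit applied to learned hidden features.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lr_forward(x, w, b):
    """Logistic regression: one weighted sum followed by a sigmoid."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def dnn_forward(x, hidden_w, hidden_b, out_w, out_b):
    """One hidden ReLU layer, then the same weighted sum + sigmoid as LR."""
    hidden = [
        max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + bi)
        for row, bi in zip(hidden_w, hidden_b)
    ]
    return lr_forward(hidden, out_w, out_b)


x = [1.0, 0.0, 1.0]
lr_out = lr_forward(x, w=[0.5, -0.2, 0.1], b=0.0)
dnn_out = dnn_forward(
    x,
    hidden_w=[[0.5, 0.0, 0.5], [-1.0, 0.0, 0.0]],
    hidden_b=[0.0, 0.0],
    out_w=[1.0, 1.0],
    out_b=-1.0)
print(lr_out, dnn_out)
```

For the same 3-dimensional input, the LR model here has 4 parameters while the tiny DNN has 11, which illustrates the complexity gap described above.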
...@@ -59,10 +72,62 @@ LR's advantage over the DNN model is its capacity to accommodate large-scale sparse features, including
We directly use the first method and treat the problem as a classification task.
We use the dataset of the Kaggle `Click-through rate prediction` task\[[2](https://www.kaggle.com/c/avazu-ctr-prediction/data)\] to demonstrate the model in this example.
For the specific feature processing method, see [data process](./dataset.md).
The input format of the demo model in this tutorial is as follows:
```
# <dnn input ids> \t <lr input sparse values> \t click
1 23 190 \t 230:0.12 3421:0.9 23451:0.12 \t 0
23 231 \t 1230:0.12 13421:0.9 \t 1
```
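The format is described in detail below:
- `dnn input ids` uses a one-hot representation; only the IDs whose value is 1 need to be listed (note that this is not a variable-length input)
- `lr input sparse values` uses the `ID:VALUE` representation; the values should preferably be normalized to the range `[-1, 1]`.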
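
A record in this format can be parsed with a few lines of Python. The sketch below is illustrative only: `parse_train_line` is a hypothetical helper, not part of this example's code, and it assumes the three fields are separated by real tab characters (shown as `\t` in the sample above).

```python
def parse_train_line(line):
    """Split one tab-separated record into (dnn_ids, lr_pairs, click)."""
    dnn_part, lr_part, click_part = line.rstrip("\n").split("\t")
    dnn_ids = [int(tok) for tok in dnn_part.split()]
    lr_pairs = []
    for tok in lr_part.split():
        feature_id, value = tok.split(":")
        lr_pairs.append((int(feature_id), float(value)))
    return dnn_ids, lr_pairs, int(click_part)


sample = "1 23 190\t230:0.12 3421:0.9 23451:0.12\t0"
dnn_ids, lr_pairs, click = parse_train_line(sample)
print(dnn_ids, lr_pairs, click)
```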
In addition, model training requires a file describing the input dimensions of the dnn and lr submodels. The file format is as follows:
```
dnn_input_dim: <int>
lr_input_dim: <int>
```
Here `<int>` denotes an integer value.
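
Reading such a meta file could look like the sketch below. `parse_data_meta` is a hypothetical helper written for illustration (the example's own loader lives in `reader.py`), and the dimensions in the sample text are made-up numbers.

```python
def parse_data_meta(text):
    """Return (dnn_input_dim, lr_input_dim) parsed from the meta file text."""
    dims = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        key, value = line.split(":", 1)
        dims[key.strip()] = int(value)
    return dims["dnn_input_dim"], dims["lr_input_dim"]


# Made-up dimensions, only to exercise the parser.
print(parse_data_meta("dnn_input_dim: 61\nlr_input_dim: 10040001\n"))
```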
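The `avazu_data_processer.py` script in this directory processes the downloaded demo dataset\[[2](https://www.kaggle.com/c/avazu-ctr-prediction/data)\]. Its usage is as follows: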
```
usage: avazu_data_processer.py [-h] --data_path DATA_PATH --output_dir
OUTPUT_DIR
[--num_lines_to_detect NUM_LINES_TO_DETECT]
[--test_set_size TEST_SET_SIZE]
[--train_size TRAIN_SIZE]
PaddlePaddle CTR example
optional arguments:
-h, --help show this help message and exit
--data_path DATA_PATH
path of the Avazu dataset
--output_dir OUTPUT_DIR
directory to output
--num_lines_to_detect NUM_LINES_TO_DETECT
number of records to detect dataset's meta info
--test_set_size TEST_SET_SIZE
size of the validation dataset(default: 10000)
--train_size TRAIN_SIZE
size of the trainset (default: 100000)
```
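- `data_path`: path of the data to process
- `output_dir`: output directory for the generated data
- `num_lines_to_detect`: how much data to scan in advance to generate IDs, given as the number of file lines to scan
- `test_set_size`: number of lines in the generated test set
- `train_size`: number of lines in the generated training set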
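
To make `num_lines_to_detect` concrete, the sketch below scans only the first rows of a CSV and assigns an integer ID to each distinct value of one categorical field. It is a simplified assumption about what "scanning to generate IDs" means; `build_id_dict` and the inline demo data are hypothetical, and how the real script treats values it never saw during the scan is not shown here.

```python
import csv
import io


def build_id_dict(csv_text, field, num_lines_to_detect):
    """Assign an integer ID to each distinct value of `field` in the first rows."""
    value_to_id = {}
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_id, row in enumerate(reader):
        if row_id >= num_lines_to_detect:
            break
        value_to_id.setdefault(row[field], len(value_to_id))
    return value_to_id


demo = "click,banner_pos\n0,0\n1,1\n0,0\n1,7\n"
ids = build_id_dict(demo, "banner_pos", num_lines_to_detect=3)
print(ids)
```

With only three rows scanned, the value `7` in the fourth row receives no ID, which is why the number of scanned lines should be large enough to cover the values that matter.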
## Wide & Deep Learning Model
...@@ -201,18 +266,20 @@ trainer.train(
## Running Training and Testing
Training the model requires the following steps:
1. Prepare the training data
    1. Download train.gz from [Kaggle CTR](https://www.kaggle.com/c/avazu-ctr-prediction/data)
    2. Decompress train.gz to get train.txt
    3. Run `mkdir -p output; python avazu_data_processer.py --data_path train.txt --output_dir output --num_lines_to_detect 1000 --test_set_size 100` to generate the demo data
2. Run `python train.py --train_data_path ./output/train.txt --test_data_path ./output/test.txt --data_meta_file ./output/data.meta.txt --model_type=0` to start training

In step 2 above, command-line arguments can be passed to `train.py` to customize the training process. The arguments and their usage are as follows:
```
usage: train.py [-h] --train_data_path TRAIN_DATA_PATH
                [--test_data_path TEST_DATA_PATH] [--batch_size BATCH_SIZE]
                [--num_passes NUM_PASSES]
                [--model_output_prefix MODEL_OUTPUT_PREFIX] --data_meta_file
                DATA_META_FILE --model_type MODEL_TYPE

PaddlePaddle CTR example
...@@ -220,16 +287,78 @@ optional arguments:
  -h, --help            show this help message and exit
  --train_data_path TRAIN_DATA_PATH
                        path of training dataset
  --test_data_path TEST_DATA_PATH
                        path of testing dataset
  --batch_size BATCH_SIZE
                        size of mini-batch (default:10000)
  --num_passes NUM_PASSES
                        number of passes to train
  --model_output_prefix MODEL_OUTPUT_PREFIX
                        prefix of path for model to store (default:
                        ./ctr_models)
  --data_meta_file DATA_META_FILE
                        path of data meta info file
  --model_type MODEL_TYPE
                        model type, classification: 0, regression 1 (default
                        classification)
```
- `train_data_path`: path of the training set
- `test_data_path`: path of the test set
- `num_passes`: number of passes to train the model
- `data_meta_file`: see the description in the "Data and Task Abstraction" section.
- `model_type`: whether the model does classification or regression
## Predicting with the Trained Model
The trained model can be used to predict on new data. The format of the prediction data is:
```
# <dnn input ids> \t <lr input sparse values>
1 23 190 \t 230:0.12 3421:0.9 23451:0.12
23 231 \t 1230:0.12 13421:0.9
```
The only difference from the training data format is that there is no label, that is, no value for the third column `click` of the training data.
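
Since the two formats differ only in the label column, a training record can be turned into a prediction record by dropping its last field. The helper below is a hypothetical sketch and assumes real tab separators between fields.

```python
def to_infer_line(train_line):
    """Drop the trailing click label from a tab-separated training record."""
    dnn_part, lr_part, _click = train_line.rstrip("\n").split("\t")
    return dnn_part + "\t" + lr_part


converted = to_infer_line("23 231\t1230:0.12 13421:0.9\t1\n")
print(repr(converted))
```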
`infer.py` is used as follows:
```
usage: infer.py [-h] --model_gz_path MODEL_GZ_PATH --data_path DATA_PATH
--prediction_output_path PREDICTION_OUTPUT_PATH
[--data_meta_path DATA_META_PATH] --model_type MODEL_TYPE
PaddlePaddle CTR example
optional arguments:
-h, --help show this help message and exit
--model_gz_path MODEL_GZ_PATH
path of model parameters gz file
--data_path DATA_PATH
path of the dataset to infer
--prediction_output_path PREDICTION_OUTPUT_PATH
path to output the prediction
--data_meta_path DATA_META_PATH
path of trainset's meta info, default is ./data.meta
--model_type MODEL_TYPE
model type, classification: 0, regression 1 (default
classification)
```
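- `model_gz_path`: path of the model compressed with `gz`
- `data_path`: path of the data to predict
- `prediction_output_path`: path for the prediction output
- `data_meta_path`: see the description in the "Data and Task Abstraction" section.
- `model_type`: classification or regression

The demo data can be predicted with the following command:
```
python infer.py --model_gz_path <model_path> --data_path output/infer.txt --prediction_output_path predictions.txt --data_meta_path output/data.meta.txt --model_type 0
```
The final predictions are written to `predictions.txt`.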
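
A prediction file can then be post-processed, for example as in the sketch below. It assumes the file holds one floating-point score per line; `load_predictions` and the `0.5` threshold are hypothetical choices for illustration, not part of the example.

```python
def load_predictions(lines, threshold=0.5):
    """Parse one float per line and pair it with a thresholded 0/1 decision."""
    results = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        score = float(line)
        results.append((score, int(score >= threshold)))
    return results


parsed = load_predictions(["0.12\n", "0.73\n", "\n"])
print(parsed)
```

In practice the lines would come from `open("predictions.txt")`; a list of strings is used here so the sketch runs without the file.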
## References
1. <https://en.wikipedia.org/wiki/Click-through_rate>
2. <https://www.kaggle.com/c/avazu-ctr-prediction/data>
......
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import sys
import csv
import cPickle
import argparse
import numpy as np

from utils import logger, TaskMode

parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
parser.add_argument(
    '--data_path', type=str, required=True, help="path of the Avazu dataset")
parser.add_argument(
    '--output_dir', type=str, required=True, help="directory to output")
parser.add_argument(
    '--num_lines_to_detect',
    type=int,
    default=500000,
    help="number of records to detect dataset's meta info")
parser.add_argument(
    '--test_set_size',
    type=int,
    default=10000,
    help="size of the validation dataset(default: 10000)")
parser.add_argument(
    '--train_size',
    type=int,
    default=100000,
    help="size of the trainset (default: 100000)")
args = parser.parse_args()
'''
The fields of the dataset are:
...@@ -22,7 +50,7 @@ The fields of the dataset are:
    15. device_conn_type
    16. C14-C21 -- anonymized categorical variables

We will treat the following fields as categorical features:

    - C1
    - banner_pos
...@@ -40,6 +68,14 @@ and some other features as id features:

The `hour` field will be treated as a continuous feature and will be transformed
to one-hot representation which has 24 bits.

This script will output 3 files:

1. train.txt
2. test.txt
3. infer.txt

all the files are for demo.
'''
feature_dims = {}
...@@ -161,6 +197,7 @@ def detect_dataset(path, topn, id_fea_space=10000):
    NOTE the records should be randomly shuffled first.
    '''
    # create categorical statis objects.
    logger.warning('detecting dataset')

    with open(path, 'rb') as csvfile:
        reader = csv.DictReader(csvfile)
...@@ -174,9 +211,6 @@ def detect_dataset(path, topn, id_fea_space=10000):
    for key, item in fields.items():
        feature_dims[key] = item.size()

    feature_dims['hour'] = 24
    feature_dims['click'] = 1
...@@ -184,10 +218,17 @@ def detect_dataset(path, topn, id_fea_space=10000):
        feature_dims[key] for key in categorial_features + ['hour']) + 1
    feature_dims['lr_input'] = np.sum(feature_dims[key]
                                      for key in id_features) + 1
    return feature_dims


def load_data_meta(meta_path):
    '''
    Load dataset's meta information.
    '''
    feature_dims, fields = cPickle.load(open(meta_path, 'rb'))
    return feature_dims, fields


def concat_sparse_vectors(inputs, dims):
    '''
    Concatenate more than one sparse vectors into one.
...@@ -211,67 +252,162 @@ class AvazuDataset(object):
    '''
    Load AVAZU dataset as train set.
    '''

    def __init__(self,
                 train_path,
                 n_records_as_test=-1,
                 fields=None,
                 feature_dims=None):
        self.train_path = train_path
        self.n_records_as_test = n_records_as_test
        self.fields = fields
        # default is train mode.
        self.mode = TaskMode.create_train()

        self.categorial_dims = [
            feature_dims[key] for key in categorial_features + ['hour']
        ]
        self.id_dims = [feature_dims[key] for key in id_features]

    def train(self):
        '''
        Load trainset.
        '''
        logger.info("load trainset from %s" % self.train_path)
        self.mode = TaskMode.create_train()
        with open(self.train_path) as f:
            reader = csv.DictReader(f)

            for row_id, row in enumerate(reader):
                # skip top n lines, which are reserved for the test set
                if self.n_records_as_test > 0 and row_id < self.n_records_as_test:
                    continue

                rcd = self._parse_record(row)
                if rcd:
                    yield rcd

    def test(self):
        '''
        Load testset.
        '''
        logger.info("load testset from %s" % self.train_path)
        self.mode = TaskMode.create_test()
        with open(self.train_path) as f:
            reader = csv.DictReader(f)

            for row_id, row in enumerate(reader):
                # stop once the top n lines have been read
                if self.n_records_as_test > 0 and row_id > self.n_records_as_test:
                    break

                rcd = self._parse_record(row)
                if rcd:
                    yield rcd

    def infer(self):
        '''
        Load inferset.
        '''
        logger.info("load inferset from %s" % self.train_path)
        self.mode = TaskMode.create_infer()
        with open(self.train_path) as f:
            reader = csv.DictReader(f)

            for row_id, row in enumerate(reader):
                rcd = self._parse_record(row)
                if rcd:
                    yield rcd

    def _parse_record(self, row):
        '''
        Parse a CSV row and get a record.
        '''
        record = []
        for key in categorial_features:
            record.append(self.fields[key].gen(row[key]))
        record.append([int(row['hour'][-2:])])
        dense_input = concat_sparse_vectors(record, self.categorial_dims)

        record = []
        for key in id_features:
            if 'cross' not in key:
                record.append(self.fields[key].gen(row[key]))
            else:
                fea0 = self.fields[key].cross_fea0
                fea1 = self.fields[key].cross_fea1
                record.append(
                    self.fields[key].gen_cross_fea(row[fea0], row[fea1]))

        sparse_input = concat_sparse_vectors(record, self.id_dims)

        record = [dense_input, sparse_input]

        if not self.mode.is_infer():
            record.append(list((int(row['click']), )))
        return record
def ids2dense(vec, dim):
    return vec


def ids2sparse(vec):
    return ["%d:1" % x for x in vec]


detect_dataset(args.data_path, args.num_lines_to_detect)
dataset = AvazuDataset(
    args.data_path,
    args.test_set_size,
    fields=fields,
    feature_dims=feature_dims)

output_trainset_path = os.path.join(args.output_dir, 'train.txt')
output_testset_path = os.path.join(args.output_dir, 'test.txt')
output_infer_path = os.path.join(args.output_dir, 'infer.txt')
output_meta_path = os.path.join(args.output_dir, 'data.meta.txt')

with open(output_trainset_path, 'w') as f:
    for id, record in enumerate(dataset.train()):
        if id and id % 10000 == 0:
            logger.info("load %d records" % id)
        if id > args.train_size:
            break
        dnn_input, lr_input, click = record
        dnn_input = ids2dense(dnn_input, feature_dims['dnn_input'])
        lr_input = ids2sparse(lr_input)
        line = "%s\t%s\t%d\n" % (' '.join(map(str, dnn_input)),
                                 ' '.join(map(str, lr_input)), click[0])
        f.write(line)
    logger.info('write to %s' % output_trainset_path)

with open(output_testset_path, 'w') as f:
    for id, record in enumerate(dataset.test()):
        dnn_input, lr_input, click = record
        dnn_input = ids2dense(dnn_input, feature_dims['dnn_input'])
        lr_input = ids2sparse(lr_input)
        line = "%s\t%s\t%d\n" % (' '.join(map(str, dnn_input)),
                                 ' '.join(map(str, lr_input)), click[0])
        f.write(line)
    logger.info('write to %s' % output_testset_path)

with open(output_infer_path, 'w') as f:
    for id, record in enumerate(dataset.infer()):
        dnn_input, lr_input = record
        dnn_input = ids2dense(dnn_input, feature_dims['dnn_input'])
        lr_input = ids2sparse(lr_input)
        line = "%s\t%s\n" % (' '.join(map(str, dnn_input)),
                             ' '.join(map(str, lr_input)), )
        f.write(line)
        if id > args.test_set_size:
            break
    logger.info('write to %s' % output_infer_path)

with open(output_meta_path, 'w') as f:
    lines = [
        "dnn_input_dim: %d" % feature_dims['dnn_input'],
        "lr_input_dim: %d" % feature_dims['lr_input']
    ]
    f.write('\n'.join(lines))
    logger.info('write data meta into %s' % output_meta_path)
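# Data and Processing
## Dataset Introduction
This tutorial demonstrates how to preprocess the dataset of the Kaggle CTR task\[[3](#参考文献)\] into the format required by this model. For the detailed data format, see [README.md](./README.md).
The strength of the Wide & Deep Model\[[2](#参考文献)\] is that it combines dense features with large-scale sparse features,
so feature processing also handles both kinds of features:
the dense values of the Deep part are all converted to ID features
and turned into dense vector inputs through embedding, while the Wide part mainly raises dimensionality through cross products of IDs.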
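
The idea of crossing two ID features into one higher-dimensional sparse feature can be sketched as follows. This is an assumption-laden illustration, not the script's `gen_cross_fea` implementation: it hashes the value pair into a fixed ID space with `zlib.crc32`, and `cross_feature_id`, the `id_space` default, and the sample values are hypothetical.

```python
import zlib


def cross_feature_id(value_a, value_b, id_space=10000):
    """Map a pair of categorical values to one ID in [0, id_space)."""
    key = ("%s_%s" % (value_a, value_b)).encode("utf-8")
    # crc32 is deterministic across runs, unlike the built-in hash() for strings.
    return (zlib.crc32(key) & 0xFFFFFFFF) % id_space


a = cross_feature_id("site_1", "app_9")
b = cross_feature_id("site_1", "app_9")
c = cross_feature_id("site_2", "app_9")
print(a, b, c)
```

Hashing keeps the crossed feature space bounded at the cost of occasional collisions between different pairs.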
The dataset is stored in `csv` format, with the following fields:
- `id` : ad identifier
......
...@@ -42,15 +42,30 @@
<div id="markdown" style='display:none'>
......
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import gzip
import argparse
import itertools

import paddle.v2 as paddle
import network_conf
from train import dnn_layer_dims
import reader
from utils import logger, ModelType

parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
parser.add_argument(
    '--model_gz_path',
    type=str,
    required=True,
    help="path of model parameters gz file")
parser.add_argument(
    '--data_path', type=str, required=True, help="path of the dataset to infer")
parser.add_argument(
    '--prediction_output_path',
    type=str,
    required=True,
    help="path to output the prediction")
parser.add_argument(
    '--data_meta_path',
    type=str,
    default="./data.meta",
    help="path of trainset's meta info, default is ./data.meta")
parser.add_argument(
    '--model_type',
    type=int,
    required=True,
    default=ModelType.CLASSIFICATION,
    help='model type, classification: %d, regression %d (default classification)'
    % (ModelType.CLASSIFICATION, ModelType.REGRESSION))

args = parser.parse_args()

paddle.init(use_gpu=False, trainer_count=1)


class CTRInferer(object):
    def __init__(self, param_path):
        logger.info("create CTR model")
        dnn_input_dim, lr_input_dim = reader.load_data_meta(args.data_meta_path)
        # create the model
        self.ctr_model = network_conf.CTRmodel(
            dnn_layer_dims,
            dnn_input_dim,
            lr_input_dim,
            model_type=ModelType(args.model_type),
            is_infer=True)
        # load parameter
        logger.info("load model parameters from %s" % param_path)
        self.parameters = paddle.parameters.Parameters.from_tar(
            gzip.open(param_path, 'r'))
        self.inferer = paddle.inference.Inference(
            output_layer=self.ctr_model.model,
            parameters=self.parameters, )

    def infer(self, data_path):
        logger.info("infer data...")
        dataset = reader.Dataset()
        # read from the path passed in by the caller
        infer_reader = paddle.batch(dataset.infer(data_path), batch_size=1000)
        logger.warning('write predictions to %s' % args.prediction_output_path)
        # the with-block closes the output file once all batches are written
        with open(args.prediction_output_path, 'w') as output_f:
            for id, batch in enumerate(infer_reader()):
                res = self.inferer.infer(input=batch)
                predictions = [x for x in itertools.chain.from_iterable(res)]
                assert len(batch) == len(
                    predictions), "predict error, %d inputs, but %d predictions" % (
                        len(batch), len(predictions))
                output_f.write('\n'.join(map(str, predictions)) + '\n')


if __name__ == '__main__':
    ctr_inferer = CTRInferer(args.model_gz_path)
    ctr_inferer.infer(args.data_path)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import paddle.v2 as paddle
from paddle.v2 import layer
from paddle.v2 import data_type as dtype
from utils import logger, ModelType


class CTRmodel(object):
    '''
    A CTR model which implements wide && deep learning model.
    '''

    def __init__(self,
                 dnn_layer_dims,
                 dnn_input_dim,
                 lr_input_dim,
                 model_type=ModelType.create_classification(),
                 is_infer=False):
        '''
        @dnn_layer_dims: list of integer
            dims of each layer in dnn
        @dnn_input_dim: int
            size of dnn's input layer
        @lr_input_dim: int
            size of lr's input layer
        @is_infer: bool
            whether to build a infer model
        '''
        self.dnn_layer_dims = dnn_layer_dims
        self.dnn_input_dim = dnn_input_dim
        self.lr_input_dim = lr_input_dim
        self.model_type = model_type
        self.is_infer = is_infer

        self._declare_input_layers()

        self.dnn = self._build_dnn_submodel_(self.dnn_layer_dims)
        self.lr = self._build_lr_submodel_()

        # model's prediction
        # TODO(superjom) rename it to prediction
        if self.model_type.is_classification():
            self.model = self._build_classification_model(self.dnn, self.lr)
        if self.model_type.is_regression():
            self.model = self._build_regression_model(self.dnn, self.lr)

    def _declare_input_layers(self):
        self.dnn_merged_input = layer.data(
            name='dnn_input',
            type=paddle.data_type.sparse_binary_vector(self.dnn_input_dim))

        self.lr_merged_input = layer.data(
            name='lr_input',
            type=paddle.data_type.sparse_vector(self.lr_input_dim))

        if not self.is_infer:
            self.click = paddle.layer.data(
                name='click', type=dtype.dense_vector(1))

    def _build_dnn_submodel_(self, dnn_layer_dims):
        '''
        build DNN submodel.
        '''
        dnn_embedding = layer.fc(
            input=self.dnn_merged_input, size=dnn_layer_dims[0])
        _input_layer = dnn_embedding
        for i, dim in enumerate(dnn_layer_dims[1:]):
            fc = layer.fc(
                input=_input_layer,
                size=dim,
                act=paddle.activation.Relu(),
                name='dnn-fc-%d' % i)
            _input_layer = fc
        return _input_layer

    def _build_lr_submodel_(self):
        '''
        config LR submodel
        '''
        fc = layer.fc(
            input=self.lr_merged_input, size=1, act=paddle.activation.Relu())
        return fc

    def _build_classification_model(self, dnn, lr):
        merge_layer = layer.concat(input=[dnn, lr])
        self.output = layer.fc(
            input=merge_layer,
            size=1,
            # use sigmoid function to approximate ctr rate, a float value between 0 and 1.
            act=paddle.activation.Sigmoid())

        if not self.is_infer:
            self.train_cost = paddle.layer.multi_binary_label_cross_entropy_cost(
                input=self.output, label=self.click)
        return self.output

    def _build_regression_model(self, dnn, lr):
        merge_layer = layer.concat(input=[dnn, lr])
        self.output = layer.fc(
            input=merge_layer, size=1, act=paddle.activation.Sigmoid())
        if not self.is_infer:
            self.train_cost = paddle.layer.mse_cost(
                input=self.output, label=self.click)
        return self.output
from utils import logger, TaskMode, load_dnn_input_record, load_lr_input_record

feeding_index = {'dnn_input': 0, 'lr_input': 1, 'click': 2}


class Dataset(object):
    def __init__(self):
        self.mode = TaskMode.create_train()

    def train(self, path):
        '''
        Load trainset.
        '''
        logger.info("load trainset from %s" % path)
        self.mode = TaskMode.create_train()
        self.path = path
        return self._parse

    def test(self, path):
        '''
        Load testset.
        '''
        logger.info("load testset from %s" % path)
        self.path = path
        self.mode = TaskMode.create_test()
        return self._parse

    def infer(self, path):
        '''
        Load infer set.
        '''
        logger.info("load inferset from %s" % path)
        self.path = path
        self.mode = TaskMode.create_infer()
        return self._parse

    def _parse(self):
        '''
        Parse dataset.
        '''
        with open(self.path) as f:
            for line_id, line in enumerate(f):
                fs = line.strip().split('\t')
                dnn_input = load_dnn_input_record(fs[0])
                lr_input = load_lr_input_record(fs[1])
                if not self.mode.is_infer():
                    click = [int(fs[2])]
                    yield dnn_input, lr_input, click
                else:
                    yield dnn_input, lr_input


def load_data_meta(path):
    '''
    load data meta info from path, return (dnn_input_dim, lr_input_dim)
    '''
    with open(path) as f:
        lines = f.read().split('\n')
        err_info = "wrong meta format"
        assert len(lines) == 2, err_info
        assert 'dnn_input_dim:' in lines[0] and 'lr_input_dim:' in lines[
            1], err_info
        res = map(int, [_.split(':')[1] for _ in lines])
        logger.info('dnn input dim: %d' % res[0])
        logger.info('lr input dim: %d' % res[1])
        return res
 #!/usr/bin/env python
 # -*- coding: utf-8 -*-
+import os
 import argparse
-import logging
-import paddle.v2 as paddle
-from paddle.v2 import layer
-from paddle.v2 import data_type as dtype
-from data_provider import field_index, detect_dataset, AvazuDataset
-parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
-parser.add_argument(
-    '--train_data_path',
-    type=str,
-    required=True,
-    help="path of training dataset")
-parser.add_argument(
-    '--batch_size',
-    type=int,
-    default=10000,
-    help="size of mini-batch (default:10000)")
-parser.add_argument(
-    '--test_set_size',
-    type=int,
-    default=10000,
-    help="size of the validation dataset(default: 10000)")
-parser.add_argument(
-    '--num_passes', type=int, default=10, help="number of passes to train")
-parser.add_argument(
-    '--num_lines_to_detact',
-    type=int,
-    default=500000,
-    help="number of records to detect dataset's meta info")
-args = parser.parse_args()
-dnn_layer_dims = [128, 64, 32, 1]
-data_meta_info = detect_dataset(args.train_data_path, args.num_lines_to_detact)
-logging.warning('detect categorical fields in dataset %s' %
-                args.train_data_path)
-for key, item in data_meta_info.items():
-    logging.warning(' - {}\t{}'.format(key, item))
-paddle.init(use_gpu=False, trainer_count=1)
-# ==============================================================================
-# input layers
-# ==============================================================================
-dnn_merged_input = layer.data(
-    name='dnn_input',
-    type=paddle.data_type.sparse_binary_vector(data_meta_info['dnn_input']))
-lr_merged_input = layer.data(
-    name='lr_input',
-    type=paddle.data_type.sparse_binary_vector(data_meta_info['lr_input']))
-click = paddle.layer.data(name='click', type=dtype.dense_vector(1))
-# ==============================================================================
-# network structure
-# ==============================================================================
-def build_dnn_submodel(dnn_layer_dims):
-    dnn_embedding = layer.fc(input=dnn_merged_input, size=dnn_layer_dims[0])
-    _input_layer = dnn_embedding
-    for i, dim in enumerate(dnn_layer_dims[1:]):
-        fc = layer.fc(
-            input=_input_layer,
-            size=dim,
-            act=paddle.activation.Relu(),
-            name='dnn-fc-%d' % i)
-        _input_layer = fc
-    return _input_layer
-# config LR submodel
-def build_lr_submodel():
-    fc = layer.fc(
-        input=lr_merged_input, size=1, name='lr', act=paddle.activation.Relu())
-    return fc
-# conbine DNN and LR submodels
-def combine_submodels(dnn, lr):
-    merge_layer = layer.concat(input=[dnn, lr])
-    fc = layer.fc(
-        input=merge_layer,
-        size=1,
-        name='output',
-        # use sigmoid function to approximate ctr rate, a float value between 0 and 1.
-        act=paddle.activation.Sigmoid())
-    return fc
-dnn = build_dnn_submodel(dnn_layer_dims)
-lr = build_lr_submodel()
-output = combine_submodels(dnn, lr)
+import gzip
+import reader
+import paddle.v2 as paddle
+from utils import logger, ModelType
+from network_conf import CTRmodel
+def parse_args():
+    parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
+    parser.add_argument(
+        '--train_data_path',
+        type=str,
+        required=True,
+        help="path of training dataset")
+    parser.add_argument(
+        '--test_data_path', type=str, help='path of testing dataset')
+    parser.add_argument(
+        '--batch_size',
+        type=int,
+        default=10000,
+        help="size of mini-batch (default:10000)")
+    parser.add_argument(
+        '--num_passes', type=int, default=10, help="number of passes to train")
+    parser.add_argument(
+        '--model_output_prefix',
+        type=str,
+        default='./ctr_models',
+        help='prefix of path for model to store (default: ./ctr_models)')
+    parser.add_argument(
+        '--data_meta_file',
+        type=str,
+        required=True,
+        help='path of data meta info file', )
+    parser.add_argument(
+        '--model_type',
+        type=int,
+        required=True,
+        default=ModelType.CLASSIFICATION,
+        help='model type, classification: %d, regression %d (default classification)'
+        % (ModelType.CLASSIFICATION, ModelType.REGRESSION))
+    return parser.parse_args()
+dnn_layer_dims = [128, 64, 32, 1]
 # ==============================================================================
 # cost and train period
 # ==============================================================================
-classification_cost = paddle.layer.multi_binary_label_cross_entropy_cost(
-    input=output, label=click)
-params = paddle.parameters.create(classification_cost)
-optimizer = paddle.optimizer.Momentum(momentum=0.01)
-trainer = paddle.trainer.SGD(
-    cost=classification_cost, parameters=params, update_equation=optimizer)
-dataset = AvazuDataset(
-    args.train_data_path, n_records_as_test=args.test_set_size)
-def event_handler(event):
-    if isinstance(event, paddle.event.EndIteration):
-        num_samples = event.batch_id * args.batch_size
-        if event.batch_id % 100 == 0:
-            logging.warning("Pass %d, Samples %d, Cost %f" %
-                            (event.pass_id, num_samples, event.cost))
-        if event.batch_id % 1000 == 0:
-            result = trainer.test(
-                reader=paddle.batch(dataset.test, batch_size=args.batch_size),
-                feeding=field_index)
-            logging.warning("Test %d-%d, Cost %f" %
-                            (event.pass_id, event.batch_id, result.cost))
-trainer.train(
-    reader=paddle.batch(
-        paddle.reader.shuffle(dataset.train, buf_size=500),
-        batch_size=args.batch_size),
-    feeding=field_index,
-    event_handler=event_handler,
-    num_passes=args.num_passes)
+def train():
+    args = parse_args()
+    args.model_type = ModelType(args.model_type)
+    paddle.init(use_gpu=False, trainer_count=1)
+    dnn_input_dim, lr_input_dim = reader.load_data_meta(args.data_meta_file)
+    # create ctr model.
+    model = CTRmodel(
+        dnn_layer_dims,
+        dnn_input_dim,
+        lr_input_dim,
+        model_type=args.model_type,
+        is_infer=False)
+    params = paddle.parameters.create(model.train_cost)
+    optimizer = paddle.optimizer.AdaGrad()
+    trainer = paddle.trainer.SGD(
+        cost=model.train_cost, parameters=params, update_equation=optimizer)
+    dataset = reader.Dataset()
+    def __event_handler__(event):
+        if isinstance(event, paddle.event.EndIteration):
+            num_samples = event.batch_id * args.batch_size
+            if event.batch_id % 100 == 0:
+                logger.warning("Pass %d, Samples %d, Cost %f, %s" % (
+                    event.pass_id, num_samples, event.cost, event.metrics))
+            if event.batch_id % 1000 == 0:
+                if args.test_data_path:
+                    result = trainer.test(
+                        reader=paddle.batch(
+                            dataset.test(args.test_data_path),
+                            batch_size=args.batch_size),
+                        feeding=reader.feeding_index)
+                    logger.warning("Test %d-%d, Cost %f, %s" %
+                                   (event.pass_id, event.batch_id, result.cost,
+                                    result.metrics))
+                    path = "{}-pass-{}-batch-{}-test-{}.tar.gz".format(
+                        args.model_output_prefix, event.pass_id, event.batch_id,
+                        result.cost)
+                    with gzip.open(path, 'w') as f:
+                        params.to_tar(f)
+    trainer.train(
+        reader=paddle.batch(
+            paddle.reader.shuffle(
+                dataset.train(args.train_data_path), buf_size=500),
+            batch_size=args.batch_size),
+        feeding=reader.feeding_index,
+        event_handler=__event_handler__,
+        num_passes=args.num_passes)
+if __name__ == '__main__':
+    train()
import logging

logging.basicConfig()
logger = logging.getLogger("paddle")
logger.setLevel(logging.INFO)


class TaskMode:
    TRAIN_MODE = 0
    TEST_MODE = 1
    INFER_MODE = 2

    def __init__(self, mode):
        self.mode = mode

    def is_train(self):
        return self.mode == self.TRAIN_MODE

    def is_test(self):
        return self.mode == self.TEST_MODE

    def is_infer(self):
        return self.mode == self.INFER_MODE

    @staticmethod
    def create_train():
        return TaskMode(TaskMode.TRAIN_MODE)

    @staticmethod
    def create_test():
        return TaskMode(TaskMode.TEST_MODE)

    @staticmethod
    def create_infer():
        return TaskMode(TaskMode.INFER_MODE)


class ModelType:
    CLASSIFICATION = 0
    REGRESSION = 1

    def __init__(self, mode):
        self.mode = mode

    def is_classification(self):
        return self.mode == self.CLASSIFICATION

    def is_regression(self):
        return self.mode == self.REGRESSION

    @staticmethod
    def create_classification():
        return ModelType(ModelType.CLASSIFICATION)

    @staticmethod
    def create_regression():
        return ModelType(ModelType.REGRESSION)


def load_dnn_input_record(sent):
    # space-separated feature ids; return a list so it can be indexed and reused
    return [int(x) for x in sent.split()]


def load_lr_input_record(sent):
    # space-separated "id:value" pairs, returned as (int, float) tuples
    res = []
    for _ in [x.split(':') for x in sent.split()]:
        res.append((int(_[0]), float(_[1]), ))
    return res
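Judging from `Dataset._parse` together with the two loaders above, each data line seems to hold three tab-separated fields: space-separated feature ids for the DNN input, space-separated `id:value` pairs for the LR input, and an integer click label (absent in infer mode). The sketch below parses one such line without depending on Paddle. The function name `parse_record` and the sample line are hypothetical, invented only to illustrate the layout.

```python
def parse_record(line, is_infer=False):
    """Split one tab-separated record into (dnn_input, lr_input[, click])."""
    fs = line.strip().split('\t')
    # DNN input: plain integer feature ids.
    dnn_input = [int(x) for x in fs[0].split()]
    # LR input: "id:value" pairs turned into (int, float) tuples.
    lr_input = [(int(i), float(v))
                for i, v in (x.split(':') for x in fs[1].split())]
    if is_infer:
        return dnn_input, lr_input
    return dnn_input, lr_input, [int(fs[2])]


# Made-up sample record, for illustration only.
sample = "3 17 42\t5:1.0 9:0.5\t1"
record = parse_record(sample)
print(record)
```

The positions in the returned tuple line up with `feeding_index` in `reader.py`, which maps `dnn_input`, `lr_input` and `click` to indices 0, 1 and 2.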
@@ -43,7 +43,10 @@ def train(model_save_dir):
                 parameters.to_tar(f)
     trainer.train(
-        paddle.batch(paddle.dataset.imikolov.train(word_dict, 5), 64),
+        paddle.batch(
+            paddle.reader.shuffle(
+                lambda: paddle.dataset.imikolov.train(word_dict, 5)(),
+                buf_size=1000), 64),
         num_passes=1000,
         event_handler=event_handler)