Commit 8e400d22 authored by Z zhoushiyu, committed by Thunderbrook

Fix bug in deepfm (#3429)

* fix deepfm failing in 2*1 local distributed training

* fix deepfm preprocess bug

* fix deepfm preprocess bug and README
Parent de021e4e
......@@ -23,7 +23,7 @@
For details on the DCN model, see the paper [Deep & Cross Network for Ad Click Predictions](https://arxiv.org/abs/1708.05123)
## Environment
- PaddlePaddle 1.5.2
- PaddlePaddle 1.6
## Download data
......@@ -72,6 +72,9 @@ loss: [0.44703564] auc_val: [0.80654419]
cd dist_data && sh dist_download.sh && cd ..
```
Run the following command to simulate distributed training locally. By default, a 2 X 2 setup is used, i.e. 2 pservers and 2 trainers.
**Note: for distributed training, we recommend Paddle 1.6 or later, or [the latest build](https://www.paddlepaddle.org.cn/documentation/docs/zh/beginners_guide/install/Tables.html#whl-dev).**
```bash
sh cluster_train.sh
```
......@@ -98,6 +101,7 @@ python infer.py --model_output_dir cluster_model --test_epoch 10 --test_valid_da
- Trainer 0 saves the model parameters.
- After each training run, the pserver processes must be stopped manually. Use the following command to list them:
>ps -ef | grep python
- Data loading uses the Dataset API, which is currently only supported on Linux.
......@@ -15,7 +15,7 @@ This model implementation reproduces the result of the paper "DeepFM: A Factoriz
```
## Environment
- PaddlePaddle 1.5
- PaddlePaddle 1.6
## Download and preprocess data
......@@ -53,6 +53,8 @@ When the training set is iterated to the 22nd round, the testing Logloss is `0.4
## Distributed Train
We emulate distributed training on a local machine. By default, we use a 2 X 2 setup, i.e. 2 pservers X 2 trainers.
**Note: we suggest using Paddle 1.6 or later, or [the latest Paddle](https://www.paddlepaddle.org.cn/documentation/docs/zh/beginners_guide/install/Tables.html#whl-dev), for distributed training.**
### Download and preprocess distributed demo dataset
This small demo dataset (a few lines from the Criteo dataset) only verifies that distributed training can run.
```bash
......@@ -78,7 +80,7 @@ other params explained in cluster_train.py
Infer
```bash
python infer.py --model_output_dir cluster_model --test_epoch 50 --test_data_dir=dist_data/dist_test_data --feat_dict='dist_data/aid_data/feat_dict_10.pkl2'
python infer.py --model_output_dir cluster_model --test_epoch 10 --test_data_dir=dist_data/dist_test_data --feat_dict='dist_data/aid_data/feat_dict_10.pkl2'
```
Notes:
......@@ -87,7 +89,8 @@ Notes:
- The first trainer(with trainer_id 0) saves model params.
- After each training run, the pserver processes should be stopped manually. You can use the command below:
>ps -ef | grep python
- We use the Dataset API to load data; it is currently only supported on Linux.
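The manual cleanup described above can be scripted. A minimal sketch of filtering `ps -ef` output for the processes to stop — the helper name `extract_pids` and the sample rows are illustrative assumptions, not part of this repo:

```python
def extract_pids(ps_output, pattern):
    """Return PIDs (second column of `ps -ef` output) of rows containing `pattern`."""
    pids = []
    for row in ps_output.splitlines():
        cols = row.split()
        if pattern in row and len(cols) > 1 and cols[1].isdigit():
            pids.append(int(cols[1]))
    return pids

# Toy `ps -ef` output; in practice it would come from
# subprocess.check_output(["ps", "-ef"]).decode().
sample = (
    "UID        PID  PPID  C STIME TTY  TIME     CMD\n"
    "user      4242     1  0 10:00 ?    00:00:01 python cluster_train.py --role pserver\n"
    "user      4243     1  0 10:00 ?    00:00:01 bash cluster_train.sh\n"
)
print(extract_pids(sample, "python"))  # [4242]
```

The returned PIDs can then be passed to `kill`, which is what the `ps -ef | grep python` step above prepares for.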
......
......@@ -38,7 +38,7 @@ def parse_args():
parser.add_argument(
'--num_epoch',
type=int,
default=50,
default=10,
help="The number of epochs to train (default: 10)")
parser.add_argument(
'--model_output_dir',
......@@ -73,7 +73,7 @@ def parse_args():
parser.add_argument(
'--reg', type=float, default=1e-4, help=' (default: 1e-4)')
parser.add_argument('--num_field', type=int, default=39)
parser.add_argument('--num_feat', type=int, default=135483)
parser.add_argument('--num_feat', type=int, default=141443)
parser.add_argument('--use_gpu', type=int, default=1)
# dist params
......
......@@ -59,7 +59,7 @@ def get_feat_dict():
for line_idx, line in enumerate(fin):
if line_idx % 100000 == 0:
print('generating feature dict', line_idx / 45000000)
features = line.lstrip('\n').split('\t')
features = line.rstrip('\n').split('\t')
for idx in categorical_range_:
if features[idx] == '': continue
feat_cnt.update([features[idx]])
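The `lstrip('\n')` → `rstrip('\n')` change above is the heart of the preprocessing fix: `lstrip` only strips from the left, so the trailing newline survived and stayed glued to the last tab-separated field. A toy illustration (the row is made up, not real Criteo data):

```python
line = "1\t0.5\tfeatA\tfeatB\n"  # toy Criteo-style tab-separated row

# Buggy: lstrip removes leading newlines only, so the last field keeps '\n'.
bad = line.lstrip('\n').split('\t')
# Fixed: rstrip removes the trailing newline before splitting.
good = line.rstrip('\n').split('\t')

print(repr(bad[-1]))   # 'featB\n'
print(repr(good[-1]))  # 'featB'
```

With the bug, `'featB\n'` and `'featB'` would be counted as two different categorical values in `feat_cnt`.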
......@@ -77,19 +77,9 @@ def get_feat_dict():
for idx in continuous_range_:
feat_dict[idx] = tc
tc += 1
# Discrete features
cnt_feat_set = set()
with open('train.txt', 'r') as fin:
for line_idx, line in enumerate(fin):
features = line.rstrip('\n').split('\t')
for idx in categorical_range_:
if features[idx] == '' or features[idx] not in dis_feat_set:
continue
if features[idx] not in cnt_feat_set:
cnt_feat_set.add(features[idx])
feat_dict[features[idx]] = tc
for feat in dis_feat_set:
feat_dict[feat] = tc
tc += 1
# Save dictionary
with open(dir_feat_dict_, 'wb') as fout:
pickle.dump(feat_dict, fout, protocol=2)
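The replaced block above also simplifies discrete-feature indexing: rather than re-reading `train.txt` to assign ids in first-occurrence order, the new code gives every feature already collected in `dis_feat_set` an id directly. A toy sketch of the new scheme, with made-up counts and a hypothetical frequency `cutoff` (the real script's threshold logic sits in the elided lines):

```python
from collections import Counter

continuous_range_ = range(1, 4)                 # toy dense-feature column indices
feat_cnt = Counter({"a": 12, "b": 7, "c": 1})   # toy categorical-value counts
cutoff = 5                                      # hypothetical frequency threshold
dis_feat_set = {f for f, c in feat_cnt.items() if c >= cutoff}

feat_dict, tc = {}, 1
# Continuous features: one id per column index.
for idx in continuous_range_:
    feat_dict[idx] = tc
    tc += 1
# Discrete features: one id per surviving value, no second pass over the file.
for feat in dis_feat_set:
    feat_dict[feat] = tc
    tc += 1

print(len(feat_dict) + 1)  # what preprocess prints as args.num_feat; here 6
```

This is also why the commit bumps the `--num_feat` default: regenerating the dictionary with the fixed logic changes `len(feat_dict) + 1`.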
......
......@@ -33,7 +33,7 @@ def get_feat_dict():
feat_cnt = Counter()
with open(INPUT_FILE, 'r') as fin:
for line_idx, line in enumerate(fin):
features = line.lstrip('\n').split('\t')
features = line.rstrip('\n').split('\t')
for idx in categorical_range_:
if features[idx] == '': continue
feat_cnt.update([features[idx]])
......@@ -53,20 +53,9 @@ def get_feat_dict():
for idx in continuous_range_:
feat_dict[idx] = tc
tc += 1
# Discrete features
cnt_feat_set = set()
with open(INPUT_FILE, 'r') as fin:
for line_idx, line in enumerate(fin):
features = line.rstrip('\n').split('\t')
for idx in categorical_range_:
if features[idx] == '' or features[idx] not in feat_set:
continue
if features[idx] not in cnt_feat_set:
cnt_feat_set.add(features[idx])
feat_dict[features[idx]] = tc
for feat in feat_set:
feat_dict[feat] = tc
tc += 1
# Save dictionary
with open(dir_feat_dict_, 'wb') as fout:
pickle.dump(feat_dict, fout, protocol=2)
print('args.num_feat ', len(feat_dict) + 1)
......
......@@ -12,7 +12,7 @@ sh download.sh
```
## Environment
- PaddlePaddle 1.5
- PaddlePaddle 1.6
## Local training
```bash
......@@ -35,6 +35,8 @@ test_epoch is set to load the model saved at epoch 10.
## Distributed training
Run the following command to simulate distributed training locally. By default, a 2 X 2 setup is used, i.e. 2 pservers and 2 trainers.
**Note: for distributed training, we recommend Paddle 1.6 or later, or [the latest build](https://www.paddlepaddle.org.cn/documentation/docs/zh/beginners_guide/install/Tables.html#whl-dev).**
Download the data with the same command as above.
```bash
sh cluster_train.sh
......@@ -62,6 +64,7 @@ python infer.py --model_output_dir cluster_model --test_epoch 10 --use_gpu=0
- Trainer 0 saves the model parameters.
- After each training run, the pserver processes must be stopped manually. Use the following command to list them:
>ps -ef | grep python
- Data loading uses the Dataset API, which is currently only supported on Linux.