Commit fa043571 · Author: Qiao Longfei

add doc for train on baidu cloud

Parent d1615ed5
...@@ -46,9 +46,11 @@ python train.py \ ...@@ -46,9 +46,11 @@ python train.py \
2>&1 | tee train.log 2>&1 | tee train.log
``` ```
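After batch 40000 of pass 1, the test AUC is 0.807178 and the cost is 0.445196. After batch 40000 of pass 1, the test AUC is 0.801178 and the cost is 0.445196.
### Launch a local distributed training job with 2 trainers and 2 pservers ### Distributed training
Launch a local distributed training job with 2 trainers and 2 pservers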
```bash ```bash
# start pserver0 # start pserver0
...@@ -101,3 +103,10 @@ python infer.py \ ...@@ -101,3 +103,10 @@ python infer.py \
--model_path models/pass-0/ \ --model_path models/pass-0/ \
--data_path data/valid.txt --data_path data/valid.txt
``` ```
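Note: the AUC printed last when infer.py finishes is the overall AUC for the entire prediction file.
## Run cluster training on Baidu Cloud
1. Follow [Launching Fluid distributed training on Baidu Cloud](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/user_guides/howto/training/train_on_baidu_cloud_cn.rst) to deploy a CPU cluster on Baidu Cloud.
1. Process the training data with preprocess.py to generate train.txt.
1. Split train.txt into as many parts as there are machines in the cluster, and put one part on each machine.
1. Start the distributed training job with the command lines from the distributed (cluster) training section above.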
\ No newline at end of file
...@@ -52,17 +52,18 @@ python preprocess.py --datadir ./data/raw --outdir ./data ...@@ -52,17 +52,18 @@ python preprocess.py --datadir ./data/raw --outdir ./data
## Train ## Train
The command line options for training can be listed by `python train.py -h`. The command line options for training can be listed by `python train.py -h`.
### Train in local mode: ### Local Train:
```bash ```bash
python train.py \ python train.py \
--train_data_path data/train.txt \ --train_data_path data/train.txt \
2>&1 | tee train.log 2>&1 | tee train.log
``` ```
After training pass 1 batch 40000, the testing AUC is `0.807178` and the testing After training pass 1 batch 40000, the testing AUC is `0.801178` and the testing
cost is `0.445196`. cost is `0.445196`.
### Run a 2 pserver 2 trainer distribute training on a single machine ### Distributed Train
Run a distributed training job with 2 pservers and 2 trainers on a single machine
```bash ```bash
# start pserver0 # start pserver0
python train.py \ python train.py \
...@@ -114,3 +115,10 @@ python infer.py \ ...@@ -114,3 +115,10 @@ python infer.py \
--model_path models/ \ --model_path models/ \
--data_path data/valid.txt --data_path data/valid.txt
``` ```
Note: The AUC value in the last log line is the overall AUC for the entire test dataset.
## Train on Baidu Cloud
1. Prepare a cluster of CPU machines on Baidu Cloud by following the steps in [train_on_baidu_cloud](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/user_guides/howto/training/train_on_baidu_cloud_cn.rst).
1. Prepare the dataset with preprocess.py to generate train.txt.
1. Split train.txt into trainer_num parts and put one part on each machine.
1. Start cluster training using the commands in `Distributed Train` above.
\ No newline at end of file
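The splitting step in the list above (dividing train.txt into trainer_num parts) could be sketched roughly as follows. This is a minimal illustration and not part of the commit: the function name `split_lines`, the output directory, and the shard file names are hypothetical, and it assumes train.txt holds one sample per line, so that round-robin assignment keeps the parts balanced.

```python
from pathlib import Path


def split_lines(src, out_dir, num_parts):
    """Split the text file `src` into `num_parts` shards, line by line.

    Line i goes to shard i % num_parts (round-robin), so shard sizes
    differ by at most one line. Returns the list of shard paths.
    The shard naming scheme below is a hypothetical choice.
    """
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    paths = [out_dir / f"train.part-{i:03d}.txt" for i in range(num_parts)]
    handles = [p.open("w", encoding="utf-8") for p in paths]
    try:
        # Stream the input so a large train.txt never has to fit in memory.
        with open(src, encoding="utf-8") as f:
            for line_no, line in enumerate(f):
                handles[line_no % num_parts].write(line)
    finally:
        for h in handles:
            h.close()
    return paths
```

For the 2-trainer setup, one might call `split_lines("data/train.txt", "data/shards", 2)`, copy one shard to each machine, and presumably point each trainer at its own shard through `--train_data_path`.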