Commit fa043571 authored by Qiao Longfei

add doc for train on baidu cloud

Parent d1615ed5
......@@ -46,9 +46,11 @@ python train.py \
2>&1 | tee train.log
```
After training to batch 40000 of pass 1, the test AUC is 0.807178 and the cost is 0.445196.
After training to batch 40000 of pass 1, the test AUC is 0.801178 and the cost is 0.445196.
### Launch a distributed training job with 2 trainers and 2 pservers on the local machine
### Distributed Training
Launch a distributed training job with 2 trainers and 2 pservers on the local machine
```bash
# start pserver0
......@@ -101,3 +103,10 @@ python infer.py \
--model_path models/pass-0/ \
--data_path data/valid.txt
```
Note: the AUC that infer.py prints at the end of its run is the overall AUC for the whole prediction file.
## Run Cluster Training on Baidu Cloud
1. Deploy a CPU cluster on Baidu Cloud following the guide [Launch Fluid Distributed Training on Baidu Cloud](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/user_guides/howto/training/train_on_baidu_cloud_cn.rst).
1. Process the training data with preprocess.py to generate train.txt.
1. Split train.txt into as many parts as there are machines in the cluster and place one part on each machine.
1. Start the distributed training job with the commands from the `Distributed Training` section above.
\ No newline at end of file
......@@ -52,17 +52,18 @@ python preprocess.py --datadir ./data/raw --outdir ./data
## Train
The command line options for training can be listed by `python train.py -h`.
### Train in local mode:
### Local Train:
```bash
python train.py \
--train_data_path data/train.txt \
2>&1 | tee train.log
```
After training pass 1 batch 40000, the testing AUC is `0.807178` and the testing
After training pass 1 batch 40000, the testing AUC is `0.801178` and the testing
cost is `0.445196`.
### Run a distributed training job with 2 pservers and 2 trainers on a single machine
### Distributed Train
Run a distributed training job with 2 pservers and 2 trainers on a single machine
```bash
# start pserver0
python train.py \
......@@ -114,3 +115,10 @@ python infer.py \
--model_path models/ \
--data_path data/valid.txt
```
Note: the AUC value in the last log line is the overall AUC for the whole test dataset.
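The overall AUC mentioned in the note is computed over every prediction in the file at once, not by averaging per-batch AUCs. As a minimal sketch of what "whole-file AUC" means, here is a rank-based (Wilcoxon-statistic) implementation; `overall_auc` is an illustrative helper, not a function from infer.py:

```python
def overall_auc(labels, scores):
    # Rank-based AUC over the full prediction set, equivalent to the
    # area under the ROC curve; handles tied scores via average ranks.
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = 0.0  # sum of ranks of the positive examples
    i, rank = 0, 1
    while i < len(pairs):
        # find the group of tied scores [i, j)
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (2 * rank + (j - i) - 1) / 2.0
        for k in range(i, j):
            if pairs[k][1] == 1:
                rank_sum += avg_rank
        rank += j - i
        i = j
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
```

Averaging the per-batch AUCs of a streamed test set generally gives a different (and biased) number, which is why only the final log line should be reported.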
## Train on Baidu Cloud
1. Prepare some CPU machines on Baidu Cloud following the steps in [train_on_baidu_cloud](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/user_guides/howto/training/train_on_baidu_cloud_cn.rst).
1. Prepare the dataset using preprocess.py.
1. Split train.txt into trainer_num parts and put one part on each machine.
1. Run the cluster training job using the commands in the `Distributed Train` section above.
\ No newline at end of file
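The split step above can be sketched as a simple round-robin shard of the line-based train.txt. `split_dataset` and the `train.part-*` file names are illustrative assumptions, not part of the example's scripts:

```python
def split_dataset(path, trainer_num, out_prefix="train.part"):
    # Distribute the lines of a text dataset round-robin into
    # trainer_num shard files: out_prefix-0, out_prefix-1, ...
    shards = [open("%s-%d" % (out_prefix, i), "w") for i in range(trainer_num)]
    with open(path) as f:
        for i, line in enumerate(f):
            shards[i % trainer_num].write(line)
    for s in shards:
        s.close()
```

Each shard would then be copied to one trainer machine (for example with `scp`) before launching the distributed job.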