add doc for train on baidu cloud

fa043571 · Qiao Longfei · d1615ed5 · fa043571 · fa043571

Showing 22 additions and 5 deletions

fluid/recommendation/ctr/README.cn.md +11 -2

fluid/recommendation/ctr/README.md +11 -3

--- a/fluid/recommendation/ctr/README.cn.md
+++ b/fluid/recommendation/ctr/README.cn.md
@@ -46,9 +46,11 @@ python train.py \
        2>&1 | tee train.log
 ```
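-After training to batch 40000 of pass 1, the test AUC is 0.807178 and the error (cost) is 0.445196.
+After training to batch 40000 of pass 1, the test AUC is 0.801178 and the error (cost) is 0.445196.
-### Launch a distributed training job with 2 trainers and 2 pservers locally
+### Distributed training
+Launch a distributed training job with 2 trainers and 2 pservers locally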
 ```bash
 # start pserver0
@@ -101,3 +103,10 @@ python infer.py \
        --model_path models/pass-0/ \
        --data_path data/valid.txt
 ```
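+Note: the AUC printed last, after infer.py finishes, is the overall AUC for the entire prediction file.
+## Running cluster training on Baidu Cloud
+1. Deploy a CPU cluster on Baidu Cloud by following [Launching Fluid distributed training on Baidu Cloud](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/user_guides/howto/training/train_on_baidu_cloud_cn.rst).
+1. Process the training data with preprocess.py to generate train.txt.
+1. Split train.txt into as many parts as there are machines in the cluster, and put one part on each machine.
+1. Start the distributed training job using the command lines from the distributed training section above.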
\ No newline at end of file
--- a/fluid/recommendation/ctr/README.md
+++ b/fluid/recommendation/ctr/README.md
@@ -52,17 +52,18 @@ python preprocess.py --datadir ./data/raw --outdir ./data
 ## Train
 The command line options for training can be listed by `python train.py -h`.
-### Train in local mode:
+### Local Train:
 ```bash
 python train.py \
        --train_data_path data/train.txt \
        2>&1 | tee train.log
 ```
-After training pass 1 batch 40000, the testing AUC is `0.807178` and the testing
+After training pass 1 batch 40000, the testing AUC is `0.801178` and the testing
 cost is `0.445196`.
-### Run a 2 pserver 2 trainer distribute training on a single machine
+### Distributed Train
+Run a distributed training job with 2 pservers and 2 trainers on a single machine
 ```bash
 # start pserver0
 python train.py \
@@ -114,3 +115,10 @@ python infer.py \
        --model_path models/ \
        --data_path data/valid.txt
 ```
+Note: The AUC value in the last log line is the overall AUC for the entire test dataset.
+## Train on Baidu Cloud
+1. Prepare some CPU machines on Baidu Cloud by following the steps in [train_on_baidu_cloud](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/user_guides/howto/training/train_on_baidu_cloud_cn.rst).
+1. Prepare the dataset using preprocess.py.
+1. Split train.txt into trainer_num parts and put one part on each machine.
+1. Start cluster training using the commands in `Distributed Train` above.
\ No newline at end of file
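
The new README steps say to split train.txt into trainer_num parts but do not show how. One possible way is sketched below. This is a minimal sketch, not part of the commit: the helper name `split_lines`, the round-robin assignment of lines, and the `<file>.part-<index>` shard naming are all assumptions made for illustration.

```python
import os


def split_lines(src_path, num_parts, out_dir):
    """Distribute the lines of src_path round-robin over num_parts shard files.

    Hypothetical helper: the name and the "<file>.part-<index>" naming scheme
    are illustrative and do not come from the repository.
    """
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.basename(src_path)
    shard_paths = [
        os.path.join(out_dir, "%s.part-%d" % (base, i)) for i in range(num_parts)
    ]
    shards = [open(path, "w") for path in shard_paths]
    try:
        with open(src_path) as src:
            # Line k goes to shard k % num_parts, so shard sizes differ by at most one line.
            for line_no, line in enumerate(src):
                shards[line_no % num_parts].write(line)
    finally:
        for shard in shards:
            shard.close()
    return shard_paths
```

For the 2 trainer setup in the README, a call such as `split_lines("data/train.txt", 2, "data/shards")` would produce two shards. Each shard can then be copied to one machine and passed to train.py through `--train_data_path`. Round-robin assignment keeps the shards nearly equal in size without needing to count the lines first.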