Unverified commit 2660ae9b, authored by Jiawei Wang, committed by GitHub

Update deploy_ctr_on_baidu_cloud_cn.rst (#1518)

* Add picture for deploy_ctr_on_baidu_cloud_cn.rst

* Baidu yun K8S + Volcano CTR Training Document 1.0

* Add ctr_pserver_log.png for deploy_ctr_on_baidu_cloud_cn.rst

* Update deploy_ctr_on_baidu_cloud_cn.rst

* Update deploy_ctr_on_baidu_cloud_cn.rst

* Add pserver-log

* Update deploy_ctr_on_baidu_cloud_cn.rst

* change image of kubectl download version from 1.13.3 to 1.13.4

* Update deploy_ctr_on_baidu_cloud_cn.rst

* Update deploy_ctr_on_baidu_cloud_cn.rst

Update on Sept 18:
remove descriptions of helm, tiller, and Go installation.
replace with the new yaml provided by Jinghui Zhang

* add some images

* Update deploy_ctr_on_baidu_cloud_cn.rst

add model output manipulation part

* Update deploy_ctr_on_baidu_cloud_cn.rst

* Update deploy_ctr_on_baidu_cloud_cn.rst

* auto convert from md

* Update deploy_ctr_on_baidu_cloud_cn.rst

* replace the overview.png

* Update deploy_ctr_on_baidu_cloud_cn.rst

* Update deploy_ctr_on_baidu_cloud_cn.rst

* Delete train_on_baidu_cloud_en.rst

It is outdated and has not matched the Chinese version for months.

* Delete train_on_baidu_cloud_cn.rst

Has been replaced by ELASTIC CTR

* Update deploy_ctr_on_baidu_cloud_cn.rst

* Update deploy_ctr_on_baidu_cloud_cn.rst

* Update deploy_ctr_on_baidu_cloud_cn.rst

* Update deploy_ctr_on_baidu_cloud_cn.rst
Parent 56e3b177
@@ -10,7 +10,7 @@ ELASTIC CTR
* `1. Overview <#head1>`_
* `2. Prerequisites <#head2>`_
* `3. One-click deployment of distributed training + serving <#head3>`_
* `3. One-click deployment of distributed training + Serving <#head3>`_
* `4. Checking the results <#head4>`_
* `5. Guide to further development <#head5>`_
@@ -20,10 +20,10 @@ ELASTIC CTR
This project provides an end-to-end solution for CTR training and further development. Its main features are:
* Use a K8S cluster to avoid problems that occur when training on physical clusters, such as redundant configuration parameters and complicated environment setup.
* Use the Volcano framework, developed on top of Kube-batch, for job submission and elastic scheduling.
* Use Paddle Serving to put models online and serve predictions.
* Use Cube as the distributed storage for sparse parameters, integrated with Paddle Serving on the prediction side.
* The whole solution deploys in one click in a k8s environment and can be set up and verified quickly
* Large-scale, high-speed distributed training based on the Paddle transpiler mode
* Elastic scaling of training resources
* Industrial-grade sparse-parameter Serving component whose total throughput per unit time under high concurrency is 13 times that of Redis [\ `Note 1 <#annotation_1>`_\ ]
The overall workflow of this solution is shown in the figure below:
@@ -37,11 +37,13 @@ ELASTIC CTR
* trainer/pserver: the training stage uses the PaddlePaddle parameter server mode, corresponding to the trainer and pserver roles. Distributed training uses \ `volcano <https://volcano.sh/>`_\  as the batch job management tool
* file server: the model files produced by training are hosted on the File Server for downstream modules to download. They include the ProgramDesc and the model parameters; the largest embedding among the model parameters is converted by a tool into seqfile format and delivered through a series of steps to the cube distributed sparse parameter service, while the remaining model parameters are delivered unchanged to the Paddle Serving module
* cube-transfer: monitors changes to the model files (hadoop sequence files) produced by upstream training jobs, pulls them to the local machine, and calls cube-builder to build the cube dictionary files; it then notifies the cube-agent nodes to pull the latest dictionary files and keeps the versions on the cube-server instances consistent
* file server: the model files produced by training are hosted on the File Server for downstream modules to download. They include the ProgramDesc and the model parameters; the largest embedding among the model parameters is converted by a tool into seqfile format and delivered through a series of steps to the Cube distributed sparse parameter service, while the remaining model parameters are delivered unchanged to the Paddle Serving module
* cube-transfer: monitors changes to the model files (hadoop sequence files) produced by upstream training jobs, pulls them to the local machine, and calls cube-builder to build the Cube dictionary files; it then notifies the cube-agent nodes to pull the latest dictionary files and keeps the versions on the cube-server instances consistent
* cube-builder: converts the model files produced by training jobs (hadoop sequence file format) into dictionary files that cube-server can load. The dictionary files use a specific data structure that is highly optimized for size and in-memory access
* Cube-Server: service node that provides sharded kv read/write capability
* Cube-agent: deployed on the same machine as cube-server; receives the dictionary update commands issued by cube-transfer, pulls the data to the local machine, and notifies cube-server to update
* cube-server: service node that provides sharded kv read/write capability
* cube-agent: deployed on the same machine as cube-server; receives the dictionary update commands issued by cube-transfer, pulls the data to the local machine, and notifies cube-server to update
* Paddle Serving: loads the ProgramDesc and dense parameters of the CTR estimation model and provides the prediction service
* Client: demo client for the CTR estimation task
Chained together, the components above cover the whole workflow from training to prediction deployment. The one-click deployment script \ `paddle-suite.sh <https://github.com/PaddlePaddle/Serving/blob/master/doc/resource/paddle-suite.sh>`_\  provided with this document deploys all of the components above in one step.
@@ -58,7 +60,7 @@ ELASTIC CTR
**Section 2 Prerequisites** guides the user, starting from scratch, through requesting a BCE cluster on Baidu Cloud and deploying the volcano tool. This solution uses \ `volcano <https://volcano.sh/>`_\  as the batch job management tool for the training stage and has been verified on Baidu Cloud
**Section 3 Deploying the distributed training + serving solution** uses paddle-suite.sh to deploy the complete distributed training + serving pipeline in one click, and explains in detail what each step of the script does and why
**Section 3 Deploying the distributed training + Serving solution** uses paddle-suite.sh to deploy the complete distributed training + serving pipeline in one click, and explains in detail what each step of the script does and why
**Section 4 Checking the results** verifies the state of the one-click installation from the output of each pod
@@ -76,7 +78,7 @@ ELASTIC CTR
`Baidu Intelligent Cloud CCE Container Engine documentation - Creating a cluster <https://cloud.baidu.com/doc/CCE/GettingStarted/24.5C.E5.88.9B.E5.BB.BA.E9.9B.86.E7.BE.A4.html#.E6.93.8D.E4.BD.9C.E6.AD.A5.E9.AA.A4>`_\ : create a cluster on Baidu Intelligent Cloud; the node configuration must meet the following requirements
* Number of CPU cores &gt; 4
* Number of CPU cores > 4
Example of requesting the container engine:
@@ -146,7 +148,7 @@ ELASTIC CTR
:alt: image
3. :raw-html-m2r:`<span id='head3'>One-click deployment of distributed training + serving</span>`
3. :raw-html-m2r:`<span id='head3'>One-click deployment of distributed training + Serving</span>`
========================================================================================================
3.1 Download the deployment script files
@@ -397,6 +399,8 @@ Example pserver log:
$ docker build -t ${DOCKER_IMAGE_NAME} .
$ docker push ${DOCKER_IMAGE_NAME}
We recommend using the image registry provided by Baidu Cloud; see the documentation \ `Pushing an image to the image registry <https://cloud.baidu.com/doc/CCE/s/Yjxppt74z/#%E6%8E%A8%E9%80%81%E9%95%9C%E5%83%8F%E5%88%B0%E9%95%9C%E5%83%8F%E4%BB%93%E5%BA%93>`_\
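As a minimal sketch of that flow (the registry address and namespace below are placeholders for illustration, not real values; substitute the ones shown in your own image registry console):

.. code-block:: bash

   # Placeholder registry address and namespace -- replace with your own values
   $ REGISTRY=registry.example.com
   $ NAMESPACE=my-namespace
   $ DOCKER_IMAGE_NAME=${REGISTRY}/${NAMESPACE}/ctr-demo:latest
   $ docker login ${REGISTRY}                 # authenticate with the registry credentials
   $ docker build -t ${DOCKER_IMAGE_NAME} .   # build the image from the Dockerfile above
   $ docker push ${DOCKER_IMAGE_NAME}         # push it so that the cluster can pull it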
5.2 Specifying the training scale
---------------------------------
@@ -427,10 +431,10 @@ Example pserver log:
As shown in the figure above
5.3 Specifying the number of shards and replicas of the cube parameter server
5.3 Specifying the number of shards and replicas of the Cube parameter server
------------------------------------------------------------------------------
In the cube.yaml file we can see the definition of each cube node, consisting of a \ ``cubeserver pod``\  and a \ ``cube serverservice``\ . If we need to increase the number of cube replicas and shards, we only need to copy the relevant definitions and environment variables in the yaml file.
In the cube.yaml file we can see the definition of each Cube node, consisting of a \ ``cube server pod``\  and a \ ``cube server service``\ . If we need to increase the number of cube replicas and shards, we only need to copy the relevant definitions and environment variables in the yaml file.
.. image:: src/cube_config1.png
@@ -444,7 +448,7 @@ Example pserver log:
:alt: image
Of the two images above, one is the definition of the cube POD and the other the definition of the cubeSERVICE. To increase the number of Cube shards, copy the POD and SERVICE definitions and rename them. The example ships with 2 shards; after copying, the third can be named cube-2.
Of the two images above, one is the definition of the Cube POD and the other the definition of the CubeSERVICE. To increase the number of Cube shards, copy the POD and SERVICE definitions and rename them. The example ships with 2 shards; after copying, the third can be named cube-2.
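A quick way to confirm that the extra shard is up (assuming kubectl is configured for the cluster and cube.yaml is applied directly) is to re-apply the edited file and list the cube pods:

.. code-block:: bash

   $ kubectl apply -f cube.yaml      # apply the edited definitions
   $ kubectl get pods | grep cube    # with a third shard added, expect cube-0, cube-1 and cube-2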
5.4 Adapting Serving to a new model
-----------------------------------
@@ -460,3 +464,125 @@ Example pserver log:
Users can make their modifications on this basis.
For the complete Paddle Serving development workflow, see \ `Writing a prediction service from scratch with Serving <https://github.com/PaddlePaddle/Serving/blob/develop/doc/CREATING.md>`_\  and the \ `other Paddle Serving documents <https://github.com/PaddlePaddle/Serving/tree/develop/doc>`_
Notes
=====
Note 1. :raw-html-m2r:`<span id='annotation_1'>Test environment for the Cube vs. Redis performance comparison</span>`
--------------------------------------------------------------------------------------------------------------------
Both Cube and Redis were deployed in the Baidu Cloud environment; the test measures the performance of a single Cube server node and a single Redis server node only.
The client and the server sit on 2 separate cloud hosts, with an inter-machine ping latency of 0.3ms-0.5ms.
Machine configuration: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, 32 cores
Cube test environment
^^^^^^^^^^^^^^^^^^^^^
Test keys are 64-bit integers; values are 10 floats (40 bytes)
First, complete the deployment with the one-click deployment script of this solution.
Write the test code with the Cube client SDK of Paddle Serving
Basic approach: start k threads; each thread accesses the Cube server M times, fetching a batch of N keys each time; the total time is summed and averaged, as formalized below.
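Written as formulas (a restatement of the methodology above, not an extra measurement detail), with :math:`t_{ij}` the elapsed time of the j-th batch request issued by thread i, the two reported metrics are:

.. math::

   \text{avg response time} = \frac{\sum_{i=1}^{k}\sum_{j=1}^{M} t_{ij}}{k \cdot M},
   \qquad
   \text{total qps} \approx \frac{k}{\text{avg response time}}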
.. list-table::
:header-rows: 1
* - Concurrency (load-test threads)
- batch size
- Average response time (us)
- total qps
* - 1
- 1000
- 1312
- 762
* - 4
- 1000
- 1496
- 2674
* - 8
- 1000
- 1585
- 5047
* - 16
- 1000
- 1866
- 8574
* - 24
- 1000
- 2236
- 10733
* - 32
- 1000
- 2602
- 12298
Redis test environment
^^^^^^^^^^^^^^^^^^^^^^
Test keys are random integers between 1 and 1000000; values are 40-byte strings
The server side runs Redis-server (latest stable 5.0.6)
The client side is \ `get_values.cpp <https://github.com/PaddlePaddle/Serving/blob/master/doc/resource/get_value.cpp>`_\ , a client written on top of \ `redisplusplus <https://github.com/sewenew/redis-plus-plus>`_
Basic approach: start k threads; each thread accesses the Redis server M times, fetching a batch of N keys with mget each time. The total time is summed and averaged.
Invocation:
.. code-block:: bash
$ ./get_values -h 192.168.1.1 -t 3 -r 10000 -b 1000
where
-h hostname of the server
-t number of concurrent threads
-r number of requests per thread
-b number of keys per mget request
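For example, to reproduce the 32-thread row of the table below (the server address is a placeholder, and the per-thread request count simply follows the sample invocation above):

.. code-block:: bash

   $ ./get_values -h 192.168.1.1 -t 32 -r 10000 -b 1000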
.. list-table::
:header-rows: 1
* - Concurrency (load-test threads)
- batch size
- Average response time (us)
- total qps
* - 1
- 1000
- 1159
- 862
* - 4
- 1000
- 3537
- 1079
* - 8
- 1000
- 7726
- 1073
* - 16
- 1000
- 15440
- 1034
* - 24
- 1000
- 24279
- 1004
* - 32
- 1000
- 32570
- 996
Test conclusions
^^^^^^^^^^^^^^^^
Thanks to Redis's efficient event-driven model and fully in-memory operation, at a concurrency of 1 the average response time of Redis is nearly 50% lower than Cube's (1100us vs. 1680us)
In terms of scalability, Redis is constrained by its single-threaded model: as concurrency increases, its response time grows multiplicatively while total throughput plateaus at around 1000 qps. Cube's total qps, by contrast, keeps rising as the load-test concurrency increases, showing that Cube handles concurrent requests well and scales well.
With a small number of threads, RocksDB's average response time and qps are worse than Redis's, but in the tests with 16 or more threads RocksDB delivers faster response times and higher qps.
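As a sanity check on the two tables (simple arithmetic, not an additional measurement), the reported total qps is roughly the concurrency divided by the average response time; at 32 threads:

.. math::

   \text{Cube: } \frac{32}{2602 \times 10^{-6}\,\text{s}} \approx 12298\ \text{qps},
   \qquad
   \text{Redis: } \frac{32}{32570 \times 10^{-6}\,\text{s}} \approx 982\ \text{qps}

which matches the reported Cube figure and is close to the 996 qps reported for Redis.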
.. _train_on_baidu_cloud_cn:
Launching distributed training on Baidu Cloud
=============================================
PaddlePaddle Fluid distributed training can be launched without relying on cluster systems (such as MPI or Kubernetes).
This chapter uses `Baidu Cloud <https://cloud.baidu.com/>`_ as an example to show how to launch large-scale distributed
jobs in a cloud environment, or even a cloud GPU environment.
Creating a cluster template
---------------------------
Log in to the Baidu Cloud console, select the BCC service, and click "Create Instance". Select a region; note that only some regions have GPU servers available.
After selecting a suitable region, select the corresponding machine type and create an empty server, as shown below:
.. image:: src/create_gpu_machine.png
* In the operating system options, choose the version you need; pick the CUDA version according to your actual situation. Here we choose CUDA-9.2.
* In the example, the machine's payment method is post-paid, which means billing stops when the machine is released; this is more economical for one-off jobs.
After the machine has been created, run the commands below to install the GPU version of paddlepaddle and the related dependencies.
.. code-block:: bash
apt-get update && apt-get install -y python python-pip python-opencv
# Note: the Baidu Cloud cuda-9.2 image does not have cudnn or nccl2 installed by default; they must be installed manually. If you install them yourself, download them from the official websites
wget -q "http://paddle-train-on-cloud.cdn.bcebos.com/libcudnn7_7.2.1.38-1+cuda9.2_amd64.deb"
wget -q "http://paddle-train-on-cloud.cdn.bcebos.com/nccl_2.2.13-1+cuda9.0_x86_64.txz"
dpkg -i libcudnn7_7.2.1.38-1+cuda9.2_amd64.deb
ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.7 /usr/lib/libcudnn.so
unxz nccl_2.2.13-1+cuda9.0_x86_64.txz
tar xf nccl_2.2.13-1+cuda9.0_x86_64.tar
cp -r nccl_2.2.13-1+cuda9.0_x86_64/lib/* /usr/lib
# Note: you may optionally use the pip mirror below to speed up the download
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple matplotlib==2.2.3
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple paddlepaddle-gpu==0.15.0.post97
After the installation is complete, use the test program below to check whether the current machine can run a GPU training program correctly; if you hit an error, fix the
runtime environment problem according to the error message. To make it easier to start the GPU cluster, once the test program has run successfully, select the current server and choose "Create Custom Image";
you can then select this configured image when creating the GPU cluster later.
.. image:: src/create_image.png
* Test program:
.. code-block:: python
from __future__ import print_function
import paddle.fluid.core as core
import math
import os
import sys
import numpy
import paddle
import paddle.fluid as fluid
BATCH_SIZE = 64
PASS_NUM = 1
def loss_net(hidden, label):
prediction = fluid.layers.fc(input=hidden, size=10, act='softmax')
loss = fluid.layers.cross_entropy(input=prediction, label=label)
avg_loss = fluid.layers.mean(loss)
acc = fluid.layers.accuracy(input=prediction, label=label)
return prediction, avg_loss, acc
def conv_net(img, label):
conv_pool_1 = fluid.nets.simple_img_conv_pool(
input=img,
filter_size=5,
num_filters=20,
pool_size=2,
pool_stride=2,
act="relu")
conv_pool_1 = fluid.layers.batch_norm(conv_pool_1)
conv_pool_2 = fluid.nets.simple_img_conv_pool(
input=conv_pool_1,
filter_size=5,
num_filters=50,
pool_size=2,
pool_stride=2,
act="relu")
return loss_net(conv_pool_2, label)
def train(use_cuda):
if use_cuda and not fluid.core.is_compiled_with_cuda():
return
img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
prediction, avg_loss, acc = conv_net(img, label)
test_program = fluid.default_main_program().clone(for_test=True)
optimizer = fluid.optimizer.Adam(learning_rate=0.001)
optimizer.minimize(avg_loss)
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)
train_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.mnist.train(), buf_size=500),
batch_size=BATCH_SIZE)
test_reader = paddle.batch(
paddle.dataset.mnist.test(), batch_size=BATCH_SIZE)
feeder = fluid.DataFeeder(feed_list=[img, label], place=place)
exe.run(fluid.default_startup_program())
for pass_id in range(PASS_NUM):
for batch_id, data in enumerate(train_reader()):
acc_np, avg_loss_np = exe.run(fluid.default_main_program(),
feed=feeder.feed(data),
fetch_list=[acc, avg_loss])
if (batch_id + 1) % 10 == 0:
print(
'PassID {0:1}, BatchID {1:04}, Loss {2:2.2}, Acc {3:2.2}'.
format(pass_id, batch_id + 1,
float(avg_loss_np.mean()), float(acc_np.mean())))
if __name__ == '__main__':
train(True)
Creating the cluster
--------------------
After the image has been created, you can use this configured image to create a GPU cluster, with as many GPU servers as your actual needs require.
As an example, 2 GPU servers are started here; this includes the server created in the previous step, so only one new server needs to be started.
Click "Create Instance" and, in the same region, select GPU servers with the same configuration; make sure to select the image you just created as the operating system.
.. image:: src/create_more_nodes.png
Writing the cluster job launch script
-------------------------------------
To make it easier to launch distributed training jobs on more GPU servers, we will use
`fabric <http://www.fabfile.org/>`_
as the cluster job launch management tool. You can choose another cluster framework you are familiar with, such as MPI or Kubernetes; the method demonstrated in this example
is only intended for simple cluster environments in which the servers can ssh into each other.
To install fabric, run:
.. code-block:: bash
pip install fabric
Suppose we have created 2 GPU servers whose IPs are :code:`172.16.0.5,172.16.0.6` . On the first server,
first create the training program file :code:`dist_train_demo.py`, downloading the code from
`here <https://raw.githubusercontent.com/PaddlePaddle/FluidDoc/develop/doc/fluid/user_guides/howto/training/src/dist_train_demo.py>`_ .
Then write the :code:`fabfile.py` script, which controls starting the parameter servers and trainers of the training job on the different servers:
.. code-block:: python
from fabric import Group, task
endpoints = "172.16.0.5:6173,172.16.0.6:6173"
port = "6173"
pservers = 2
trainers = 2
hosts = []
eps = []
for ep in endpoints.split(","):
eps.append(ep)
hosts.append(ep.split(":")[0])
def start_server(c):
current_endpoint = "%s:%s" % (c.host, port)
trainer_id = hosts.index(c.host)
cmd = "python /root/work/dist_train_demo.py pserver %s %s %d %d &> /root/work/server.log.%s &" % (
endpoints, current_endpoint, trainer_id, trainers, c.host)
c.run(cmd)
def start_trainer(c):
current_endpoint = "%s:%s" % (c.host, port)
trainer_id = hosts.index(c.host)
cmd = "python /root/work/dist_train_demo.py trainer %s %s %d %d &> /root/work/trainer.log.%s &" % (
endpoints, current_endpoint, trainer_id, trainers, c.host)
c.run(cmd)
@task
def start(c):
c.connect_kwargs.password = "work@paddle123"
c.run("mkdir -p /root/work")
c.put("dist_train_demo.py", "/root/work")
start_server(c)
start_trainer(c)
@task
def tail_log(c):
c.connect_kwargs.password = "work@paddle123"
c.run("tail /root/work/trainer.log.%s" % c.host)
After saving the above code to :code:`fabfile.py`, run
.. code-block:: bash
fab -H 172.16.0.5,172.16.0.6 start
to start a distributed training job. This job launches 2 pserver processes and 2 trainer processes across the two GPU servers and begins training.
Getting the distributed training results
----------------------------------------
The example job writes logs under :code:`/root/work`, named
:code:`pserver.log.[IP]` and :code:`trainer.log.[IP]` respectively; you can inspect these log files manually
on the servers to observe the results, or use fabric to fetch the log information of all nodes, for example:
.. code-block:: bash
fab -H 172.16.0.5,172.16.0.6 tail-log
Shutting down the cluster
-------------------------
After the job has finished, don't forget to release the GPU cluster resources: check the servers you want to release and choose "Release"; the machines will be shut down and the resources released.
If you need to run a new job, you can directly use the previously saved image, start a new cluster, and begin training by following the earlier steps.
.. image:: src/release.png
.. _train_on_baidu_cloud_en:
Distributed Training on Baidu Cloud
=====================================
PaddlePaddle Fluid distributed training allows you to start distributed training without relying on cluster systems (such as MPI, Kubernetes).
This chapter will use `Baidu Cloud <https://cloud.baidu.com/>`_ as an example to show you how to perform large-scale distributed tasks in a cloud environment or even a cloud GPU environment.
Create a cluster template
---------------------------
Log in to Baidu Cloud Console, select BCC Service, and click "Create Instance". Select the region, and note that only some regions have GPU servers available.
After selecting an appropriate region, select the corresponding model and create an empty server, as shown below:
.. image:: src/create_gpu_machine.png
* In the operating system options, you can select the corresponding version according to your needs. Note that the CUDA version is selected based on the actual situation. Here we choose CUDA-9.2.
* In the example, the payment method is selected as post-paid, which means that as the machine is released, the charge will stop correspondingly, which is more cost-effective for running a one-time task.
After the machine is created successfully, execute the following command to install the paddlepaddle GPU version and related dependencies.
.. code-block:: bash
apt-get update && apt-get install -y python python-pip python-opencv
# Note: the Baidu cloud cuda-9.2 image does not have cudnn or nccl2 installed by default; they must be installed manually. If you intend to install them yourself, download them from the official websites.
wget -q "http://paddle-train-on-cloud.cdn.bcebos.com/libcudnn7_7.2.1.38-1+cuda9.2_amd64.deb"
wget -q "http://paddle-train-on-cloud.cdn.bcebos.com/nccl_2.2.13-1+cuda9.0_x86_64.txz"
dpkg -i libcudnn7_7.2.1.38-1+cuda9.2_amd64.deb
ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.7 /usr/lib/libcudnn.so
unxz nccl_2.2.13-1+cuda9.0_x86_64.txz
tar xf nccl_2.2.13-1+cuda9.0_x86_64.tar
cp -r nccl_2.2.13-1+cuda9.0_x86_64/lib/* /usr/lib
# Note: You may optionally use the following pip mirror to speed up the download (for users within China)
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple matplotlib==2.2.3
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple paddlepaddle-gpu==0.15.0.post97
After the installation is completed, use the following test program to test whether the current machine can run the GPU training program correctly. If an error is encountered, please fix the running environment problem according to the error message. To make it easier to start the GPU cluster, once the test program has executed successfully, select the current server and choose "Create Customized Image"; you can then select the configured image when you create the GPU cluster later.
.. image:: src/create_image.png
* test program:
.. code-block:: python
from __future__ import print_function
import paddle.fluid.core as core
import math
import os
import sys
import numpy
import paddle
import paddle.fluid as fluid
BATCH_SIZE = 64
PASS_NUM = 1
def loss_net(hidden, label):
prediction = fluid.layers.fc(input=hidden, size=10, act='softmax')
loss = fluid.layers.cross_entropy(input=prediction, label=label)
avg_loss = fluid.layers.mean(loss)
acc = fluid.layers.accuracy(input=prediction, label=label)
return prediction, avg_loss, acc
def conv_net(img, label):
conv_pool_1 = fluid.nets.simple_img_conv_pool(
input=img,
filter_size=5,
num_filters=20,
pool_size=2,
pool_stride=2,
act="relu")
conv_pool_1 = fluid.layers.batch_norm(conv_pool_1)
conv_pool_2 = fluid.nets.simple_img_conv_pool(
input=conv_pool_1,
filter_size=5,
num_filters=50,
pool_size=2,
pool_stride=2,
act="relu")
return loss_net(conv_pool_2, label)
def train(use_cuda):
if use_cuda and not fluid.core.is_compiled_with_cuda():
return
img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
prediction, avg_loss, acc = conv_net(img, label)
test_program = fluid.default_main_program().clone(for_test=True)
optimizer = fluid.optimizer.Adam(learning_rate=0.001)
optimizer.minimize(avg_loss)
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)
train_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.mnist.train(), buf_size=500),
batch_size=BATCH_SIZE)
test_reader = paddle.batch(
paddle.dataset.mnist.test(), batch_size=BATCH_SIZE)
feeder = fluid.DataFeeder(feed_list=[img, label], place=place)
exe.run(fluid.default_startup_program())
for pass_id in range(PASS_NUM):
for batch_id, data in enumerate(train_reader()):
acc_np, avg_loss_np = exe.run(fluid.default_main_program(),
feed=feeder.feed(data),
fetch_list=[acc, avg_loss])
if (batch_id + 1) % 10 == 0:
print(
'PassID {0:1}, BatchID {1:04}, Loss {2:2.2}, Acc {3:2.2}'.
format(pass_id, batch_id + 1,
float(avg_loss_np.mean()), float(acc_np.mean())))
if __name__ == '__main__':
train(True)
Create a cluster
------------------
After creating the image, you can use this configured image to create a GPU cluster, with as many GPU servers as your actual needs require. As an example, two GPU servers are started here; this includes the one created in the previous step, so only one new server is started.
Click "Create Instance" to select GPU servers with the same settings in the same region. Especially, the image you just created should be selected as the operating system.
.. image:: src/create_more_nodes.png
Write cluster task startup scripts
------------------------------------
In order to facilitate the launch of distributed training tasks on more GPU servers, we will use
`fabric <http://www.fabfile.org/>`_
as the cluster task launch management tool. You can choose another cluster framework you are familiar with, such as MPI or Kubernetes.
The method demonstrated in this example is only intended for simple cluster environments in which the servers can log in to each other through SSH.
To install fabric, execute:
.. code-block:: bash
pip install fabric
Suppose we have created two GPU servers whose IP addresses are :code:`172.16.0.5, 172.16.0.6` . On the first server,
create the training program file :code:`dist_train_demo.py` by downloading the code from
`here <https://raw.githubusercontent.com/PaddlePaddle/FluidDoc/develop/doc/fluid/user_guides/howto/training/src/dist_train_demo.py>`_ .
Then write the :code:`fabfile.py` script to control the parameter servers and trainers that start the training task on the different servers:
.. code-block:: python
from fabric import Group, task
endpoints = "172.16.0.5:6173,172.16.0.6:6173"
port = "6173"
pservers = 2
trainers = 2
hosts = []
eps = []
for ep in endpoints.split(","):
eps.append(ep)
hosts.append(ep.split(":")[0])
def start_server(c):
current_endpoint = "%s:%s" % (c.host, port)
trainer_id = hosts.index(c.host)
cmd = "python /root/work/dist_train_demo.py pserver %s %s %d %d &> /root/work/server.log.%s &" % (
endpoints, current_endpoint, trainer_id, trainers, c.host)
c.run(cmd)
def start_trainer(c):
current_endpoint = "%s:%s" % (c.host, port)
trainer_id = hosts.index(c.host)
cmd = "python /root/work/dist_train_demo.py trainer %s %s %d %d &> /root/work/trainer.log.%s &" % (
endpoints, current_endpoint, trainer_id, trainers, c.host)
c.run(cmd)
@task
def start(c):
c.connect_kwargs.password = "work@paddle123"
c.run("mkdir -p /root/work")
c.put("dist_train_demo.py", "/root/work")
start_server(c)
start_trainer(c)
@task
def tail_log(c):
c.connect_kwargs.password = "work@paddle123"
c.run("tail /root/work/trainer.log.%s" % c.host)
Save the above code to :code:`fabfile.py` and execute
.. code-block:: bash
fab -H 172.16.0.5,172.16.0.6 start
to start a distributed training task. This task launches two pserver processes and two trainer processes across the two GPU servers and begins training.
Get distributed training results
---------------------------------
The example task writes logs under :code:`/root/work`, named
:code:`pserver.log.[IP]` and :code:`trainer.log.[IP]` respectively; you can view these log files manually
on the servers to observe the results, or use fabric to obtain the log information of all nodes, for example:
.. code-block:: bash
fab -H 172.16.0.5,172.16.0.6 tail-log
Terminate the cluster
------------------------
After the task is executed, don't forget to release the GPU cluster resources: first select the servers you want to release, then select "Release" to shut down the machines and release the resources.
If you need to perform a new task, you can use the previously saved image directly, start a new cluster, and start the training by following the previous steps.
.. image:: src/release.png