Unverified commit b945760a authored by barriery, committed by GitHub

add cluster_quick_start_en.md && fix cluster_quick_start.rst (#1931)

* add cluster_quick_start_en.md && fix cluster_quick_start.rst

* translate cluster_quick_start.rst

* update the new version link for the document
Parent c2fbc0f6
@@ -14,7 +14,7 @@
*
-  [x] Paddle Fluid installed successfully. If not, please refer to the `Quick Start <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/beginners_guide/quick_start_cn.html>`_
+  [x] Paddle Fluid installed successfully. If not, please refer to the `Quick Start <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.7/beginners_guide/quick_start_cn.html>`_
*
   [x] Learn the most basic single-node training method; please refer to the single-card training described in `Single-node training <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/user_guides/howto/training/single_node.html>`_
@@ -113,7 +113,7 @@
main_function(args.is_local)
-  * Note: The IO method used in the example is dataset; for detailed documentation and usage, please refer to the `Dataset API <hhttps://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/api_cn/dataset_cn.html>`_ . For the ``train_from_dataset`` interface used in the example, please refer to the `Executor API <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/api_cn/executor_cn.html>`_ . ``from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet`` in the example introduces the parameter server architecture for distributed training; for more options and examples of the Fleet API, please refer to the `Fleet API <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/user_guides/howto/training/fleet_api_howto_cn.html>`_
+  * Note: The IO method used in the example is dataset; for detailed documentation and usage, please refer to the `Dataset API <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.7/api_cn/dataset_cn.html>`_ . For the ``train_from_dataset`` interface used in the example, please refer to the `Executor API <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.7/api_cn/executor_cn.html>`_ . ``from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet`` in the example introduces the parameter server architecture for distributed training; for more options and examples of the Fleet API, please refer to the `Fleet API <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.6/user_guides/howto/training/fleet_api_howto_cn.html>`_
Start command for single-node training
.. _cluster_quick_start_en:

Quick start for distributed training
====================================

Distributed training with Fleet API
-----------------------------------

Since Paddle Fluid `Release 1.5.1 <https://github.com/PaddlePaddle/Paddle/releases/tag/v1.5.1>`__, it is officially recommended to use the Fleet API for distributed training. For an introduction to the Fleet API, please refer to the `Fleet Design Doc <https://github.com/PaddlePaddle/Fleet>`__.
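
Before diving into the full example, the sketch below shows the overall control flow of a Fleet-based parameter-server program. It is only an outline of the complete click-through-rate example later in this page; ``build_model`` here is a stand-in for your own network-building code, not a Paddle API:

.. code:: python

    # Outline only: see the complete example below for a real model and IO setup.
    import paddle.fluid as fluid
    import paddle.fluid.incubate.fleet.base.role_maker as role_maker
    from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
    from paddle.fluid.transpiler.distribute_transpiler import DistributeTranspilerConfig

    def build_model():
        # Stand-in network (a tiny linear regression); replace with your own.
        x = fluid.layers.data(name="x", shape=[13], dtype="float32")
        y = fluid.layers.data(name="y", shape=[1], dtype="float32")
        y_pred = fluid.layers.fc(input=x, size=1)
        return fluid.layers.mean(fluid.layers.square_error_cost(input=y_pred, label=y))

    role = role_maker.PaddleCloudRoleMaker()   # reads the node role from env vars
    fleet.init(role)

    loss = build_model()
    strategy = DistributeTranspilerConfig()
    optimizer = fleet.distributed_optimizer(fluid.optimizer.SGD(1e-4), strategy)
    optimizer.minimize(loss)

    if fleet.is_server():                      # parameter server branch
        fleet.init_server()
        fleet.run_server()
    elif fleet.is_worker():                    # trainer branch
        fleet.init_worker()
        exe = fluid.Executor(fluid.CPUPlace())
        exe.run(fluid.default_startup_program())
        # ... run the training loop here, then release pserver-side resources
        exe.close()
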
Preparation
~~~~~~~~~~~

In this article, we'll show you how to quickly start a PaddlePaddle distributed training task in a cluster. Before you start, please complete the following preparatory work:

1. Prepare a connected training cluster. Here we use 4 training nodes, with hostnames of the form ``*.paddlepaddle.com``; you can modify them according to your actual situation.
2. Make sure you have read :ref:`install_steps` and can run PaddlePaddle on all nodes of the cluster.

- [x] Install Paddle Fluid. If not already installed, please refer to
`Beginner’s
Guide <https://www.paddlepaddle.org.cn/documentation/docs/en/1.7/beginners_guide/index_en.html>`__.
- [x] Master the most basic single node training method. Please refer
to the single card training described in `Single-node
training <https://www.paddlepaddle.org.cn/documentation/docs/en/1.5/user_guides/howto/training/single_node_en.html>`__.
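
To verify the installation, you can run a quick check on every node of the cluster. This is a minimal sanity-check sketch; it assumes a Paddle Fluid 1.x install, where ``fluid.install_check.run_check()`` is available:

.. code:: python

    # Minimal per-node sanity check (assumes Paddle Fluid 1.x).
    import paddle
    import paddle.fluid as fluid

    print(paddle.__version__)        # the installed Paddle version
    fluid.install_check.run_check()  # reports whether Fluid runs correctly on this machine
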
Click-through rate prediction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here, we will use a simple example, the click-through rate (CTR) prediction task, to illustrate how to configure the Fleet API for distributed training, and give an example of using a single-node environment to simulate the distributed environment. The source code of the example comes from `CTR with Fleet <https://github.com/PaddlePaddle/Fleet/tree/develop/examples/ctr>`__.

To make it easier to learn, the example given here mixes the code for single-node and multi-node training; you can start single-node or multi-node tasks through different startup commands. For obtaining the data and the logic of data preprocessing, please refer to the source code and description of `CTR with Fleet <https://github.com/PaddlePaddle/Fleet/tree/develop/examples/ctr>`__. Save the following code as ``train.py``.

.. code:: python

    from __future__ import print_function
    from args import parse_args
    import os
    import sys
    import paddle
    import paddle.fluid as fluid
    from network_conf import ctr_dnn_model_dataset
    import paddle.fluid.incubate.fleet.base.role_maker as role_maker
    from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
    from paddle.fluid.transpiler.distribute_transpiler import DistributeTranspilerConfig

    dense_feature_dim = 13
    sparse_feature_dim = 10000001
    batch_size = 100
    thread_num = 10
    embedding_size = 10
    args = parse_args()

    def main_function(is_local):
        # common code for local training and distributed training
        dense_input = fluid.layers.data(
            name="dense_input", shape=[dense_feature_dim], dtype='float32')
        sparse_input_ids = [
            fluid.layers.data(name="C" + str(i), shape=[1], lod_level=1,
                              dtype="int64") for i in range(1, 27)]
        label = fluid.layers.data(name="label", shape=[1], dtype="int64")

        # dataset-based IO: data is read and preprocessed by a pipe command
        dataset = fluid.DatasetFactory().create_dataset()
        dataset.set_use_var([dense_input] + sparse_input_ids + [label])
        pipe_command = "python criteo_reader.py %d" % sparse_feature_dim
        dataset.set_pipe_command(pipe_command)
        dataset.set_batch_size(batch_size)
        dataset.set_thread(thread_num)
        whole_filelist = ["raw_data/part-%d" % x
                          for x in range(len(os.listdir("raw_data")))]
        dataset.set_filelist(whole_filelist)

        loss, auc_var, batch_auc_var = ctr_dnn_model_dataset(
            dense_input, sparse_input_ids, label, embedding_size,
            sparse_feature_dim)

        exe = fluid.Executor(fluid.CPUPlace())

        def train_loop(epoch=20):
            for i in range(epoch):
                exe.train_from_dataset(program=fluid.default_main_program(),
                                       dataset=dataset,
                                       fetch_list=[auc_var],
                                       fetch_info=["auc"],
                                       debug=False)

        # local training
        def local_train():
            optimizer = fluid.optimizer.SGD(learning_rate=1e-4)
            optimizer.minimize(loss)
            exe.run(fluid.default_startup_program())
            train_loop()

        # distributed training
        def dist_train():
            role = role_maker.PaddleCloudRoleMaker()
            fleet.init(role)
            strategy = DistributeTranspilerConfig()
            strategy.sync_mode = False

            optimizer = fluid.optimizer.SGD(learning_rate=1e-4)
            optimizer = fleet.distributed_optimizer(optimizer, strategy)
            optimizer.minimize(loss)

            if fleet.is_server():
                fleet.init_server()
                fleet.run_server()
            elif fleet.is_worker():
                fleet.init_worker()
                exe.run(fluid.default_startup_program())
                train_loop()

        if is_local:
            local_train()
        else:
            dist_train()

    if __name__ == '__main__':
        main_function(args.is_local)

- Note: The IO method used in this example is dataset; please refer to the `Dataset API <https://www.paddlepaddle.org.cn/documentation/docs/en/1.7/api/dataset.html>`__ for the documentation and usage. For the ``train_from_dataset`` interface, please refer to the `Executor API <https://www.paddlepaddle.org.cn/documentation/docs/en/1.7/api/executor.html>`__. ``from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet`` in this example means introducing the parameter server architecture for distributed training; refer to the `Fleet API <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.6/user_guides/howto/training/fleet_api_howto_cn.html>`__ for more about the options and examples of the Fleet API.

Start command of single node training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: bash

    python train.py --is_local 1

Start command of single machine simulation distributed training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here we use ``launch_ps``, a built-in launcher of Paddle, with which users can specify the numbers of workers and servers to start the parameter server task.

.. code:: bash

    python -m paddle.distributed.launch_ps --worker_num 2 --server_num 2 train.py

Example code
-------------

Besides the Fleet-based example above, let's use a very simple linear regression model to explain how to start a distributed training task with 2 pserver nodes and 2 trainer nodes by hand. You can save this code as ``dist_train.py``.

.. code:: python

    import os
    import paddle
    import paddle.fluid as fluid

    # train reader
    EPOCH_NUM = 30
    BATCH_SIZE = 8

    train_reader = paddle.batch(
        paddle.reader.shuffle(
            paddle.dataset.uci_housing.train(), buf_size=500),
        batch_size=BATCH_SIZE)

    def train():
        y = fluid.layers.data(name='y', shape=[1], dtype='float32')
        x = fluid.layers.data(name='x', shape=[13], dtype='float32')
        y_predict = fluid.layers.fc(input=x, size=1, act=None)

        loss = fluid.layers.square_error_cost(input=y_predict, label=y)
        avg_loss = fluid.layers.mean(loss)
        opt = fluid.optimizer.SGD(learning_rate=0.001)
        opt.minimize(avg_loss)

        place = fluid.CPUPlace()
        feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
        exe = fluid.Executor(place)

        # fetch distributed training environment setting
        training_role = os.getenv("PADDLE_TRAINING_ROLE", None)
        port = os.getenv("PADDLE_PSERVER_PORT", "6174")
        pserver_ips = os.getenv("PADDLE_PSERVER_IPS", "")
        trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
        eplist = []
        for ip in pserver_ips.split(","):
            eplist.append(':'.join([ip, port]))
        pserver_endpoints = ",".join(eplist)
        trainers = int(os.getenv("PADDLE_TRAINERS"))
        current_endpoint = os.getenv("PADDLE_CURRENT_IP", "") + ":" + port

        t = fluid.DistributeTranspiler()
        t.transpile(
            trainer_id = trainer_id,
            pservers = pserver_endpoints,
            trainers = trainers)

        if training_role == "PSERVER":
            pserver_prog = t.get_pserver_program(current_endpoint)
            startup_prog = t.get_startup_program(current_endpoint, pserver_prog)
            exe.run(startup_prog)
            exe.run(pserver_prog)
        elif training_role == "TRAINER":
            trainer_prog = t.get_trainer_program()
            exe.run(fluid.default_startup_program())

            for epoch in range(EPOCH_NUM):
                for batch_id, batch_data in enumerate(train_reader()):
                    avg_loss_value, = exe.run(trainer_prog,
                                              feed=feeder.feed(batch_data),
                                              fetch_list=[avg_loss])
                    if (batch_id + 1) % 10 == 0:
                        print("Epoch: {0}, Batch: {1}, loss: {2}".format(
                            epoch, batch_id, avg_loss_value[0]))
            # destroy the resources of the current trainer node on the pserver nodes
            exe.close()
        else:
            raise AssertionError("PADDLE_TRAINING_ROLE should be one of [TRAINER, PSERVER]")

    train()

Environment Variables
------------------------------------

When starting a distributed training task, different environment variables are used to represent different node roles, as follows:

.. list-table::
   :header-rows: 1

   * - Environment Variable
     - Data Type
     - Example
     - Description
   * - :code:`PADDLE_TRAINING_ROLE`
     - str
     - :code:`PSERVER,TRAINER`
     - role of the current training node
   * - :code:`PADDLE_PSERVER_IPS`
     - str
     - :code:`ps0.paddlepaddle.com,ps1.paddlepaddle.com`
     - IP addresses or hostnames of all pserver nodes in the distributed training task, separated by ","
   * - :code:`PADDLE_PSERVER_PORT`
     - int
     - 6174
     - port that the pserver processes listen on
   * - :code:`PADDLE_TRAINERS`
     - int
     - 2
     - number of trainer nodes in the distributed training task
   * - :code:`PADDLE_CURRENT_IP`
     - str
     - :code:`ps0.paddlepaddle.com`
     - IP address or hostname of the current pserver node
   * - :code:`PADDLE_TRAINER_ID`
     - int
     - 0
     - ID of the current trainer node (unique), in the range [0, PADDLE_TRAINERS)
**Note:** Environment variables are just one way to pass runtime information; in practical tasks, you can use command-line parameters instead.
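
For example, here is a sketch of that command-line alternative using ``argparse``; the flag names are our own illustration, not a fixed Paddle convention:

.. code:: python

    # Illustrative only: pass the same role information via command-line flags.
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--role", choices=["PSERVER", "TRAINER"])
    parser.add_argument("--pserver_ips", default="")       # comma-separated list
    parser.add_argument("--pserver_port", default="6174")
    parser.add_argument("--trainers", type=int, default=2)
    parser.add_argument("--trainer_id", type=int, default=0)
    args = parser.parse_args()

    # Build the endpoint list exactly as the environment-variable code does.
    pserver_endpoints = ",".join(
        "%s:%s" % (ip, args.pserver_port) for ip in args.pserver_ips.split(","))
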

API related to Distributed Training
-----------------------------------

DistributeTranspiler
~~~~~~~~~~~~~~~~~~~~~~

The machines in distributed training tasks based on the pserver-trainer architecture take one of two roles: Parameter Server (pserver) or trainer. In Fluid, users only need to configure the network as for single-node training; the ``DistributeTranspiler`` module then automatically rewrites that single-node program into the programs that the pserver and the trainer each need to run, based on the role of the current training node:

.. code:: python

    t = fluid.DistributeTranspiler()
    t.transpile(
        trainer_id = trainer_id,
        pservers = pserver_endpoints,
        trainers = trainers)

    if PADDLE_TRAINING_ROLE == "TRAINER":
        # fetch the trainer program and execute it
        trainer_prog = t.get_trainer_program()
        ...
    elif PADDLE_TRAINING_ROLE == "PSERVER":
        # fetch the pserver program and execute it
        pserver_prog = t.get_pserver_program(current_endpoint)
        ...

Exe.close()
~~~~~~~~~~~~~~

The status information of all trainer nodes is saved on the pserver nodes. When a trainer finishes training, ``exe.close()`` should be called to notify all pserver nodes to release the resources of that trainer node:

.. code:: python

    exe = fluid.Executor(fluid.CPUPlace())
    # training process ...
    exe.close() # notify PServer to destroy the resource

Note: every trainer needs to call ``exe.close()`` when it finishes training.

Start a Distributed Training Task
----------------------------------

.. list-table::
   :header-rows: 1

   * - Start Node
     - Start Command
     - Description
   * - ps0.paddlepaddle.com
     - :code:`PADDLE_TRAINING_ROLE=PSERVER PADDLE_CURRENT_IP=ps0.paddlepaddle.com PADDLE_PSERVER_IPS=ps0.paddlepaddle.com,ps1.paddlepaddle.com PADDLE_TRAINERS=2 PADDLE_PSERVER_PORT=6174 python dist_train.py`
     - Start pserver node
   * - ps1.paddlepaddle.com
     - :code:`PADDLE_TRAINING_ROLE=PSERVER PADDLE_CURRENT_IP=ps1.paddlepaddle.com PADDLE_PSERVER_IPS=ps0.paddlepaddle.com,ps1.paddlepaddle.com PADDLE_TRAINERS=2 PADDLE_PSERVER_PORT=6174 python dist_train.py`
     - Start pserver node
   * - trainer0.paddlepaddle.com
     - :code:`PADDLE_TRAINING_ROLE=TRAINER PADDLE_PSERVER_IPS=ps0.paddlepaddle.com,ps1.paddlepaddle.com PADDLE_TRAINERS=2 PADDLE_TRAINER_ID=0 PADDLE_PSERVER_PORT=6174 python dist_train.py`
     - Start trainer node 0
   * - trainer1.paddlepaddle.com
     - :code:`PADDLE_TRAINING_ROLE=TRAINER PADDLE_PSERVER_IPS=ps0.paddlepaddle.com,ps1.paddlepaddle.com PADDLE_TRAINERS=2 PADDLE_TRAINER_ID=1 PADDLE_PSERVER_PORT=6174 python dist_train.py`
     - Start trainer node 1

The task running log can be viewed in the ``logs`` directory under the working directory. Once you can simulate distributed training on a single machine, you can move on to true multi-node distributed training. We recommend referring directly to the `example of running a distributed task on Baidu Cloud <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/user_guides/howto/training/deploy_ctr_on_baidu_cloud_cn.html>`__.