diff --git a/doc/howto/usage/cluster/fluid_cluster_train_en.md b/doc/howto/usage/cluster/fluid_cluster_train_en.md
index a64004a7c4ea12bc0d949d7f11f3e26af62bf912..11904a6f71bb6ce37417aeffb8e408ec65961b12 100644
--- a/doc/howto/usage/cluster/fluid_cluster_train_en.md
+++ b/doc/howto/usage/cluster/fluid_cluster_train_en.md
@@ -2,27 +2,27 @@
 
 ## Introduction
 
-In this article, we'll explain how to config and run distributed training jobs with PaddlePaddle Fluid in a bare metal cluster.
+In this article, we'll explain how to configure and run distributed training jobs with PaddlePaddle Fluid in a bare metal cluster.
 
 ## Preparations
 
-### Get your cluster ready
+### Getting the cluster ready
 
-Prepare your computer nodes in the cluster. Nodes in this cluster can be of any specification that runs PaddlePaddle, and with a unique IP address assigned to it. Make sure they can communicate with each other.
+Prepare the compute nodes in the cluster. Nodes in this cluster can be of any specification that runs PaddlePaddle, and with a unique IP address assigned to it. Make sure they can communicate to each other.
 
 ### Have PaddlePaddle installed
 
 PaddlePaddle must be installed on all nodes. If you have GPU cards on your nodes, be sure to properly install drivers and CUDA libraries.
 
-PaddlePaddle build and installation guide can be found from [here](http://www.paddlepaddle.org/docs/develop/documentation/en/getstarted/build_and_install/index_en.html).
+PaddlePaddle build and installation guide can be found [here](http://www.paddlepaddle.org/docs/develop/documentation/en/getstarted/build_and_install/index_en.html).
 
-### Update training script
+### Update the training script
 
 #### Non-cluster training script
 
 Let's take [Deep Learning 101](http://www.paddlepaddle.org/docs/develop/book/01.fit_a_line/index.html)'s first chapter: "fit a line" as an example.
 
-This demo's non-cluster version with fluid API is as follows:
+The non-cluster version of this demo with fluid API is as follows:
 
 ``` python
 import paddle.v2 as paddle
@@ -65,25 +65,25 @@ for pass_id in range(PASS_NUM):
 exit(1)
 ```
 
-We created a simple fully connected neural networks training program and handed it to the fluid executor to run for 100 passes.
+We created a simple fully-connected neural network training program and handed it to the fluid executor to run for 100 passes.
 
-Now let's try to convert it to a distributed version to run in a cluster.
+Now let's try to convert it to a distributed version to run on a cluster.
 
 #### Introducing parameter server
 
-As you see from the non-cluster version of training script, there is only one role in it: the trainer, who does the computing as well as holding parameters. In cluster training, since multi-trainers are working on the same task, they need one centralized place to hold and distribute parameters. This centralized place is called the Parameter Server in PaddlePaddle.
+As we can see from the non-cluster version of training script, there is only one role in the script: the trainer, that performs the computing as well as holds the parameters. In cluster training, since multi-trainers are working on the same task, they need one centralized place to hold and distribute parameters. This centralized place is called the Parameter Server in PaddlePaddle.
 
-![parameter server architect](src/trainer.png)
+![parameter server architecture](src/trainer.png)
 
-Parameter Server in fluid does not only hold parameters but is also assigned with a part of the program. Trainers communicate with parameter servers via send/receive OPs. For more tech detail, please refer to this [document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/dist_refactor/distributed_architecture.md).
+Parameter Server in fluid not only holds the parameters but is also assigned with a part of the program. Trainers communicate with parameter servers via send/receive OPs. For more technical details, please refer to [this document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/dist_refactor/distributed_architecture.md).
 
-Now we need to create program for both trainers and parameter servers, the question is how?
+Now we need to create programs for both: trainers and parameter servers, the question is how?
 
 #### Slice the program
 
-Fluid provides a tool called "Distribute Transpiler" to automatically convert the non-cluster program into cluster program.
+Fluid provides a tool called "Distributed Transpiler" that automatically converts the non-cluster program into cluster program.
 
-The idea behind this tool is to find optimize OPs and gradient parameters, slice the program into 2 pieces and connect them with send/receive OP.
+The idea behind this tool is to find the optimize OPs and gradient parameters, slice the program into 2 pieces and connect them with send/receive OP.
 
 Optimize OPs and gradient parameters can be found from the return values of optimizer's minimize function.
 
@@ -94,9 +94,9 @@ To put them together:
 
 optimize_ops, params_grads = sgd_optimizer.minimize(avg_cost) #get optimize OPs and gradient parameters
 
-t = fluid.DistributeTranspiler() # create transpiler instance
+t = fluid.DistributeTranspiler() # create the transpiler instance
 # slice the program into 2 pieces with optimizer_ops and gradient parameters list, as well as pserver_endpoints, which is a comma separated list of [IP:PORT] and number of trainers
-t.transpile(optimize_ops, params_grads, pservers=pserver_endpoints, trainers=2)
+t.transpile(optimize_ops, params_grads, pservers=pserver_endpoints, trainers=2)
 
 ... #create executor
 
@@ -119,7 +119,7 @@ for pass_id in range(100):
 
 ### E2E demo
 
-Please find the complete demo from [here](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/fluid/tests/book_distribute/notest_dist_fit_a_line.py). In parameter server node run this in the command line:
+Please find the complete demo from [here](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/fluid/tests/book_distribute/notest_dist_fit_a_line.py). In parameter server node run the following in the command line:
 
 ``` bash
 PSERVERS=192.168.1.2:6174 SERVER_ENDPOINT=192.168.1.2:6174 TRAINING_ROLE=PSERVER python notest_dist_fit_a_line.py
@@ -129,12 +129,12 @@ PSERVERS=192.168.1.2:6174 SERVER_ENDPOINT=192.168.1.2:6174 TRAINING_ROLE=PSERVER
 
 Wait until the prompt `Server listening on 192.168.1.2:6174`
 
-Then in 2 of your trainer node run this:
+Then in 2 of your trainer nodes run this:
 
 ``` bash
 PSERVERS=192.168.1.2:6174 SERVER_ENDPOINT=192.168.1.2:6174 TRAINING_ROLE=TRAINER python notest_dist_fit_a_line.py
 ```
 
-*the reason you need to run this command twice in 2 nodes is: in the script we set the trainer count to be 2. You can change this setting on line 50*
+*the reason you need to run this command twice in 2 nodes is because: in the script we set the trainer count to be 2. You can change this setting on line 50*
 
 Now you have 2 trainers and 1 parameter server up and running.
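
Reviewer note: the E2E commands in this patch drive one script through three environment variables (`PSERVERS`, `SERVER_ENDPOINT`, `TRAINING_ROLE`). The sketch below shows one way a script could read and sanity-check them before calling the transpiler. It is an illustration only and is not taken from `notest_dist_fit_a_line.py`: the function name `read_cluster_config`, the `TRAINER` default, and the validation rules are all assumptions, and only the Python standard library is used.

``` python
import os


def read_cluster_config(env=None):
    """Read the demo's three variables from a mapping (os.environ by default).

    Hypothetical helper: the name, defaults and checks are assumptions
    made for illustration, not PaddlePaddle behaviour.
    """
    env = os.environ if env is None else env

    # Assumed default: a process is a trainer unless told otherwise.
    role = env.get("TRAINING_ROLE", "TRAINER")
    if role not in ("PSERVER", "TRAINER"):
        raise ValueError("TRAINING_ROLE must be PSERVER or TRAINER, got %r" % role)

    # PSERVERS is a comma separated list of IP:PORT endpoints.
    endpoints = [e.strip() for e in env.get("PSERVERS", "").split(",") if e.strip()]
    if not endpoints:
        raise ValueError("PSERVERS must list at least one IP:PORT endpoint")
    for endpoint in endpoints:
        host, sep, port = endpoint.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError("malformed endpoint: %r" % endpoint)

    # A parameter server needs to know which of the endpoints is its own.
    current = env.get("SERVER_ENDPOINT")
    if role == "PSERVER" and current not in endpoints:
        raise ValueError("SERVER_ENDPOINT must be one of PSERVERS on a parameter server")

    return {"role": role, "pservers": endpoints, "current_endpoint": current}
```

With a check like this, a mistyped endpoint or role fails immediately on the node where it was entered, rather than showing up later as a trainer that cannot reach its parameter server. The comma-joined endpoint string is what the patch passes as `pservers=pserver_endpoints` to `t.transpile`.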