@@ -17,12 +17,16 @@ A training job will be created once user asks Paddle cloud to train a model. The
1. the *master process*, which dispatches tasks to
1. one or more *trainer processes*, which run distributed training and synchronize gradients/models via
1. one or more *parameter server processes*, where each holds a shard of the global model, receives uploaded gradients from every *trainer process*, and runs the optimization functions to update its shard of the parameters.
Their relation is illustrated in the following graph:
<img src="src/paddle-model-sharding.png"/>
By coordinating these processes, PaddlePaddle supports both Synchronous Stochastic Gradient Descent (sync SGD) and Asynchronous Stochastic Gradient Descent (async SGD) for training user-defined neural network topologies.
When training with sync SGD, the parameter servers wait for all trainers to finish computing and uploading their gradients for a mini-batch, and only then send the updated parameters back; training cannot proceed until every trainer has received the updated parameters. This creates a synchronization point between trainers. When training with async SGD, each trainer uploads its gradients and downloads new parameters independently, without synchronizing with other trainers. Async SGD is faster in terms of time per pass, but the gradients are noisier because trainers are likely to be computing against a stale model.
### Master Process
The master process will:
...
...
@@ -31,7 +35,7 @@ The master process will:
- Keep track of training progress on the dataset with [task queue](#task-queue). A training job iterates over the dataset one full pass at a time before starting the next pass.
#### Task
A task is a data shard to be trained. The total number of tasks will be much bigger than the total number of trainers. The number of data instances inside a task will be much bigger than the mini-batch size.
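As a rough illustration (not the actual master implementation), the Go sketch below shows one way a dataset could be cut into tasks; the `Task` type and the `partition` helper are hypothetical names used only for this example:

```go
package main

import "fmt"

// Task is a shard of the dataset, identified by a half-open
// range of data-instance indices [Begin, End).
type Task struct {
	ID    int
	Begin int // index of the first data instance in this shard
	End   int // one past the last data instance in this shard
}

// partition cuts a dataset of `total` instances into tasks of at most
// `taskSize` instances each. taskSize should be much larger than the
// mini-batch size, and the resulting number of tasks much larger than
// the number of trainers.
func partition(total, taskSize int) []Task {
	var tasks []Task
	for begin, id := 0, 0; begin < total; begin, id = begin+taskSize, id+1 {
		end := begin + taskSize
		if end > total {
			end = total
		}
		tasks = append(tasks, Task{ID: id, Begin: begin, End: end})
	}
	return tasks
}

func main() {
	// e.g. 1,000,000 instances cut into 16,384-instance tasks.
	tasks := partition(1000000, 16384)
	fmt.Println("number of tasks:", len(tasks))
}
```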
...
...
@@ -78,7 +82,7 @@ The communication pattern between the trainers and the parameter servers depends
The parameter servers wait for all trainers to finish the n-th mini-batch computation and send their gradients before broadcasting new parameters to every trainer. Every trainer waits for the new parameters before starting the (n+1)-th mini-batch.
There is no synchronization between different trainers, and a parameter server updates its parameters as soon as it receives a new gradient:
...
...
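To make the two update rules concrete, here is a minimal in-memory Go sketch of a single parameter-server shard. The `PServer` type and its fields are illustrative assumptions, not the real PaddlePaddle pserver, and the broadcast of updated parameters back to the trainers is omitted:

```go
package main

import (
	"fmt"
	"sync"
)

// PServer holds one shard of the model parameters (illustrative only).
type PServer struct {
	mu          sync.Mutex
	params      []float64
	lr          float64
	syncMode    bool      // true: sync SGD, false: async SGD
	numTrainers int
	pending     []float64 // accumulated gradients (sync SGD only)
	reported    int       // trainers that reported this mini-batch
}

// SendGrad is called by a trainer after computing gradients for one
// mini-batch. Under async SGD the update is applied immediately; under
// sync SGD it is accumulated and applied only after every trainer has
// reported.
func (s *PServer) SendGrad(grad []float64) {
	s.mu.Lock()
	defer s.mu.Unlock()

	if !s.syncMode {
		for i, g := range grad {
			s.params[i] -= s.lr * g
		}
		return
	}
	if s.pending == nil {
		s.pending = make([]float64, len(s.params))
	}
	for i, g := range grad {
		s.pending[i] += g
	}
	s.reported++
	if s.reported == s.numTrainers {
		for i, g := range s.pending {
			s.params[i] -= s.lr * g / float64(s.numTrainers)
		}
		s.pending, s.reported = nil, 0
	}
}

func main() {
	ps := &PServer{params: make([]float64, 4), lr: 0.01, syncMode: true, numTrainers: 2}
	ps.SendGrad([]float64{1, 1, 1, 1})
	ps.SendGrad([]float64{3, 3, 3, 3})
	fmt.Println(ps.params) // updated once, with the averaged gradient
}
```

Under sync SGD the accumulated gradient is averaged and applied once per mini-batch, which is exactly the synchronization point described above; under async SGD every `SendGrad` call updates the shard immediately.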
@@ -118,8 +122,6 @@ When the master is started by the Kubernetes, it executes the following steps at
1. Watches the trainer prefix keys `/trainer/` on etcd to find the live trainers.
1. Starts dispatching the tasks to the trainers, and updates the task queue using an etcd transaction to ensure the lock is held during the update.
The master process will kill itself if its etcd lease expires.
If the master process dies for any reason, Kubernetes will restart it. It will be online again with all state recovered from etcd within a few minutes.
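A hedged sketch of the etcd interactions described above, using the etcd v3 Go client (`go.etcd.io/etcd/client/v3`); the key `/master`, the TTL value, and the endpoint address are assumptions for illustration, and the task-queue transaction is omitted:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // assumed etcd endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// Register the master under a lease; the design says the master
	// kills itself if the lease expires.
	lease, err := cli.Grant(ctx, 10) // 10-second TTL (assumed)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := cli.Put(ctx, "/master", "addr-of-master", clientv3.WithLease(lease.ID)); err != nil {
		log.Fatal(err)
	}
	keepAlive, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		log.Fatal(err)
	}
	go func() {
		for range keepAlive {
			// lease renewed
		}
		log.Fatal("etcd lease expired; master exits")
	}()

	// Watch the trainer prefix to discover live trainers.
	for resp := range cli.Watch(ctx, "/trainer/", clientv3.WithPrefix()) {
		for _, ev := range resp.Events {
			switch ev.Type {
			case clientv3.EventTypePut:
				log.Printf("trainer joined: %s", ev.Kv.Key)
				// dispatch a task to this trainer here
			case clientv3.EventTypeDelete:
				log.Printf("trainer lost: %s", ev.Kv.Key)
				// move its task back to the TODO queue here
			}
		}
	}
}
```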
### Trainer Process
...
...
@@ -132,6 +134,8 @@ When the trainer is started by the Kubernetes, it executes the following steps a
If the trainer's etcd lease expires, it will try to set the key `/trainer/<unique ID>` again so that the master process can discover it again.
When a trainer fails, Kubernetes will try to restart it. The recovered trainer will fetch tasks from the TODO queue and continue training.
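The registration-and-re-registration behaviour can be sketched with the etcd v3 Go client as below; the trainer ID, the address, and the TTL are made up for the example:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// register writes /trainer/<unique ID> under a fresh lease and keeps the
// lease alive. It returns when the lease can no longer be renewed, so the
// caller can simply register again, as the design requires.
func register(ctx context.Context, cli *clientv3.Client, id, addr string) error {
	lease, err := cli.Grant(ctx, 10) // 10-second TTL (assumed)
	if err != nil {
		return err
	}
	if _, err := cli.Put(ctx, "/trainer/"+id, addr, clientv3.WithLease(lease.ID)); err != nil {
		return err
	}
	ch, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		return err
	}
	for range ch {
		// lease renewed; the key stays visible to the master
	}
	return nil // channel closed: lease expired or connection lost
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // assumed etcd endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Re-register whenever the lease expires, so the master can
	// rediscover this trainer.
	for {
		if err := register(context.Background(), cli, "trainer-0", "10.0.0.7:7164"); err != nil {
			log.Print("register failed, retrying: ", err)
		}
		time.Sleep(time.Second)
	}
}
```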
### Parameter Server Process
When the parameter server is started by Kubernetes, it executes the following steps at startup:
...
...
@@ -140,11 +144,11 @@ When the parameter server is started by Kubernetes, it executes the following st
1. Search through etcd keys `/ps/<index>` (`/ps/0`, `/ps/1`, ...) to find the first non-existent key whose index is smaller than the total number of parameter servers. Set the key using a transaction to avoid concurrent writes, as shown in the sketch at the end of this section. The parameter server's index is inferred from the key name.
The desired number of parameter servers is 3:
<img src="src/paddle-ps-0.png"/>
The third parameter server joined:
<img src="src/paddle-ps-1.png"/>
1. The parameter server can load parameters if there are already saved parameters in the save path (inferred from its index).
...
...
@@ -153,6 +157,13 @@ When the parameter server is started by Kubernetes, it executes the following st
If the parameter server's etcd lease expires, the parameter server will kill itself.
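Below is a sketch (assuming the etcd v3 Go client) of the index-claiming transaction described in the startup steps above; the address, the TTL, and the helper name `claimIndex` are illustrative only:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"strconv"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// claimIndex scans /ps/0, /ps/1, ... and claims the first index that does
// not exist yet, using a transaction so two parameter servers cannot claim
// the same index concurrently.
func claimIndex(ctx context.Context, cli *clientv3.Client, desired int, addr string, leaseID clientv3.LeaseID) (int, error) {
	for {
		for i := 0; i < desired; i++ {
			key := "/ps/" + strconv.Itoa(i)
			resp, err := cli.Txn(ctx).
				If(clientv3.Compare(clientv3.CreateRevision(key), "=", 0)). // key does not exist
				Then(clientv3.OpPut(key, addr, clientv3.WithLease(leaseID))).
				Commit()
			if err != nil {
				return 0, err
			}
			if resp.Succeeded {
				return i, nil // this pserver's index
			}
		}
		time.Sleep(time.Second) // all slots taken; wait and retry
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // assumed etcd endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()
	lease, err := cli.Grant(ctx, 10) // the pserver kills itself if this lease expires
	if err != nil {
		log.Fatal(err)
	}
	idx, err := claimIndex(ctx, cli, 3, "10.0.0.8:7164", lease.ID)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("parameter server index:", idx)
	// The save path for checkpoints can now be inferred from idx.
}
```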
The figure above shows the data flow of a real production application (face recognition). Log data from the production environment is stored both as real-time streams (Kafka) and as offline data (HDFS). Multiple distributed data-processing jobs run in the cluster, such as streaming processing (online data process) and offline batch processing (offline data process), to preprocess the data and provide it to Paddle as training data. Users can also upload labeled data to the distributed storage to supplement the training data. The models produced by the deep-learning training jobs running on Paddle are then served to the online face-recognition application.
<li><a class="reference internal" href="#parameter-server-scaling">Parameter Server Scaling</a></li>
...
...
@@ -246,10 +248,12 @@
<ol class="simple">
<li>the <em>master process</em>, which dispatches tasks to</li>
<li>one or more <em>trainer processes</em>, which run distributed training and synchronize gradients/models via</li>
<li>one or more <em>parameter server processes</em>, where each holds a shard of the global model, receives uploaded gradients from every <em>trainer process</em>, and runs the optimization functions to update its shard of the parameters.</li>
</ol>
<p>Their relation is illustrated in the following graph:</p>
<p><img src="src/paddle-model-sharding.png"/></p>
<p>By coordinating these processes, PaddlePaddle supports both Synchronous Stochastic Gradient Descent (sync SGD) and Asynchronous Stochastic Gradient Descent (async SGD) for training user-defined neural network topologies.</p>
<p>When training with sync SGD, the parameter servers wait for all trainers to finish computing and uploading their gradients for a mini-batch, and only then send the updated parameters back; training cannot proceed until every trainer has received the updated parameters. This creates a synchronization point between trainers. When training with async SGD, each trainer uploads its gradients and downloads new parameters independently, without synchronizing with other trainers. Async SGD is faster in terms of time per pass, but the gradients are noisier because trainers are likely to be computing against a stale model.</p>
<div class="section" id="master-process">
<span id="master-process"></span><h3>Master Process<a class="headerlink" href="#master-process" title="Permalink to this headline">¶</a></h3>
<p>The master process will:</p>
...
...
@@ -343,7 +347,6 @@
<li>Watches the trainer prefix keys <code class="docutils literal"><span class="pre">/trainer/</span></code> on etcd to find the live trainers.</li>
<li>Starts dispatching the tasks to the trainers, and updates the task queue using an etcd transaction to ensure the lock is held during the update.</li>
</ol>
<p>The master process will kill itself if its etcd lease expires.</p>
<p>If the master process dies for any reason, Kubernetes will restart it. It will be online again with all state recovered from etcd within a few minutes.</p>
</div>
<div class="section" id="trainer-process">
...
...
@@ -355,6 +358,7 @@
<li>Waits for tasks from the master to start training.</li>
</ol>
<p>If the trainer’s etcd lease expires, it will try to set the key <code class="docutils literal"><span class="pre">/trainer/<unique</span> <span class="pre">ID></span></code> again so that the master process can discover it again.</p>
<p>When a trainer fails, Kubernetes will try to restart it. The recovered trainer will fetch tasks from the TODO queue and continue training.</p>
</div>
<div class="section" id="parameter-server-process">
<span id="id3"></span><h3>Parameter Server Process<a class="headerlink" href="#parameter-server-process" title="Permalink to this headline">¶</a></h3>
...
...
@@ -376,6 +380,14 @@
<p>If the parameter server’s etcd lease expires, the parameter server will kill itself.</p>
<span id="parameter-server-checkpointing"></span><h2>Parameter Server Checkpointing<a class="headerlink" href="#parameter-server-checkpointing" title="Permalink to this headline">¶</a></h2>
<span id="store-and-dispatching-trainning-data"></span><h2>Store and dispatching training data<a class="headerlink" href="#store-and-dispatching-trainning-data" title="Permalink to this headline">¶</a></h2>