@@ -17,12 +17,16 @@ A training job will be created once a user asks Paddle cloud to train a model. The
1. the *master process*, which dispatches tasks to
1. one or more *trainer processes*, which run distributed training and synchronize gradients/models via
1. one or more *parameter server processes*, where each holds a shard of the global model and receives the gradients uploaded by every *trainer process*, so it can run the optimization functions to update its shard of the parameters (a sharding sketch follows the diagram below).
Their relation is illustrated in the following diagram:
<img src="src/paddle-model-sharding.png"/>
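As an illustration of the sharding, the sketch below hashes each parameter's name to a server index. The parameter names, hash function, and server count are assumptions for illustration only, not PaddlePaddle's actual scheme:

```python
# Illustrative only: sharding a global model across parameter servers
# by hashing each parameter's name to a server index.
import hashlib

NUM_PSERVERS = 4  # hypothetical number of parameter server processes

def shard_for(param_name):
    # A stable hash, so every trainer maps a name to the same shard.
    digest = hashlib.md5(param_name.encode()).hexdigest()
    return int(digest, 16) % NUM_PSERVERS

for name in ("fc1.w", "fc1.b", "fc2.w", "fc2.b"):
    print(name, "-> pserver", shard_for(name))
```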
By coordinating these processes, PaddlePaddle supports both synchronous stochastic gradient descent (sync SGD) and asynchronous stochastic gradient descent (async SGD) for training user-defined neural network topologies.
When training with sync SGD, the parameter servers wait for all trainers to finish uploading their gradients, then send the updated parameters back; training cannot proceed until every trainer has received them. This creates a synchronization point between trainers. When training with async SGD, each trainer uploads its gradients and downloads new parameters independently, without synchronizing with the other trainers. Async SGD is faster in terms of time per pass, but the gradients are noisier, since trainers are likely to be computing with a stale model.
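The difference can be made concrete with a toy, single-process simulation of the two update rules. This is numpy-only and models just the synchronization behavior; it is not PaddlePaddle's implementation:

```python
# Toy illustration of sync vs. async parameter updates; not Paddle code.
import numpy as np

LR = 0.1
param = np.zeros(4)  # stand-in for one parameter shard

def sync_sgd_step(gradients):
    """Sync SGD: wait for gradients from ALL trainers, average them,
    and apply a single update. This is the synchronization point."""
    global param
    param -= LR * np.mean(gradients, axis=0)

def async_sgd_apply(gradient):
    """Async SGD: apply each trainer's gradient as soon as it arrives.
    A slow trainer's gradient may be based on stale parameters."""
    global param
    param -= LR * gradient

grads = [np.ones(4), 2 * np.ones(4), 3 * np.ones(4)]  # from 3 trainers
sync_sgd_step(grads)        # one barrier, one update
for g in grads:
    async_sgd_apply(g)      # three independent updates, no barrier
```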
### Master Process
The master process will:
...
...
@@ -118,8 +122,6 @@ When the master is started by Kubernetes, it executes the following steps at
1. Watches the trainer prefix keys `/trainer/` on etcd to find the live trainers.
1. Starts dispatching the tasks to the trainers, and updates the task queue using an etcd transaction to ensure the lock is held during the update.
The master process will kill itself if its etcd lease expires.
When the master process dies for any reason, Kubernetes will restart it. It will be online again within a few minutes with all its state recovered from etcd. A sketch of this etcd usage follows.
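The sketch below assumes the python-etcd3 client; the `/task/...` key layout is an illustrative assumption, while `/trainer/` matches the prefix described above:

```python
# A minimal sketch of the master's etcd usage (python-etcd3 assumed).
import etcd3

client = etcd3.client(host="127.0.0.1", port=2379)

# The master's liveness is tied to a lease: if it stops refreshing the
# lease, the lease expires, the master kills itself, and Kubernetes
# restarts it with state recovered from etcd.
lease = client.lease(ttl=10)

def live_trainers():
    # Trainers register themselves under the /trainer/ prefix.
    return [meta.key.decode() for _, meta in client.get_prefix("/trainer/")]

def dispatch(task_id, trainer_key):
    # Move a task from TODO to PENDING in one etcd transaction, so the
    # update is atomic even if another process races with the master.
    todo = "/task/todo/%s" % task_id
    pending = "/task/pending/%s" % task_id
    succeeded, _ = client.transaction(
        compare=[client.transactions.version(todo) > 0],
        success=[client.transactions.delete(todo),
                 client.transactions.put(pending, trainer_key)],
        failure=[],
    )
    return succeeded
```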
### Trainer Process
...
...
@@ -132,6 +134,8 @@ When the trainer is started by Kubernetes, it executes the following steps a
If the trainer's etcd lease expires, it will try to set the key `/trainer/<unique ID>` again so that the master process can discover the trainer again.
When a trainer fails, Kubernetes will try to restart it. The recovered trainer will fetch tasks from the TODO queue and continue training.
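A sketch of this self-registration loop, again assuming the python-etcd3 client; the TTL and unique-ID scheme are illustrative assumptions:

```python
# A sketch of trainer registration on etcd (python-etcd3 assumed).
import time
import uuid
import etcd3

client = etcd3.client()
trainer_key = "/trainer/%s" % uuid.uuid4()  # the trainer's unique ID

def register():
    # Bind the key to a fresh lease so it vanishes if we stop renewing it.
    lease = client.lease(ttl=10)
    client.put(trainer_key, "alive", lease=lease)
    return lease

lease = register()
while True:
    time.sleep(3)
    try:
        lease.refresh()
    except Exception:
        pass  # e.g. a network partition; the key check below decides
    value, _ = client.get(trainer_key)
    if value is None:
        # The lease expired and the key is gone: set it again so the
        # master can rediscover this trainer.
        lease = register()
```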
### Parameter Server Process
When the parameter server is started by Kubernetes, it executes the following steps at startup:
...
...
@@ -153,6 +157,13 @@ When the parameter server is started by Kubernetes, it executes the following st
If the parameter server's etcd lease expires, the parameter server will kill itself.
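A sketch of this watchdog behavior, assuming the python-etcd3 client; the `/pserver/<shard>` key is an illustrative assumption. Unlike a trainer, the parameter server does not re-register on lease expiry: it kills itself and lets Kubernetes restart it:

```python
# A sketch of the parameter server's lease watchdog (python-etcd3 assumed).
import sys
import time
import etcd3

client = etcd3.client()
lease = client.lease(ttl=10)
client.put("/pserver/0", "alive", lease=lease)  # this server holds shard 0

while True:
    time.sleep(3)
    try:
        lease.refresh()
    except Exception:
        pass  # e.g. a network partition; the key check below decides
    value, _ = client.get("/pserver/0")
    if value is None:
        sys.exit(1)  # lease expired: die; Kubernetes will restart us
```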
The figure above shows the data flow of a real production application (face recognition). Log data from the production environment is stored both as a real-time stream (Kafka) and as offline data (HDFS). Several distributed data processing jobs run on the cluster, such as online (streaming) data processing and offline batch processing, to preprocess the data and provide it to Paddle as training data. Users can also upload labeled data to distributed storage to supplement the training data. The model produced by the deep learning training that runs on Paddle is then served to the online face recognition application.
<spanid="parameter-server-checkpointing"></span><h2>Parameter Server Checkpointing<aclass="headerlink"href="#parameter-server-checkpointing"title="Permalink to this headline">¶</a></h2>
<spanid="store-and-dispatching-trainning-data"></span><h2>Store and dispatching trainning data<aclass="headerlink"href="#store-and-dispatching-trainning-data"title="Permalink to this headline">¶</a></h2>
<liclass="toctree-l2"><aclass="reference internal"href="../../getstarted/build_and_install/index_en.html">Install and Build</a><ul>
<liclass="toctree-l3"><aclass="reference internal"href="../../getstarted/build_and_install/docker_install_en.html">PaddlePaddle in Docker Containers</a></li>
<liclass="toctree-l2"><aclass="reference internal"href="../../howto/usage/k8s/k8s_en.html">Paddle On Kubernetes</a></li>
<liclass="toctree-l2"><aclass="reference internal"href="../../howto/usage/k8s/k8s_aws_en.html">Distributed PaddlePaddle Training on AWS with Kubernetes</a></li>
<liclass="toctree-l2"><aclass="reference internal"href="../../howto/dev/new_layer_en.html">Write New Layers</a></li>
Built with <ahref="http://sphinx-doc.org/">Sphinx</a> using a <ahref="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <ahref="https://readthedocs.org">Read the Docs</a>.
<liclass="toctree-l2"><aclass="reference internal"href="../../getstarted/build_and_install/index_en.html">Install and Build</a><ul>
<liclass="toctree-l3"><aclass="reference internal"href="../../getstarted/build_and_install/docker_install_en.html">PaddlePaddle in Docker Containers</a></li>
<liclass="toctree-l2"><aclass="reference internal"href="../../howto/usage/k8s/k8s_en.html">Paddle On Kubernetes</a></li>
<liclass="toctree-l2"><aclass="reference internal"href="../../howto/usage/k8s/k8s_aws_en.html">Distributed PaddlePaddle Training on AWS with Kubernetes</a></li>
<liclass="toctree-l2"><aclass="reference internal"href="../../howto/dev/new_layer_en.html">Write New Layers</a></li>
<p>在上图中显示了在一个实际生产环境中的应用(人脸识别)的数据流图。生产环境的日志数据会通过实时流的方式(Kafka)和离线数据的方式(HDFS)存储,并在集群中运行多个分布式数据处理任务,比如流式数据处理(online data process),离线批处理(offline data process)完成数据的预处理,提供给paddle作为训练数据。用于也可以上传labeled data到分布式存储补充训练数据。在paddle之上运行的深度学习训练输出的模型会提供给在线人脸识别的应用使用。</p>
</div>
<divclass="section"id="">
<spanid="id3"></span><h2>训练数据的存储<aclass="headerlink"href="#"title="Permalink to this headline">¶</a></h2>
<spanclass="n">r</span><spanclass="o">=</span><spanclass="n">batch_reader</span><spanclass="p">()</span><spanclass="c1"># create a iterator for one pass of data</span>
<spanid="todo"></span><h1>TODO<aclass="headerlink"href="#todo"title="Permalink to this headline">¶</a></h1>
<divclass="section"id="key-value-sharding">
<spanid="key-value-sharding"></span><h2>支持将数据合并成内部的文件格式(key-value),方便sharding与顺序读取<aclass="headerlink"href="#key-value-sharding"title="Permalink to this headline">¶</a></h2>
### Support user-defined data preprocessing jobs