<liclass="toctree-l3"><aclass="reference internal"href="../howto/cluster/multi_cluster/index_en.html">Use different clusters</a><ul>
<liclass="toctree-l4"><aclass="reference internal"href="../howto/cluster/multi_cluster/fabric_en.html">Cluster Training Using Fabric</a></li>
<liclass="toctree-l4"><aclass="reference internal"href="../howto/cluster/multi_cluster/openmpi_en.html">Cluster Training Using OpenMPI</a></li>
<liclass="toctree-l4"><aclass="reference internal"href="../howto/cluster/multi_cluster/k8s_en.html">PaddlePaddle On Kubernetes</a></li>
<liclass="toctree-l4"><aclass="reference internal"href="../howto/cluster/multi_cluster/k8s_aws_en.html">Distributed PaddlePaddle Training on AWS with Kubernetes</a></li>
<spanid="design-doc-parallel-do-in-paddlepaddle"></span><h1>Design Doc: Parallel_Do in PaddlePaddle<aclass="headerlink"href="#design-doc-parallel-do-in-paddlepaddle"title="Permalink to this headline">¶</a></h1>
<p>In PaddlePaddle, we use the parallel_do primitive to represent multithreaded data-parallel processing.</p>
<divclass="section"id="design-overview">
<spanid="design-overview"></span><h2>Design overview<aclass="headerlink"href="#design-overview"title="Permalink to this headline">¶</a></h2>
<p>The definition of the parallel_do op looks like the following:</p>
<divclass="highlight-c++"><divclass="highlight"><pre><span></span><spanclass="n">AddInput</span><spanclass="p">(</span><spanclass="n">kInputs</span><spanclass="p">,</span><spanclass="s">"Inputs needed to be split onto different devices"</span><spanclass="p">).</span><spanclass="n">AsDuplicable</span><spanclass="p">();</span>
<spanclass="n">AddInput</span><spanclass="p">(</span><spanclass="n">kParameters</span><spanclass="p">,</span><spanclass="s">"Parameters are duplicated over different devices"</span><spanclass="p">)</span>
<spanclass="n">AddInput</span><spanclass="p">(</span><spanclass="n">kPlaces</span><spanclass="p">,</span><spanclass="s">"Devices used for parallel processing"</span><spanclass="p">);</span>
<spanclass="n">AddOutput</span><spanclass="p">(</span><spanclass="n">kOutputs</span><spanclass="p">,</span><spanclass="s">"Outputs needed to be merged from different devices"</span><spanclass="p">).</span><spanclass="n">AsDuplicable</span><spanclass="p">();</span>
<spanclass="s">"List of operaters to be executed in parallel"</span><spanclass="p">);</span>
</pre></div>
</div>
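<p>For orientation, the sketch below shows how an operator with these inputs and outputs is typically driven from the Python side. It is a hedged illustration only: the names <code class="docutils literal"><span class="pre">get_places</span></code>, <code class="docutils literal"><span class="pre">ParallelDo</span></code>, <code class="docutils literal"><span class="pre">read_input</span></code> and <code class="docutils literal"><span class="pre">write_output</span></code> are assumptions about the Fluid Python front end, not part of this design doc.</p>
<div class="highlight-python"><div class="highlight"><pre># Hedged sketch of driving parallel_do from Python; names are assumptions
# about the Fluid front end, not a verbatim API reference.
import paddle.fluid as fluid

x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y = fluid.layers.data(name='y', shape=[1], dtype='float32')

places = fluid.layers.get_places()        # kPlaces: devices to run on
pd = fluid.layers.ParallelDo(places)      # wraps the parallel_do op
with pd.do():
    x_ = pd.read_input(x)                 # kInputs: split onto each device
    y_ = pd.read_input(y)
    pred = fluid.layers.fc(input=x_, size=1)                     # fc weights are kParameters
    loss = fluid.layers.square_error_cost(input=pred, label=y_)
    pd.write_output(loss)                 # kOutputs: merged after the block runs
loss = pd()
avg_loss = fluid.layers.mean(x=loss)
</pre></div></div>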
<p>A vanilla implementation of parallel_do can be shown as the following (<code class="docutils literal"><span class="pre">|</span></code> means single thread and
<code class="docutils literal"><span class="pre">||||</span></code> means multiple threads):</p>
<div class="highlight-default"><div class="highlight"><pre># start_program will be run by executor(CPUPlace), all w1, w2 will be allocated on CPU
</pre></div></div>
<spanid="proformance-imporvement"></span><h2>Proformance Imporvement<aclass="headerlink"href="#proformance-imporvement"title="Permalink to this headline">¶</a></h2>
<p>There are serial places we can make this parallel_do faster.</p>
<spanid="forward-split-input-onto-different-devices"></span><h3>forward: split input onto different devices<aclass="headerlink"href="#forward-split-input-onto-different-devices"title="Permalink to this headline">¶</a></h3>
<p>If the input of parallel_do is independent of any prior operators, we can avoid this step by
prefetching the input onto the different devices in a separate background thread. A sketch of such a prefetching loop is shown below.</p>
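<p>A minimal sketch of such background prefetching, assuming a producer thread that splits and stages the next batch while training consumes the current one; <code class="docutils literal"><span class="pre">split_onto_devices</span></code> is a stand-in for the real split-and-copy work, not Fluid API.</p>
<div class="highlight-python"><div class="highlight"><pre># Sketch: prefetch input shards for the next iteration in a background thread.
import queue
import threading
import numpy as np

NUM_DEVICES = 4
prefetched = queue.Queue(maxsize=2)     # small buffer: at most 2 batches staged

def split_onto_devices(batch):
    # stand-in for the real "split + copy to device" work
    return np.array_split(batch, NUM_DEVICES)

def producer(batches):
    for batch in batches:
        prefetched.put(split_onto_devices(batch))   # runs ahead of training
    prefetched.put(None)                            # end-of-data marker

batches = [np.random.rand(8, 4) for _ in range(10)]
threading.Thread(target=producer, args=(batches,), daemon=True).start()

while True:
    shards = prefetched.get()           # shards are already split per device
    if shards is None:
        break
    # ... run parallel_do on the prefetched shards here ...
</pre></div></div>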
<spanid="forward-copy-parameter-to-onto-different-devices"></span><h3>forward: Copy parameter to onto different devices<aclass="headerlink"href="#forward-copy-parameter-to-onto-different-devices"title="Permalink to this headline">¶</a></h3>
<p>We can avoid this step by making each device have a copy of the parameter. This requires:</p>
<olclass="simple">
<li><codeclass="docutils literal"><spanclass="pre">fluid.default_start_up_program()</span></code> to be run on all devices</li>
<li>In the backward, allreduce param@grad at different devices, this requires<ol>
<li><codeclass="docutils literal"><spanclass="pre">backward.py</span></code> add <codeclass="docutils literal"><spanclass="pre">allreduce</span></code> operators at parallel_do_grad</li>
<li><codeclass="docutils literal"><spanclass="pre">allreduce</span></code> operators need to be called in async mode to achieve maximum throughput</li>
</ol>
</li>
<li>apply the gradient-related ops (i.e. clipping, normalization, decay, sgd) on the different devices in parallel</li>
</ol>
<p>By doing so, we also avoided “backward: accumulate param@grad from different devices to the first device”.
And the ProgramDesc looks like the following</p>
<divclass="highlight-default"><divclass="highlight"><pre><span></span><spanclass="c1"># w1, w2 will be allocated on all GPUs</span>