<!DOCTYPE html> <!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]--> <!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]--> <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <title>Distributed Training - PaddlePaddle Documentation</title> <link rel="stylesheet" href="../../../_static/css/theme.css" type="text/css" /> <link rel="index" title="Index" href="../../../genindex.html"/> <link rel="search" title="Search" href="../../../search.html"/> <link rel="top" title="PaddlePaddle Documentation" href="../../../index.html"/> <link rel="up" title="Advanced Guide" href="../../index_cn.html"/> <link rel="next" title="Launching Cluster Training with Fabric" href="fabric_cn.html"/> <link rel="prev" title="Detailed Description" href="../cmd_parameter/detail_introduction_cn.html"/> <link rel="stylesheet" href="https://cdn.jsdelivr.net/perfect-scrollbar/0.6.14/css/perfect-scrollbar.min.css" type="text/css" /> <link rel="stylesheet" href="../../../_static/css/override.css" type="text/css" /> <script> var _hmt = _hmt || []; (function() { var hm = document.createElement("script"); hm.src = "//hm.baidu.com/hm.js?b9a314ab40d04d805655aab1deee08ba"; var s = document.getElementsByTagName("script")[0]; s.parentNode.insertBefore(hm, s); })(); </script> <script src="../../../_static/js/modernizr.min.js"></script> </head> <body class="wy-body-for-nav" role="document"> <header class="site-header"> <div class="site-logo"> <a href="/"><img src="../../../_static/images/PP_w.png"></a> </div> <div class="site-nav-links"> <div class="site-menu"> <a class="fork-on-github" href="https://github.com/PaddlePaddle/Paddle" target="_blank"><i class="fa fa-github"></i>Fork me on Github</a> <div class="language-switcher dropdown"> <a type="button" data-toggle="dropdown"> <span>English</span> <i class="fa fa-angle-up"></i> <i class="fa fa-angle-down"></i> </a> <ul class="dropdown-menu"> <li><a href="/doc_cn">中文</a></li> <li><a href="/doc">English</a></li> </ul> </div> <ul class="site-page-links"> <li><a 
href="/">Home</a></li> </ul> </div> <div class="doc-module"> <ul class="current"> <li class="toctree-l1"><a class="reference internal" href="../../../getstarted/index_cn.html">Getting Started</a></li> <li class="toctree-l1 current"><a class="reference internal" href="../../index_cn.html">Advanced Guide</a></li> <li class="toctree-l1"><a class="reference internal" href="../../../api/index_cn.html">API</a></li> <li class="toctree-l1"><a class="reference internal" href="../../../faq/index_cn.html">FAQ</a></li> </ul> <div role="search"> <form id="rtd-search-form" class="wy-form" action="../../../search.html" method="get"> <input type="text" name="q" placeholder="Search docs" /> <input type="hidden" name="check_keywords" value="yes" /> <input type="hidden" name="area" value="default" /> </form> </div> </div> </div> </header> <div class="main-content-wrap"> <nav class="doc-menu-vertical" role="navigation"> <ul class="current"> <li class="toctree-l1"><a class="reference internal" href="../../../getstarted/index_cn.html">Getting Started</a><ul> <li class="toctree-l2"><a class="reference internal" href="../../../getstarted/build_and_install/index_cn.html">Install and Build</a><ul> <li class="toctree-l3"><a class="reference internal" href="../../../getstarted/build_and_install/pip_install_cn.html">Install with pip</a></li> <li class="toctree-l3"><a class="reference internal" href="../../../getstarted/build_and_install/docker_install_cn.html">Install and Run with Docker</a></li> <li class="toctree-l3"><a class="reference internal" href="../../dev/build_cn.html">Build and Test PaddlePaddle with Docker</a></li> <li class="toctree-l3"><a class="reference internal" href="../../../getstarted/build_and_install/build_from_source_cn.html">Build from Source</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="../../../getstarted/concepts/use_concepts_cn.html">Basic Concepts</a></li> </ul> </li> <li class="toctree-l1 current"><a class="reference internal" href="../../index_cn.html">Advanced Guide</a><ul class="current"> <li class="toctree-l2"><a class="reference 
internal" href="../cmd_parameter/index_cn.html">Setting Command-Line Arguments</a><ul> <li class="toctree-l3"><a class="reference internal" href="../cmd_parameter/use_case_cn.html">Use Cases</a></li> <li class="toctree-l3"><a class="reference internal" href="../cmd_parameter/arguments_cn.html">Argument Overview</a></li> <li class="toctree-l3"><a class="reference internal" href="../cmd_parameter/detail_introduction_cn.html">Detailed Description</a></li> </ul> </li> <li class="toctree-l2 current"><a class="current reference internal" href="#">Distributed Training</a><ul> <li class="toctree-l3"><a class="reference internal" href="fabric_cn.html">Fabric Cluster</a></li> <li class="toctree-l3"><a class="reference internal" href="openmpi_cn.html">OpenMPI Cluster</a></li> <li class="toctree-l3"><a class="reference internal" href="k8s_cn.html">Kubernetes on a Single Node</a></li> <li class="toctree-l3"><a class="reference internal" href="k8s_distributed_cn.html">Distributed Training on Kubernetes</a></li> <li class="toctree-l3"><a class="reference internal" href="k8s_aws_cn.html">Kubernetes Cluster Training on AWS</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="../capi/index_cn.html">PaddlePaddle C-API</a><ul> <li class="toctree-l3"><a class="reference internal" href="../capi/compile_paddle_lib_cn.html">Building the PaddlePaddle Inference Library</a></li> <li class="toctree-l3"><a class="reference internal" href="../capi/organization_of_the_inputs_cn.html">Organizing Input/Output Data</a></li> <li class="toctree-l3"><a class="reference internal" href="../capi/workflow_of_capi_cn.html">C-API Workflow</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="../../dev/contribute_to_paddle_cn.html">How to Contribute Code</a></li> <li class="toctree-l2"><a class="reference internal" href="../../dev/write_docs_cn.html">How to Contribute/Modify Documentation</a></li> <li class="toctree-l2"><a class="reference internal" href="../../deep_model/rnn/index_cn.html">RNN Models</a><ul> <li class="toctree-l3"><a class="reference internal" href="../../deep_model/rnn/rnn_config_cn.html">RNN Configuration</a></li> <li class="toctree-l3"><a class="reference 
internal" href="../../deep_model/rnn/recurrent_group_cn.html">Recurrent Group Tutorial</a></li> <li class="toctree-l3"><a class="reference internal" href="../../deep_model/rnn/hierarchical_layer_cn.html">Layers That Accept Nested Sequences as Input</a></li> <li class="toctree-l3"><a class="reference internal" href="../../deep_model/rnn/hrnn_rnn_api_compare_cn.html">Comparing Single-Level and Nested RNN APIs</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="../../optimization/gpu_profiling_cn.html">GPU Profiling and Tuning</a></li> </ul> </li> <li class="toctree-l1"><a class="reference internal" href="../../../api/index_cn.html">API</a><ul> <li class="toctree-l2"><a class="reference internal" href="../../../api/v2/model_configs.html">Model Configuration</a><ul> <li class="toctree-l3"><a class="reference internal" href="../../../api/v2/config/activation.html">Activation</a></li> <li class="toctree-l3"><a class="reference internal" href="../../../api/v2/config/layer.html">Layers</a></li> <li class="toctree-l3"><a class="reference internal" href="../../../api/v2/config/evaluators.html">Evaluators</a></li> <li class="toctree-l3"><a class="reference internal" href="../../../api/v2/config/optimizer.html">Optimizer</a></li> <li class="toctree-l3"><a class="reference internal" href="../../../api/v2/config/pooling.html">Pooling</a></li> <li class="toctree-l3"><a class="reference internal" href="../../../api/v2/config/networks.html">Networks</a></li> <li class="toctree-l3"><a class="reference internal" href="../../../api/v2/config/attr.html">Parameter Attribute</a></li> </ul> </li> <li class="toctree-l2"><a class="reference internal" href="../../../api/v2/data.html">Data Access</a><ul> <li class="toctree-l3"><a class="reference internal" href="../../../api/v2/data/data_reader.html">Data Reader Interface</a></li> <li class="toctree-l3"><a class="reference internal" href="../../../api/v2/data/image.html">Image Interface</a></li> <li class="toctree-l3"><a class="reference internal" href="../../../api/v2/data/dataset.html">Dataset</a></li> </ul> 
</li> <li class="toctree-l2"><a class="reference internal" href="../../../api/v2/run_logic.html">Training and Inference</a></li> <li class="toctree-l2"><a class="reference internal" href="../../../api/v2/fluid.html">Fluid</a><ul> <li class="toctree-l3"><a class="reference internal" href="../../../api/v2/fluid/layers.html">layers</a></li> <li class="toctree-l3"><a class="reference internal" href="../../../api/v2/fluid/data_feeder.html">data_feeder</a></li> <li class="toctree-l3"><a class="reference internal" href="../../../api/v2/fluid/executor.html">executor</a></li> <li class="toctree-l3"><a class="reference internal" href="../../../api/v2/fluid/initializer.html">initializer</a></li> <li class="toctree-l3"><a class="reference internal" href="../../../api/v2/fluid/evaluator.html">evaluator</a></li> <li class="toctree-l3"><a class="reference internal" href="../../../api/v2/fluid/nets.html">nets</a></li> <li class="toctree-l3"><a class="reference internal" href="../../../api/v2/fluid/optimizer.html">optimizer</a></li> <li class="toctree-l3"><a class="reference internal" href="../../../api/v2/fluid/param_attr.html">param_attr</a></li> <li class="toctree-l3"><a class="reference internal" href="../../../api/v2/fluid/profiler.html">profiler</a></li> <li class="toctree-l3"><a class="reference internal" href="../../../api/v2/fluid/regularizer.html">regularizer</a></li> <li class="toctree-l3"><a class="reference internal" href="../../../api/v2/fluid/io.html">io</a></li> </ul> </li> </ul> </li> <li class="toctree-l1"><a class="reference internal" href="../../../faq/index_cn.html">FAQ</a><ul> <li class="toctree-l2"><a class="reference internal" href="../../../faq/build_and_install/index_cn.html">Build, Install, and Unit Tests</a></li> <li class="toctree-l2"><a class="reference internal" href="../../../faq/model/index_cn.html">Model Configuration</a></li> <li class="toctree-l2"><a class="reference internal" href="../../../faq/parameter/index_cn.html">Parameter Settings</a></li> <li class="toctree-l2"><a class="reference internal" 
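href="../../../faq/local/index_cn.html">Local Training and Inference</a></li> <li class="toctree-l2"><a class="reference internal" href="../../../faq/cluster/index_cn.html">Cluster Training and Inference</a></li> </ul> </li> </ul> </nav> <section class="doc-content-wrap"> <div role="navigation" aria-label="breadcrumbs navigation"> <ul class="wy-breadcrumbs"> <li><a href="../../index_cn.html">Advanced Guide</a> &gt; </li> <li>Distributed Training</li> </ul> </div> <div class="wy-nav-content" id="doc-content"> <div class="rst-content"> <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article"> <div itemprop="articleBody"> <div class="section" id=""> <span id="id1"></span><h1>Distributed Training<a class="headerlink" href="#" title="Permalink to this headline">¶</a></h1> <div class="section" id=""> <span id="id2"></span><h2>Overview<a class="headerlink" href="#" title="Permalink to this headline">¶</a></h2>
<p>This document describes how to run distributed training with PaddlePaddle on different cluster frameworks. The distributed training architecture is shown below:</p>
<p><img src="https://user-images.githubusercontent.com/13348433/31772175-5f419eca-b511-11e7-9db7-5231fe3d9ccb.png" width="500"></p>
<ul class="simple">
<li>Data shard: The data used to train the neural network is split into multiple parts, and each trainer uses one part.</li>
<li>Trainer (compute node): After starting, each trainer reads its own shard of the data, runs the forward and backward passes of the neural network, and communicates with the parameter servers. After training on a certain amount of data, it uploads the computed gradients and then downloads the optimized, updated network parameters.</li>
<li>Parameter server: Each parameter server stores only a portion of the network's parameters. It receives the gradients uploaded by the trainers, performs the optimization update, and sends the updated parameters back to every trainer.</li>
</ul>
<p>In this way, trainers and parameter servers cooperate to train the neural network with SGD. PaddlePaddle supports both synchronous stochastic gradient descent (SGD) and asynchronous SGD.</p>
<p>When training with synchronous SGD, PaddlePaddle uses a synchronization barrier so that gradient submission and parameter updates happen in order. With asynchronous SGD, parameters are updated without waiting for all trainers to submit their gradients, which greatly increases parallelism: parameter servers do not depend on one another and receive gradients and update parameters in parallel; a parameter server does not wait for every trainer to submit gradients before moving on to the next step; and trainers do not depend on one another and train the model in parallel. Although asynchronous SGD increases the parallelism of parameter updates, it does not guarantee that parameters are updated in sync. At any moment the parameters stored on one parameter server may be newer than those on another, so the gradients are noisier than with synchronous SGD.</p>
</div> <div class="section" id=""> <span id="id3"></span><h2>Preparing the Environment<a class="headerlink" href="#" title="Permalink to this headline">¶</a></h2> <ol class="simple">
<li>Prepare your compute cluster. A cluster usually consists of a group of Linux servers, from a few machines to several thousand. The servers are connected through a local area network (LAN), and each server has an IP address that is unique within the cluster (or a hostname that DNS can resolve). Each machine in the cluster is usually called a "node".</li> 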
<li>PaddlePaddle must be installed on every node in the cluster. To enable GPUs, the corresponding GPU driver and CUDA must also be installed on the nodes. For the available installation methods, see <a class="reference external" href="http://www.paddlepaddle.org/docs/develop/documentation/zh/getstarted/build_and_install/index_cn.html">build_and_install</a>. We recommend the <a class="reference external" href="http://www.paddlepaddle.org/docs/develop/documentation/zh/getstarted/build_and_install/docker_install_cn.html">Docker</a> installation method for a quick install.</li> </ol>
<p>After installation, run the following command to check the installed version (with a Docker installation, run it inside the container: <code class="docutils literal"><span class="pre">docker</span> <span class="pre">run</span> <span class="pre">-it</span> <span class="pre">paddlepaddle/paddle:[tag]</span> <span class="pre">/bin/bash</span></code>):</p>
<div class="highlight-bash"><div class="highlight"><pre><span></span>$ paddle version
PaddlePaddle <span class="m">0</span>.10.0, compiled with
    with_avx: ON
    with_gpu: OFF
    with_double: OFF
    with_python: ON
    with_rdma: OFF
    with_timer: OFF
</pre></div> </div>
<p>The following sections use the code in <code class="docutils literal"><span class="pre">doc/howto/usage/cluster/src/word2vec</span></code> as an example of distributed training with the PaddlePaddle v2 API.</p>
</div> <div class="section" id=""> <span id="id4"></span><h2>Launch Parameters<a class="headerlink" href="#" title="Permalink to this headline">¶</a></h2> <div class="section" id=""> <span id="id5"></span><h3>Starting a Parameter Server<a class="headerlink" href="#" title="Permalink to this headline">¶</a></h3>
<p>Run the following command to start a parameter server, which then waits to exchange data with the trainers:</p>
<div class="highlight-bash"><div class="highlight"><pre><span></span>$ paddle pserver --port<span class="o">=</span><span class="m">7164</span> --ports_num<span class="o">=</span><span class="m">1</span> --ports_num_for_sparse<span class="o">=</span><span class="m">1</span> --num_gradient_servers<span class="o">=</span><span class="m">1</span>
</pre></div> </div>
<p>To run the pserver program in the background and save its output to a log file, run:</p>
<div class="highlight-bash"><div class="highlight"><pre><span></span>$ stdbuf -oL /usr/bin/nohup paddle pserver --port<span class="o">=</span><span class="m">7164</span> 
--ports_num<span class="o">=</span><span class="m">1</span> --ports_num_for_sparse<span class="o">=</span><span class="m">1</span> --num_gradient_servers<span class="o">=</span><span class="m">1</span> <span class="p">&amp;</span>&gt; pserver.log <span class="p">&amp;</span>
</pre></div> </div>
<p>Parameters:</p>
<ul class="simple">
<li>port: <strong>required, default 7164</strong>. The first port the pserver listens on. ports_num determines the total number of ports; the pserver listens on consecutive ports starting from this one for communication.</li>
<li>ports_num: <strong>required, default 1</strong>. The number of ports to listen on.</li>
<li>ports_num_for_sparse: <strong>required, default 0</strong>. The number of ports used for sparse parameter communication.</li>
<li>num_gradient_servers: <strong>required, default 1</strong>. The total number of gradient servers (that is, trainers) in the current training job.</li>
</ul>
</div> <div class="section" id=""> <span id="id6"></span><h3>Starting a Trainer<a class="headerlink" href="#" title="Permalink to this headline">¶</a></h3>
<p>Run the following command to start a trainer program written in Python (the file name is arbitrary, for example train.py):</p>
<div class="highlight-bash"><div class="highlight"><pre><span></span>$ python train.py
</pre></div> </div>
<p>The trainer must have network connectivity to the pservers to complete training. At startup it needs parameters such as the port and the pserver addresses so that it can connect to the pservers correctly. These parameters can be supplied through <a class="reference external" href="https://zh.wikipedia.org/wiki/环境变量">environment variables</a> or passed as arguments to <code class="docutils literal"><span class="pre">paddle.init()</span></code> in the program. If both <code class="docutils literal"><span class="pre">paddle.init()</span></code> arguments and environment variables are used, the arguments passed to <code class="docutils literal"><span class="pre">paddle.init()</span></code> take precedence.</p>
<p>Using environment variables:</p>
<div class="highlight-bash"><div class="highlight"><pre><span></span><span class="nb">export</span> <span class="nv">PADDLE_INIT_USE_GPU</span><span class="o">=</span>False
<span class="nb">export</span> <span class="nv">PADDLE_INIT_TRAINER_COUNT</span><span class="o">=</span><span class="m">1</span>
<span class="nb">export</span> <span class="nv">PADDLE_INIT_PORT</span><span class="o">=</span><span class="m">7164</span>
<span class="nb">export</span> <span class="nv">PADDLE_INIT_PORTS_NUM</span><span class="o">=</span><span class="m">1</span>
<span class="nb">export</span> <span class="nv">PADDLE_INIT_PORTS_NUM_FOR_SPARSE</span><span class="o">=</span><span 
class="m">1</span>
<span class="nb">export</span> <span class="nv">PADDLE_INIT_NUM_GRADIENT_SERVERS</span><span class="o">=</span><span class="m">1</span>
<span class="nb">export</span> <span class="nv">PADDLE_INIT_TRAINER_ID</span><span class="o">=</span><span class="m">0</span>
<span class="nb">export</span> <span class="nv">PADDLE_INIT_PSERVERS</span><span class="o">=</span><span class="m">127</span>.0.0.1
</pre></div> </div>
<p>Using arguments:</p>
<div class="highlight-python"><div class="highlight"><pre><span></span><span class="n">paddle</span><span class="o">.</span><span class="n">init</span><span class="p">(</span>
    <span class="n">use_gpu</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
    <span class="n">trainer_count</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
    <span class="n">port</span><span class="o">=</span><span class="mi">7164</span><span class="p">,</span>
    <span class="n">ports_num</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
    <span class="n">ports_num_for_sparse</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
    <span class="n">num_gradient_servers</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
    <span class="n">trainer_id</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
    <span class="n">pservers</span><span class="o">=</span><span class="s2">"127.0.0.1"</span><span class="p">)</span>
</pre></div> </div>
<p>Parameters:</p>
<ul class="simple">
<li>use_gpu: <strong>optional, default False</strong>. Whether to train on GPUs.</li>
<li>trainer_count: <strong>required, default 1</strong>. The number of threads in the current trainer.</li>
<li>port: <strong>required, default 7164</strong>. The port used to connect to the pserver.</li>
<li>ports_num: <strong>required, default 1</strong>. The number of ports used to connect to the pserver.</li>
<li>ports_num_for_sparse: <strong>required, default 0</strong>. The number of ports used for sparse parameter communication with the pserver.</li>
<li>num_gradient_servers: <strong>required, default 1</strong>. The total number of trainers in the current training job.</li> 
<li>trainer_id: <strong>required, default 0</strong>. The unique ID of each trainer, an integer starting from 0.</li>
<li>pservers: <strong>required, default 127.0.0.1</strong>. The list of IP addresses of the pservers started for the current training job; separate multiple IPs with ",".</li>
</ul>
</div> <div class="section" id=""> <span id="id7"></span><h3>Preparing the Dataset<a class="headerlink" href="#" title="Permalink to this headline">¶</a></h3>
<p>Following the sample data preparation script <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/cluster/src/word2vec/prepare.py">prepare.py</a>, prepare the training and validation datasets. We use the paddle.dataset.imikolov dataset and, based on the parallelism of the distributed job (the number of trainer nodes), set <code class="docutils literal"><span class="pre">SPLIT_COUNT</span></code> at the top of <code class="docutils literal"><span class="pre">prepare.py</span></code> to split the data into that many parts.</p>
<p>In production systems, the output of a MapReduce job is commonly used as the training data, so there tend to be many training files and their number is not fixed. In the trainer, the following modulo method can be used to assign training files to each trainer:</p>
<div class="highlight-python"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">os</span>

<span class="c1"># TRAINER_COUNT: total number of trainers; TRAINER_ID: this trainer's 0-based ID.</span>
<span class="n">train_list</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">flist</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="s2">"/train_data/"</span><span class="p">)</span>
<span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">flist</span><span class="p">:</span>
    <span class="n">suffix</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">"-"</span><span class="p">)[</span><span class="mi">1</span><span class="p">])</span>
    <span class="k">if</span> <span class="n">suffix</span> <span class="o">%</span> <span class="n">TRAINER_COUNT</span> <span class="o">==</span> <span class="n">TRAINER_ID</span><span class="p">:</span>
        <span class="n">train_list</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span 
class="n">f</span><span class="p">)</span>
</pre></div> </div>
<p>The sample program <code class="docutils literal"><span class="pre">prepare.py</span></code> splits the training set and the test set into multiple files each (3 in this example, with the suffixes <code class="docutils literal"><span class="pre">-00000</span></code>, <code class="docutils literal"><span class="pre">-00001</span></code>, and <code class="docutils literal"><span class="pre">-00002</span></code>):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">train</span><span class="o">.</span><span class="n">txt</span>
<span class="n">train</span><span class="o">.</span><span class="n">txt</span><span class="o">-</span><span class="mi">00000</span>
<span class="n">train</span><span class="o">.</span><span class="n">txt</span><span class="o">-</span><span class="mi">00001</span>
<span class="n">train</span><span class="o">.</span><span class="n">txt</span><span class="o">-</span><span class="mi">00002</span>
<span class="n">test</span><span class="o">.</span><span class="n">txt</span>
<span class="n">test</span><span class="o">.</span><span class="n">txt</span><span class="o">-</span><span class="mi">00000</span>
<span class="n">test</span><span class="o">.</span><span class="n">txt</span><span class="o">-</span><span class="mi">00001</span>
<span class="n">test</span><span class="o">.</span><span class="n">txt</span><span class="o">-</span><span class="mi">00002</span>
</pre></div> </div>
<p>In distributed training, each trainer process must be able to read its own portion of the data. Some distributed systems provide a distributed storage service, so data stored there can be read by every node in the cluster. Without distributed storage, the training data for each trainer node must be copied manually to that node.</p>
<p>The training data format and the <code class="docutils literal"><span class="pre">reader()</span></code> of the training program differ widely between training jobs, so developers need to split the training data and write <code class="docutils literal"><span class="pre">reader()</span></code> according to the needs of their own job.</p>
</div> <div class="section" id=""> <span id="id8"></span><h3>Preparing the Training Program<a class="headerlink" href="#" title="Permalink to this headline">¶</a></h3>
<p>For each training job, a workspace is created on every node. It contains the user's training program, the program's dependencies, and the mounted or downloaded training data shards.</p> 
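<p>Before launching a job, it can help to confirm that a node's workspace actually holds the files its trainer will read. The following is a minimal sketch, not part of the PaddlePaddle samples: the helper names <code>shard_path</code> and <code>missing_workspace_entries</code> are hypothetical, the directory names <code>train_data_dir</code> and <code>test_data_dir</code> are illustrative, and it assumes that shard <em>i</em> (named with a five-digit suffix, as in <code>train.txt-00000</code>) belongs to the trainer whose ID is <em>i</em>.</p>

```python
import os


def shard_path(data_dir, base_name, trainer_id):
    """Return the path of the shard file assumed to belong to one trainer.

    Assumption: shards are named BASE-NNNNN (five digits) and shard i is
    read by the trainer whose ID is i.
    """
    return os.path.join(data_dir, "%s-%05d" % (base_name, trainer_id))


def missing_workspace_entries(workspace, trainer_id):
    """List the expected workspace entries that are absent for this trainer.

    The directory and file names below are illustrative assumptions.
    """
    expected = [
        "train.py",
        shard_path("train_data_dir", "train.txt", trainer_id),
        shard_path("test_data_dir", "test.txt", trainer_id),
    ]
    return [p for p in expected
            if not os.path.exists(os.path.join(workspace, p))]
```

<p>For example, on the node that runs trainer 0, <code>missing_workspace_entries(".", 0)</code> returns an empty list once <code>train.py</code> and that trainer's training and test shards are in place. Returning the missing paths instead of raising an error lets a launcher script decide whether to download the data or abort.</p>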
<p>In the end, the workspace should look like this:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>.
|-- my_lib.py
|-- word_dict.pickle
|-- train.py
|-- train_data_dir/
|   |-- train.txt-00000
|   |-- train.txt-00001
|   |-- train.txt-00002
`-- test_data_dir/
    |-- test.txt-00000
    |-- test.txt-00001
    `-- test.txt-00002
</pre></div> </div>
<ul>
<li><p class="first"><code class="docutils literal"><span class="pre">my_lib.py</span></code>: library code called by <code class="docutils literal"><span class="pre">train.py</span></code>, such as user-defined helper functions or libraries like PIL.</p> </li>
<li><p class="first"><code class="docutils literal"><span class="pre">word_dict.pickle</span></code>: the dictionary data file used by <code class="docutils literal"><span class="pre">train.py</span></code>.</p> </li>
<li><p class="first"><code class="docutils literal"><span class="pre">train.py</span></code>: the training program; for the code, see <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/cluster/src/word2vec/api_train_v2_cluster.py">api_train_v2_cluster.py</a>. <strong><em>Note:</em></strong> for this sample code, when using a different distributed computing platform you may need to modify the beginning of <code class="docutils literal"><span class="pre">train.py</span></code> (shown below) so that it can locate the training data and read the configuration from environment variables:</p>
<div class="highlight-python"><div class="highlight"><pre><span></span><span class="n">cluster_train_file</span> <span class="o">=</span> <span class="s2">"./train_data_dir/train/train.txt"</span>
<span class="n">cluster_test_file</span> <span class="o">=</span> <span class="s2">"./test_data_dir/test/test.txt"</span>
<span class="n">node_id</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">getenv</span><span class="p">(</span><span class="s2">"OMPI_COMM_WORLD_RANK"</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">node_id</span><span class="p">:</span>
    <span class="k">raise</span> <span class="ne">EnvironmentError</span><span class="p">(</span><span class="s2">"must provide OMPI_COMM_WORLD_RANK"</span><span 
class="p">)</span>
</pre></div> </div> </li>
<li><p class="first"><code class="docutils literal"><span class="pre">train_data_dir</span></code>: the directory containing the training data. It can be mounted from distributed storage or downloaded to local disk before the job starts.</p> </li>
<li><p class="first"><code class="docutils literal"><span class="pre">test_data_dir</span></code>: the directory containing the test dataset.</p> </li>
</ul>
</div> </div> <div class="section" id=""> <span id="id9"></span><h2>Using Distributed Computing Platforms or Tools<a class="headerlink" href="#" title="Permalink to this headline">¶</a></h2>
<p>PaddlePaddle can build distributed training jobs on several distributed computing platforms, including:</p>
<ul class="simple">
<li><a class="reference external" href="http://kubernetes.io">Kubernetes</a>: Google's open-source container cluster scheduling framework, a complete cluster solution for large-scale production environments.</li>
<li><a class="reference external" href="https://www.open-mpi.org">OpenMPI</a>: a mature high-performance parallel computing framework.</li>
<li><a class="reference external" href="http://www.fabfile.org">Fabric</a>: a cluster management tool. <code class="docutils literal"><span class="pre">Fabric</span></code> can be used to write scripts that submit and manage cluster jobs.</li>
</ul>
<p>For each cluster platform, the methods for starting and stopping cluster jobs are described separately. All of these examples can be found in <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/scripts/cluster_train_v2">cluster_train_v2</a>.</p>
<p>When training on a distributed computing platform, once a job has been scheduled onto the cluster, the platform usually provides the parameters the job needs, such as the node ID, the node IP, and the number of nodes in the job, through an API or environment variables.</p>
</div> <div class="section" id=""> <span id="id10"></span><h2>Running on Different Clusters<a class="headerlink" href="#" title="Permalink to this headline">¶</a></h2> <div class="toctree-wrapper compound"> <ul> <li class="toctree-l1"><a class="reference internal" href="fabric_cn.html">Fabric Cluster</a></li> <li class="toctree-l1"><a class="reference internal" href="openmpi_cn.html">OpenMPI Cluster</a></li> <li class="toctree-l1"><a class="reference internal" href="k8s_cn.html">Kubernetes on a Single Node</a></li> <li class="toctree-l1"><a class="reference internal" href="k8s_distributed_cn.html">Distributed Training on Kubernetes</a></li> <li class="toctree-l1"><a class="reference internal" href="k8s_aws_cn.html">Kubernetes Cluster Training on AWS</a></li> </ul> </div> </div> </div> </div> </div> <footer> <div class="rst-footer-buttons" role="navigation" aria-label="footer 
navigation"> <a href="fabric_cn.html" class="btn btn-neutral float-right" title="Launching Cluster Training with Fabric" accesskey="n">Next <span class="fa fa-arrow-circle-right"></span></a> <a href="../cmd_parameter/detail_introduction_cn.html" class="btn btn-neutral" title="Detailed Description" accesskey="p"><span class="fa fa-arrow-circle-left"></span> Previous</a> </div> <hr/> <div role="contentinfo"> <p> © Copyright 2016, PaddlePaddle developers. </p> </div> Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>. </footer> </div> </div> </section> </div> <script type="text/javascript"> var DOCUMENTATION_OPTIONS = { URL_ROOT:'../../../', VERSION:'', COLLAPSE_INDEX:false, FILE_SUFFIX:'.html', HAS_SOURCE: true, SOURCELINK_SUFFIX: ".txt", }; </script> <script type="text/javascript" src="../../../_static/jquery.js"></script> <script type="text/javascript" src="../../../_static/underscore.js"></script> <script type="text/javascript" src="../../../_static/doctools.js"></script> <script type="text/javascript" src="../../../_static/translations.js"></script> <script type="text/javascript" src="https://cdn.bootcss.com/mathjax/2.7.0/MathJax.js"></script> <script type="text/javascript" src="../../../_static/js/theme.js"></script> <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script> <script src="https://cdn.jsdelivr.net/perfect-scrollbar/0.6.14/js/perfect-scrollbar.jquery.min.js"></script> <script src="../../../_static/js/paddle_doc_init.js"></script> </body> </html>