<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">


<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    
    <title>Cluster Training &#8212; PaddlePaddle  documentation</title>
    
    <link rel="stylesheet" href="../../_static/classic.css" type="text/css" />
    <link rel="stylesheet" href="../../_static/pygments.css" type="text/css" />
    
    <script type="text/javascript">
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '../../',
        VERSION:     '',
        COLLAPSE_INDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script>
    <script type="text/javascript" src="../../_static/jquery.js"></script>
    <script type="text/javascript" src="../../_static/underscore.js"></script>
    <script type="text/javascript" src="../../_static/doctools.js"></script>
    <script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
    <link rel="top" title="PaddlePaddle  documentation" href="../../index.html" />
    <link rel="up" title="Cluster Train" href="../index.html" />
    <link rel="next" title="Layer Documents" href="../../layer.html" />
    <link rel="prev" title="Cluster Train" href="../index.html" /> 
<script>
var _hmt = _hmt || [];
(function() {
  var hm = document.createElement("script");
  hm.src = "//hm.baidu.com/hm.js?b9a314ab40d04d805655aab1deee08ba";
  var s = document.getElementsByTagName("script")[0]; 
  s.parentNode.insertBefore(hm, s);
})();
</script>

  </head>
  <body role="document">
    <div class="related" role="navigation" aria-label="related navigation">
      <h3>Navigation</h3>
      <ul>
        <li class="right" style="margin-right: 10px">
          <a href="../../genindex.html" title="General Index"
             accesskey="I">index</a></li>
        <li class="right" >
          <a href="../../py-modindex.html" title="Python Module Index"
             >modules</a> |</li>
        <li class="right" >
          <a href="../../layer.html" title="Layer Documents"
             accesskey="N">next</a> |</li>
        <li class="right" >
          <a href="../index.html" title="Cluster Train"
             accesskey="P">previous</a> |</li>
        <li class="nav-item nav-item-0"><a href="../../index.html">PaddlePaddle  documentation</a> &#187;</li>
          <li class="nav-item nav-item-1"><a href="../index.html" accesskey="U">Cluster Train</a> &#187;</li> 
      </ul>
    </div>  

    <div class="document">
      <div class="documentwrapper">
        <div class="bodywrapper">
          <div class="body" role="main">
            
  <div class="section" id="cluster-training">
<span id="cluster-training"></span><h1>Cluster Training<a class="headerlink" href="#cluster-training" title="Permalink to this headline"></a></h1>
<p>We provide some simple scripts in <code class="docutils literal"><span class="pre">paddle/scripts/cluster_train</span></code> to help you launch cluster training jobs that use PaddlePaddle&#8217;s distributed training. For MPI and other cluster schedulers, you can use these basic scripts as a reference for implementing a more robust cluster training platform yourself.</p>
<p>The following cluster demo is based on the recommendation local training demo in PaddlePaddle&#8217;s <code class="docutils literal"><span class="pre">demo/recommendation</span></code> directory. The steps below assume you are in the <code class="docutils literal"><span class="pre">paddle/scripts/cluster_train/</span></code> directory.</p>
<div class="section" id="pre-requirements">
<span id="pre-requirements"></span><h2>Pre-requirements<a class="headerlink" href="#pre-requirements" title="Permalink to this headline"></a></h2>
<p>First, install Fabric:</p>
<div class="highlight-bash"><div class="highlight"><pre><span></span>pip install fabric
</pre></div>
</div>
<p>Second, follow the installation scripts to install PaddlePaddle on all nodes, and make sure the demo runs in local mode on each of them. For CUDA-enabled training, we assume that CUDA is installed in <code class="docutils literal"><span class="pre">/usr/local/cuda</span></code>; otherwise, errors about missing CUDA runtime libraries may be reported when the cluster job runs. In short, the local training environment must be fully working before you use these scripts.</p>
<p>Then prepare the same ROOT_DIR directory on all nodes. ROOT_DIR is set in cluster_train/conf.py. Assuming ROOT_DIR = /home/paddle, you can also create a <code class="docutils literal"><span class="pre">paddle</span></code> user account, so that <code class="docutils literal"><span class="pre">paddle.py</span></code> can open ssh connections to all nodes as the <code class="docutils literal"><span class="pre">paddle</span></code> user automatically.</p>
<p>Finally, you can set up ssh mutual trust between all nodes for easy ssh login; otherwise, a <code class="docutils literal"><span class="pre">password</span></code> must be provided at runtime when running <code class="docutils literal"><span class="pre">paddle.py</span></code>.</p>
</div>
<div class="section" id="prepare-job-workspace">
<span id="prepare-job-workspace"></span><h2>Prepare Job Workspace<a class="headerlink" href="#prepare-job-workspace" title="Permalink to this headline"></a></h2>
<p><code class="docutils literal"><span class="pre">Job</span> <span class="pre">workspace</span></code> is defined as a package directory that contains dependency libraries, training data, test data, the model config file, and all other file dependencies.</p>
<p>The <code class="docutils literal"><span class="pre">train/test</span></code> data should be prepared before launching the cluster job. So that train/test data can be placed in a directory different from the workspace, PaddlePaddle locates the data through index files named <code class="docutils literal"><span class="pre">train.list/test.list</span></code>, which are referenced in the model config file. The train/test data therefore also includes the two list files train.list and test.list. All local training demos already provide scripts to help you create these two files, and under normal conditions all nodes in a cluster job handle the files with the same logic.</p>
<p>Generally, you can use the same model file for cluster training as for local training. Keep in mind that the <code class="docutils literal"><span class="pre">batch_size</span></code> set in the <code class="docutils literal"><span class="pre">setting</span></code> function of the model file is the batch size on <code class="docutils literal"><span class="pre">each</span></code> node of the cluster job, not the total batch size, when synchronous SGD is used.</p>
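<p>As a rough illustration of the point above, the sketch below computes how many samples contribute to one synchronous SGD update across the cluster. The helper name <code class="docutils literal"><span class="pre">effective_batch_size</span></code> and the example numbers are hypothetical, and the sketch assumes one trainer per node, each using the same <code class="docutils literal"><span class="pre">batch_size</span></code>.</p>

```python
def effective_batch_size(per_node_batch_size, num_nodes):
    """Total samples per synchronous SGD step across the cluster.

    Assumes one trainer per node, all with the same per-node batch size.
    """
    if per_node_batch_size <= 0 or num_nodes <= 0:
        raise ValueError("both arguments must be positive")
    return per_node_batch_size * num_nodes


# Hypothetical example: batch_size = 1600 in the model file, 2 nodes in HOSTS.
print(effective_batch_size(1600, 2))
```

<p>If you want to keep the total batch size of a local run unchanged, you would divide the local value by the number of nodes before setting it in the model file.</p>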
<p>The following steps are based on the demo/recommendation demo in the demo directory.</p>
<p>Go through the demo/recommendation tutorial up to the <code class="docutils literal"><span class="pre">Train</span></code> section; by then you will have the train/test data and the model configuration file. Finally, use demo/recommendation as the workspace for cluster training.</p>
<p>Your workspace should then look like this:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>.
|-- common_utils.py
|-- data
|   |-- config.json
|   |-- config_generator.py
|   |-- meta.bin
|   |-- meta_config.json
|   |-- meta_generator.py
|   |-- ml-1m
|   |-- ml_data.sh
|   |-- ratings.dat.test
|   |-- ratings.dat.train
|   |-- split.py
|   |-- test.list
|   `-- train.list
|-- dataprovider.py
|-- evaluate.sh
|-- prediction.py
|-- preprocess.sh
|-- requirements.txt
|-- run.sh
`-- trainer_config.py
</pre></div>
</div>
<p>Not all of these files are needed for cluster training, but you do not need to remove the unused ones.</p>
<p><code class="docutils literal"><span class="pre">trainer_config.py</span></code>
The model config file.</p>
<p><code class="docutils literal"><span class="pre">train.list</span></code> and <code class="docutils literal"><span class="pre">test.list</span></code>
File indexes. They store the relative or absolute paths of all train/test data files on the current node.</p>
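<p>If you need to build such an index file yourself, a minimal sketch might look like the following. It assumes the list file holds one path per line. The helper name <code class="docutils literal"><span class="pre">write_file_list</span></code> and the file-name prefix are illustrative and not part of PaddlePaddle; the scripts shipped with each demo remain the reference way to create these files.</p>

```python
import os


def write_file_list(data_dir, list_path, prefix):
    """Write the paths of files in data_dir whose names start with prefix.

    One path per line (assumed format). Returns the number of paths written.
    """
    names = sorted(n for n in os.listdir(data_dir) if n.startswith(prefix))
    with open(list_path, "w") as f:
        for name in names:
            f.write(os.path.join(data_dir, name) + "\n")
    return len(names)


if __name__ == "__main__" and os.path.isdir("data"):
    # Hypothetical usage from inside the workspace directory.
    write_file_list("data", "data/train.list", "ratings.dat.train")
```

<p>Each node would run this against its own copy of the data, since the list describes the files available on the current node.</p>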
<p><code class="docutils literal"><span class="pre">dataprovider.py</span></code>
Reads train/test samples. It is the same as in local training.</p>
<p><code class="docutils literal"><span class="pre">data</span></code>
All files in the data directory are referenced by train.list/test.list, which are in turn referenced by the data provider.</p>
</div>
<div class="section" id="prepare-cluster-job-configuration">
<span id="prepare-cluster-job-configuration"></span><h2>Prepare Cluster Job Configuration<a class="headerlink" href="#prepare-cluster-job-configuration" title="Permalink to this headline"></a></h2>
<p>The options below must be set carefully in cluster_train/conf.py.</p>
<p><code class="docutils literal"><span class="pre">HOSTS</span></code> the hostnames or IP addresses of all nodes that will run the cluster job. You can also add a user and ssh port to the hostname, such as root&#64;192.168.100.17:9090.</p>
<p><code class="docutils literal"><span class="pre">ROOT_DIR</span></code> the root directory under which the job workspace directory is placed.</p>
<p><code class="docutils literal"><span class="pre">PADDLE_NIC</span></code> the name of the NIC (Network Interface Card) used for the cluster communication channel, such as eth0 for Ethernet or ib0 for InfiniBand.</p>
<p><code class="docutils literal"><span class="pre">PADDLE_PORT</span></code> the port number for the cluster communication channel.</p>
<p><code class="docutils literal"><span class="pre">PADDLE_PORTS_NUM</span></code> the number of ports used for the cluster communication channel. If the cluster is small (fewer than 5 to 6 nodes), we recommend a larger value, such as 2 to 8, for better network performance.</p>
<p><code class="docutils literal"><span class="pre">PADDLE_PORTS_NUM_FOR_SPARSE</span></code> the number of ports used for the sparse updater&#8217;s cluster communication channel. If sparse remote update is used, set it in the same way as <code class="docutils literal"><span class="pre">PADDLE_PORTS_NUM</span></code>.</p>
<p><code class="docutils literal"><span class="pre">LD_LIBRARY_PATH</span></code> additional LD_LIBRARY_PATH entries for the cluster job. You can use it to set the CUDA library path.</p>
<p>The default configuration is as follows:</p>
<div class="highlight-python"><div class="highlight"><pre><span></span><span class="n">HOSTS</span> <span class="o">=</span> <span class="p">[</span>
        <span class="s2">&quot;root@192.168.100.17&quot;</span><span class="p">,</span>
        <span class="s2">&quot;root@192.168.100.18&quot;</span><span class="p">,</span>
        <span class="p">]</span>

<span class="sd">&#39;&#39;&#39;</span>
<span class="sd">workspace configuration</span>
<span class="sd">&#39;&#39;&#39;</span>

<span class="c1">#root dir for workspace</span>
<span class="n">ROOT_DIR</span> <span class="o">=</span> <span class="s2">&quot;/home/paddle&quot;</span>

<span class="sd">&#39;&#39;&#39;</span>
<span class="sd">network configuration</span>
<span class="sd">&#39;&#39;&#39;</span>
<span class="c1">#pserver nics</span>
<span class="n">PADDLE_NIC</span> <span class="o">=</span> <span class="s2">&quot;eth0&quot;</span>
<span class="c1">#pserver port</span>
<span class="n">PADDLE_PORT</span> <span class="o">=</span> <span class="mi">7164</span>
<span class="c1">#pserver ports num</span>
<span class="n">PADDLE_PORTS_NUM</span> <span class="o">=</span> <span class="mi">2</span>
<span class="c1">#pserver sparse ports num</span>
<span class="n">PADDLE_PORTS_NUM_FOR_SPARSE</span> <span class="o">=</span> <span class="mi">2</span>

<span class="c1">#environments setting for all processes in cluster job</span>
<span class="n">LD_LIBRARY_PATH</span><span class="o">=</span><span class="s2">&quot;/usr/local/cuda/lib64:/usr/lib64&quot;</span>
</pre></div>
</div>
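<p>To illustrate the <code class="docutils literal"><span class="pre">HOSTS</span></code> entry format described above, the following sketch splits an entry into user, host and port. <code class="docutils literal"><span class="pre">parse_host</span></code> is a hypothetical helper and not the code <code class="docutils literal"><span class="pre">paddle.py</span></code> uses; the default port 22 is an assumption, and IPv6 addresses are not handled.</p>

```python
def parse_host(entry, default_user=None, default_port=22):
    """Split 'user@host:port' into (user, host, port).

    The user and port parts are optional; defaults are assumptions.
    """
    user, host = default_user, entry
    if "@" in entry:
        user, host = entry.split("@", 1)
    port = default_port
    if ":" in host:
        host, port_text = host.rsplit(":", 1)
        port = int(port_text)
    return user, host, port


# Entries in the style shown in this section.
HOSTS = ["root@192.168.100.17:9090", "root@192.168.100.18"]
for entry in HOSTS:
    print(parse_host(entry))
```

<p>A check like this can catch a malformed entry, such as a non-numeric port, before a job is launched.</p>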
<div class="section" id="launching-cluster-job">
<span id="launching-cluster-job"></span><h3>Launching Cluster Job<a class="headerlink" href="#launching-cluster-job" title="Permalink to this headline"></a></h3>
<p><code class="docutils literal"><span class="pre">paddle.py</span></code> provides automatic scripts to start all PaddlePaddle cluster processes on the different nodes. By default, all command line options can be set as <code class="docutils literal"><span class="pre">paddle.py</span></code> command options, and <code class="docutils literal"><span class="pre">paddle.py</span></code> will transparently pass these options on to the lower-level PaddlePaddle processes.</p>
<p><code class="docutils literal"><span class="pre">paddle.py</span></code> provides two dedicated command options for easy job launching.</p>
<p><code class="docutils literal"><span class="pre">job_dispatch_package</span></code> set it to a local <code class="docutils literal"><span class="pre">workspace</span></code> directory; the directory will be dispatched to all nodes listed in conf.py. This is helpful when you frequently modify workspace files, since repeatedly deploying the workspace to multiple nodes by hand is tedious.</p>
<p><code class="docutils literal"><span class="pre">job_workspace</span></code> set it to an already deployed workspace directory; <code class="docutils literal"><span class="pre">paddle.py</span></code> will skip the dispatch stage and launch the cluster job on all nodes directly. This helps avoid the heavy dispatch latency.</p>
<p><code class="docutils literal"><span class="pre">cluster_train/run.sh</span></code> provides a sample command line for running the <code class="docutils literal"><span class="pre">demo/recommendation</span></code> cluster job. Just modify <code class="docutils literal"><span class="pre">job_dispatch_package</span></code> and <code class="docutils literal"><span class="pre">job_workspace</span></code> to point to your own directory, then run:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">sh</span> <span class="n">run</span><span class="o">.</span><span class="n">sh</span>
</pre></div>
</div>
<p>The cluster job will start within a few seconds.</p>
</div>
<div class="section" id="kill-cluster-job">
<span id="kill-cluster-job"></span><h3>Kill Cluster Job<a class="headerlink" href="#kill-cluster-job" title="Permalink to this headline"></a></h3>
<p><code class="docutils literal"><span class="pre">paddle.py</span></code> captures the <code class="docutils literal"><span class="pre">Ctrl</span> <span class="pre">+</span> <span class="pre">C</span></code> SIGINT signal and automatically kills all processes it launched. So just stop <code class="docutils literal"><span class="pre">paddle.py</span></code> to kill the cluster job. You must kill the job manually if the program crashes.</p>
</div>
<div class="section" id="check-cluster-training-result">
<span id="check-cluster-training-result"></span><h3>Check Cluster Training Result<a class="headerlink" href="#check-cluster-training-result" title="Permalink to this headline"></a></h3>
<p>Check the logs in $workspace/log for details. Every node has the same log structure.</p>
<p><code class="docutils literal"><span class="pre">paddle_trainer.INFO</span></code>
It contains almost all internal log output for training, the same as in local training. Check model convergence at runtime here.</p>
<p><code class="docutils literal"><span class="pre">paddle_pserver2.INFO</span></code>
It contains the pserver running log, which can help diagnose distributed errors.</p>
<p><code class="docutils literal"><span class="pre">server.log</span></code>
It contains the stderr and stdout of the pserver process. Check it for errors if training crashes.</p>
<p><code class="docutils literal"><span class="pre">train.log</span></code>
It contains the stderr and stdout of the trainer process. Check it for errors if training crashes.</p>
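<p>When a job crashes, a quick first pass over these files can be scripted. The sketch below collects lines that contain a keyword from the four log files named above. The function name <code class="docutils literal"><span class="pre">find_error_lines</span></code> and the keyword list are assumptions made for illustration; the real log messages may use different wording.</p>

```python
import os

# Log file names taken from the list above.
LOG_FILES = ["paddle_trainer.INFO", "paddle_pserver2.INFO", "server.log", "train.log"]


def find_error_lines(log_dir, keywords=("error", "fatal")):
    """Return (file name, line number, text) for log lines containing a keyword.

    Matching is case-insensitive; the keywords are illustrative guesses.
    """
    hits = []
    for name in LOG_FILES:
        path = os.path.join(log_dir, name)
        if not os.path.isfile(path):
            continue  # not every log file has to exist on every node
        with open(path, errors="replace") as f:
            for number, line in enumerate(f, start=1):
                lowered = line.lower()
                if any(k in lowered for k in keywords):
                    hits.append((name, number, line.rstrip("\n")))
    return hits
```

<p>Running it on each node against that node&#8217;s log directory gives a short list of places to start reading.</p>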
</div>
<div class="section" id="check-model-output">
<span id="check-model-output"></span><h3>Check Model Output<a class="headerlink" href="#check-model-output" title="Permalink to this headline"></a></h3>
<p>After one pass finishes, model files are written to the <code class="docutils literal"><span class="pre">output</span></code> directory on node 0.
The <code class="docutils literal"><span class="pre">nodefile</span></code> in the workspace indicates the node id for the current cluster job.</p>
</div>
</div>
</div>


          </div>
        </div>
      </div>
      <div class="sphinxsidebar" role="navigation" aria-label="main navigation">
        <div class="sphinxsidebarwrapper">
  <h3><a href="../../index.html">Table Of Contents</a></h3>
  <ul>
<li><a class="reference internal" href="#">Cluster Training</a><ul>
<li><a class="reference internal" href="#pre-requirements">Pre-requirements</a></li>
<li><a class="reference internal" href="#prepare-job-workspace">Prepare Job Workspace</a></li>
<li><a class="reference internal" href="#prepare-cluster-job-configuration">Prepare Cluster Job Configuration</a><ul>
<li><a class="reference internal" href="#launching-cluster-job">Launching Cluster Job</a></li>
<li><a class="reference internal" href="#kill-cluster-job">Kill Cluster Job</a></li>
<li><a class="reference internal" href="#check-cluster-training-result">Check Cluster Training Result</a></li>
<li><a class="reference internal" href="#check-model-output">Check Model Output</a></li>
</ul>
</li>
</ul>
</li>
</ul>

  <h4>Previous topic</h4>
  <p class="topless"><a href="../index.html"
                        title="previous chapter">Cluster Train</a></p>
  <h4>Next topic</h4>
  <p class="topless"><a href="../../layer.html"
                        title="next chapter">Layer Documents</a></p>
  <div role="note" aria-label="source link">
    <h3>This Page</h3>
    <ul class="this-page-menu">
      <li><a href="../../_sources/cluster/opensource/cluster_train.txt"
            rel="nofollow">Show Source</a></li>
    </ul>
   </div>
<div id="searchbox" style="display: none" role="search">
  <h3>Quick search</h3>
    <form class="search" action="../../search.html" method="get">
      <div><input type="text" name="q" /></div>
      <div><input type="submit" value="Go" /></div>
      <input type="hidden" name="check_keywords" value="yes" />
      <input type="hidden" name="area" value="default" />
    </form>
</div>
<script type="text/javascript">$('#searchbox').show(0);</script>
        </div>
      </div>
      <div class="clearer"></div>
    </div>
    <div class="related" role="navigation" aria-label="related navigation">
      <h3>Navigation</h3>
      <ul>
        <li class="right" style="margin-right: 10px">
          <a href="../../genindex.html" title="General Index"
             >index</a></li>
        <li class="right" >
          <a href="../../py-modindex.html" title="Python Module Index"
             >modules</a> |</li>
        <li class="right" >
          <a href="../../layer.html" title="Layer Documents"
             >next</a> |</li>
        <li class="right" >
          <a href="../index.html" title="Cluster Train"
             >previous</a> |</li>
        <li class="nav-item nav-item-0"><a href="../../index.html">PaddlePaddle  documentation</a> &#187;</li>
          <li class="nav-item nav-item-1"><a href="../index.html" >Cluster Train</a> &#187;</li> 
      </ul>
    </div>
    <div class="footer" role="contentinfo">
        &#169; Copyright 2016, PaddlePaddle developers.
      Created using <a href="http://sphinx-doc.org/">Sphinx</a> 1.4.6.
    </div>
  </body>
</html>