

<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
  <meta charset="utf-8">
  
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  
  <title>Submit a Distributed Training Job &mdash; PaddlePaddle  documentation</title>
  

  
  

  

  
  
    

  

  
  
    <link rel="stylesheet" href="../../_static/css/theme.css" type="text/css" />
  

  
  
        <link rel="index" title="Index"
              href="../../genindex.html"/>
        <link rel="search" title="Search" href="../../search.html"/>
    <link rel="top" title="PaddlePaddle  documentation" href="../../index.html"/> 

  
  <script src="../../_static/js/modernizr.min.js"></script>

</head>

<body class="wy-body-for-nav" role="document">

  <div class="wy-grid-for-nav">

    
    <nav data-toggle="wy-nav-shift" class="wy-nav-side">
      <div class="wy-side-scroll">
        <div class="wy-side-nav-search">
          

          
            <a href="../../index_en.html" class="icon icon-home"> PaddlePaddle
          

          
          </a>

          
            
            
          

          
<div role="search">
  <form id="rtd-search-form" class="wy-form" action="../../search.html" method="get">
    <input type="text" name="q" placeholder="Search docs" />
    <input type="hidden" name="check_keywords" value="yes" />
    <input type="hidden" name="area" value="default" />
  </form>
</div>

          
        </div>

        <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
          
            
            
                <ul>
<li class="toctree-l1"><a class="reference internal" href="../../getstarted/index_en.html">GET STARTED</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../build_and_install/index_en.html">Install and Build</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../howto/index_en.html">HOW TO</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../dev/index_en.html">Development</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../faq/index_en.html">FAQ</a></li>
</ul>

            
          
        </div>
      </div>
    </nav>

    <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">

      
      <nav class="wy-nav-top" role="navigation" aria-label="top navigation">
        <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
        <a href="../../index_en.html">PaddlePaddle</a>
      </nav>


      
      <div class="wy-nav-content">
        <div class="rst-content">
          



          <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
           <div itemprop="articleBody">
            
  <div class="section" id="submit-a-distributed-training-job">
<span id="submit-a-distributed-training-job"></span><h1>Submit a Distributed Training Job<a class="headerlink" href="#submit-a-distributed-training-job" title="Permalink to this headline"></a></h1>
<p>The user can submit a distributed training job with Python code, rather than with a command-line interface.</p>
<div class="section" id="runtime-environment-on-kubernetes">
<span id="runtime-environment-on-kubernetes"></span><h2>Runtime Environment On Kubernetes<a class="headerlink" href="#runtime-environment-on-kubernetes" title="Permalink to this headline"></a></h2>
<p>A distributed training job uses two Docker images: the <em>runtime Docker image</em> and the <em>base Docker image</em>. The runtime Docker image is the image that Kubernetes schedules to run during training. The base Docker image is used to build the runtime Docker image.</p>
<div class="section" id="base-docker-image">
<span id="base-docker-image"></span><h3>Base Docker Image<a class="headerlink" href="#base-docker-image" title="Permalink to this headline"></a></h3>
<p>Usually, the base Docker image is the PaddlePaddle production Docker image, which includes the paddle binary files and the Python package. Users can also specify any image hosted on any Docker registry that they have access to.</p>
</div>
<div class="section" id="runtime-docker-image">
<span id="runtime-docker-image"></span><h3>Runtime Docker Image<a class="headerlink" href="#runtime-docker-image" title="Permalink to this headline"></a></h3>
<p>The trainer package that the user uploads, together with its Python dependencies, is packaged into a runtime Docker image built on top of the base Docker image.</p>
<ul>
<li><p class="first">Handle Python Dependencies</p>
<p>You need to provide a <code class="docutils literal"><span class="pre">requirements.txt</span></code> file in your <code class="docutils literal"><span class="pre">trainer-package</span></code> folder. Example:</p>
<div class="highlight-txt"><div class="highlight"><pre><span></span>pillow
protobuf==3.1.0
</pre></div>
</div>
<p>See the pip documentation for more <a class="reference external" href="https://pip.readthedocs.io/en/1.1/requirements.html">details</a> about requirements files. An example project looks like:</p>
<div class="highlight-bash"><div class="highlight"><pre><span></span>  paddle_example
    <span class="p">|</span>-quick_start
      <span class="p">|</span>-trainer.py
      <span class="p">|</span>-dataset.py
      <span class="p">|</span>-requirements.txt
</pre></div>
</div>
</li>
</ul>
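<p>As a rough illustration of the layout above, the following sketch checks that a trainer package folder contains a <code class="docutils literal"><span class="pre">requirements.txt</span></code> file and lists its entries. The helper name <code class="docutils literal"><span class="pre">read_requirements</span></code> is hypothetical and not part of PaddlePaddle, and the comment handling is deliberately naive (it would also cut a URL fragment).</p>

```python
import tempfile
from pathlib import Path


def read_requirements(package_dir):
    """Return the entries listed in package_dir/requirements.txt (hypothetical helper)."""
    req_file = Path(package_dir) / "requirements.txt"
    if not req_file.is_file():
        raise FileNotFoundError("missing requirements.txt in %s" % package_dir)
    specs = []
    for line in req_file.read_text().splitlines():
        # Naive cleanup: cut everything after '#', then trim whitespace.
        line = line.split("#", 1)[0].strip()
        if line:
            specs.append(line)
    return specs


# Build a throwaway package that mirrors the quick_start layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    package = Path(tmp) / "quick_start"
    package.mkdir()
    (package / "trainer.py").write_text("# training code\n")
    (package / "dataset.py").write_text("# dataset code\n")
    (package / "requirements.txt").write_text("pillow\nprotobuf==3.1.0\n")
    requirements = read_requirements(package)

print(requirements)
```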
</div>
</div>
<div class="section" id="submit-distributed-training-job-with-python-code">
<span id="submit-distributed-training-job-with-python-code"></span><h2>Submit Distributed Training Job With Python Code<a class="headerlink" href="#submit-distributed-training-job-with-python-code" title="Permalink to this headline"></a></h2>
<p><img src="./src/submit-job.png" width="800"></p>
<ul class="simple">
<li><code class="docutils literal"><span class="pre">paddle.job.dist_train()</span></code> calls the Job Server API <code class="docutils literal"><span class="pre">/v1/packages</span></code> to upload the trainer package and save it on CephFS, and then calls <code class="docutils literal"><span class="pre">/v1/trainer/job</span></code> to submit the PaddlePaddle distributed job.</li>
<li><code class="docutils literal"><span class="pre">/v1/trainer/job</span></code> starts a build job that prepares the runtime Docker image. When the build job finishes, the Job Server submits the PaddlePaddle distributed job to Kubernetes.</li>
<li><em>NOTE</em>: The first version will not prepare the runtime Docker image. Instead, the package is uploaded to Paddle Cloud, and Paddle Cloud mounts the package in a temporary folder in the base Docker image. The first version will not support custom Python dependencies either.</li>
</ul>
<p>You can call <code class="docutils literal"><span class="pre">paddle.job.dist_train</span></code> and pass the distributed training configuration as parameters:</p>
<div class="highlight-python"><div class="highlight"><pre><span></span><span class="n">paddle</span><span class="o">.</span><span class="n">job</span><span class="o">.</span><span class="n">dist_train</span><span class="p">(</span>
  <span class="n">trainer</span><span class="o">=</span><span class="n">dist_trainer</span><span class="p">(),</span>
  <span class="n">paddle_job</span><span class="o">=</span><span class="n">PaddleJob</span><span class="p">(</span>
    <span class="n">job_name</span> <span class="o">=</span> <span class="s2">&quot;paddle-cloud&quot;</span><span class="p">,</span>
    <span class="n">entry_point</span> <span class="o">=</span> <span class="s2">&quot;python </span><span class="si">%s</span><span class="s2">&quot;</span><span class="o">%</span><span class="vm">__file__</span><span class="p">,</span>
    <span class="n">trainer_package</span> <span class="o">=</span> <span class="s2">&quot;/example/word2vec&quot;</span><span class="p">,</span>
    <span class="n">image</span> <span class="o">=</span> <span class="s2">&quot;yancey1989/paddle-job&quot;</span><span class="p">,</span>
    <span class="n">trainers</span> <span class="o">=</span> <span class="mi">10</span><span class="p">,</span>
    <span class="n">pservers</span> <span class="o">=</span> <span class="mi">3</span><span class="p">,</span>
    <span class="n">trainer_cpu</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span>
    <span class="n">trainer_gpu</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span>
    <span class="n">trainer_mem</span> <span class="o">=</span> <span class="s2">&quot;10G&quot;</span><span class="p">,</span>
    <span class="n">pserver_cpu</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span>
    <span class="n">pserver_mem</span> <span class="o">=</span> <span class="s2">&quot;2G&quot;</span>
  <span class="p">))</span>
</pre></div>
</div>
<p>The parameter <code class="docutils literal"><span class="pre">trainer</span></code> of <code class="docutils literal"><span class="pre">paddle.job.dist_train</span></code> is a function and you can implement it as follows:</p>
<div class="highlight-python"><div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">dist_trainer</span><span class="p">():</span>
  <span class="k">def</span> <span class="nf">trainer_creator</span><span class="p">():</span>
    <span class="n">trainer</span> <span class="o">=</span> <span class="n">paddle</span><span class="o">.</span><span class="n">v2</span><span class="o">.</span><span class="n">trainer</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="o">...</span><span class="p">)</span>
    <span class="n">trainer</span><span class="o">.</span><span class="n">train</span><span class="p">(</span><span class="o">...</span><span class="p">)</span>
  <span class="k">return</span> <span class="n">trainer_creator</span>
</pre></div>
</div>
<p>The pseudo code of <code class="docutils literal"><span class="pre">paddle.job.dist_train</span></code> is as follows:</p>
<div class="highlight-python"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">os</span>

<span class="k">def</span> <span class="nf">dist_train</span><span class="p">(</span><span class="n">trainer</span><span class="p">,</span> <span class="n">paddle_job</span><span class="p">):</span>
  <span class="c1"># RUNNING_ON_CLOUD is expected to be set to YES when the code runs on the cloud</span>
  <span class="k">if</span> <span class="n">os</span><span class="o">.</span><span class="n">getenv</span><span class="p">(</span><span class="s2">&quot;RUNNING_ON_CLOUD&quot;</span><span class="p">,</span> <span class="s2">&quot;NO&quot;</span><span class="p">)</span> <span class="o">==</span> <span class="s2">&quot;NO&quot;</span><span class="p">:</span>
    <span class="c1"># running locally: submit the paddle job</span>
    <span class="n">paddle_job</span><span class="o">.</span><span class="n">submit</span><span class="p">()</span>
  <span class="k">else</span><span class="p">:</span>
    <span class="c1"># running on the cloud: start the training</span>
    <span class="n">trainer</span><span class="p">()</span>
</pre></div>
</div>
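<p>To make the branching easy to try without a cluster, here is a minimal sketch under stated assumptions. <code class="docutils literal"><span class="pre">FakeJob</span></code> is a hypothetical stand-in for <code class="docutils literal"><span class="pre">PaddleJob</span></code>, and the extra <code class="docutils literal"><span class="pre">environ</span></code> argument and the return values are additions made only so that both paths can be exercised in one process.</p>

```python
import os


class FakeJob:
    """Hypothetical stand-in for PaddleJob; it only records that submit() ran."""

    def __init__(self):
        self.submitted = False

    def submit(self):
        self.submitted = True


def dist_train(trainer, paddle_job, environ=None):
    """Submit the job when running locally, otherwise start the training."""
    environ = os.environ if environ is None else environ
    if environ.get("RUNNING_ON_CLOUD", "NO") == "NO":
        paddle_job.submit()
        return "submitted"
    trainer()
    return "trained"


calls = []

# Local run: the variable is absent, so the job is submitted.
local_job = FakeJob()
local_result = dist_train(lambda: calls.append("train"), local_job, environ={})

# Cloud run: the variable is YES, so the trainer function is called.
cloud_job = FakeJob()
cloud_result = dist_train(
    lambda: calls.append("train"), cloud_job, environ={"RUNNING_ON_CLOUD": "YES"}
)

print(local_result, cloud_result, calls)
```

<p>The same script therefore serves as both the submission client and the training entry point, which is why <code class="docutils literal"><span class="pre">entry_point</span></code> in the example above points back at <code class="docutils literal"><span class="pre">__file__</span></code>.</p>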
<div class="section" id="paddlejob-parameters">
<span id="paddlejob-parameters"></span><h3>PaddleJob Parameters<a class="headerlink" href="#paddlejob-parameters" title="Permalink to this headline"></a></h3>
<table border="1" class="docutils">
<thead>
<tr><th>parameter</th><th>type</th><th>explanation</th></tr>
</thead>
<tbody>
<tr><td>job_name</td><td>str</td><td>the unique name of the training job</td></tr>
<tr><td>entry_point</td><td>str</td><td>the entry point used to start the trainer process</td></tr>
<tr><td>trainer_package</td><td>str</td><td>path of the trainer package; the user must have access to it</td></tr>
<tr><td>image</td><td>str</td><td>the <a class="reference external" href="#base-docker-image">base image</a> for building the <a class="reference external" href="#runtime-docker-image">runtime image</a></td></tr>
<tr><td>pservers</td><td>int</td><td>Parameter Server process count</td></tr>
<tr><td>trainers</td><td>int</td><td>Trainer process count</td></tr>
<tr><td>pserver_cpu</td><td>int</td><td>CPU count for each Parameter Server process</td></tr>
<tr><td>pserver_mem</td><td>str</td><td>memory allocated for each Parameter Server process: a plain integer followed by one of these suffixes: E, P, T, G, M, K</td></tr>
<tr><td>trainer_cpu</td><td>int</td><td>CPU count for each Trainer process</td></tr>
<tr><td>trainer_mem</td><td>str</td><td>memory allocated for each Trainer process: a plain integer followed by one of these suffixes: E, P, T, G, M, K</td></tr>
<tr><td>trainer_gpu</td><td>int</td><td>GPU count for each Trainer process; if you only want CPU, do not set this parameter</td></tr>
</tbody>
</table>
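<p>The parameters can be pictured as a small record with a check on the memory strings. The sketch below is an assumption-laden illustration, not the actual <code class="docutils literal"><span class="pre">PaddleJob</span></code> class: the names <code class="docutils literal"><span class="pre">JobSpec</span></code> and <code class="docutils literal"><span class="pre">is_valid_memory</span></code> are hypothetical, and using <code class="docutils literal"><span class="pre">0</span></code> as the default for <code class="docutils literal"><span class="pre">trainer_gpu</span></code> is only one way to model "not set".</p>

```python
import re
from dataclasses import dataclass

# Suffixes taken from the parameter list: E, P, T, G, M, K.
_MEMORY_PATTERN = re.compile(r"[0-9]+[EPTGMK]")


def is_valid_memory(value):
    """Return True for a plain integer followed by exactly one allowed suffix."""
    return isinstance(value, str) and _MEMORY_PATTERN.fullmatch(value) is not None


@dataclass
class JobSpec:
    """Hypothetical container that mirrors the PaddleJob parameters."""

    job_name: str
    entry_point: str
    trainer_package: str
    image: str
    trainers: int
    pservers: int
    trainer_cpu: int
    trainer_mem: str
    pserver_cpu: int
    pserver_mem: str
    trainer_gpu: int = 0  # assumption: 0 models "parameter not set" (CPU only)

    def __post_init__(self):
        # Reject memory strings such as "10GB" or "1.5G" early.
        for field_name in ("trainer_mem", "pserver_mem"):
            value = getattr(self, field_name)
            if not is_valid_memory(value):
                raise ValueError("%s=%r is not like '10G'" % (field_name, value))


spec = JobSpec(
    job_name="paddle-cloud",
    entry_point="python train.py",
    trainer_package="/example/word2vec",
    image="yancey1989/paddle-job",
    trainers=10,
    pservers=3,
    trainer_cpu=1,
    trainer_mem="10G",
    pserver_cpu=1,
    pserver_mem="2G",
    trainer_gpu=1,
)
print(spec.job_name, spec.trainer_mem, spec.pserver_mem)
```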
</div>
<div class="section" id="deploy-parameter-server-trainer-and-master-process">
<span id="deploy-parameter-server-trainer-and-master-process"></span><h3>Deploy Parameter Server, Trainer and Master Process<a class="headerlink" href="#deploy-parameter-server-trainer-and-master-process" title="Permalink to this headline"></a></h3>
<ul class="simple">
<li>Deploy the PaddlePaddle Parameter Server processes as a Kubernetes ReplicaSet.</li>
<li>Deploy the PaddlePaddle Trainer processes as a Kubernetes Job.</li>
<li>Deploy the PaddlePaddle Master processes as a Kubernetes ReplicaSet.</li>
</ul>
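<p>The three deployments could be described with manifests along the following lines. This is a sketch only: the resource names, the labels, the use of a single image for every process, and the single master replica are all assumptions, and the manifests are built as plain dictionaries rather than sent to a cluster.</p>

```python
def replica_set(name, image, replicas):
    """Build a minimal ReplicaSet manifest as a plain dict."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "ReplicaSet",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


def trainer_job(name, image, trainers):
    """Build a minimal Job manifest that runs all trainers in parallel."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "parallelism": trainers,
            "completions": trainers,
            "template": {
                "spec": {
                    "containers": [{"name": name, "image": image}],
                    "restartPolicy": "Never",
                }
            },
        },
    }


def build_manifests(job_name, image, pservers, trainers):
    """Return the manifests in the order listed above."""
    return [
        replica_set(job_name + "-pserver", image, pservers),
        trainer_job(job_name + "-trainer", image, trainers),
        replica_set(job_name + "-master", image, 1),  # assumption: one master
    ]


manifests = build_manifests("paddle-cloud", "yancey1989/paddle-job", 3, 10)
kinds = [m["kind"] for m in manifests]
print(kinds)
```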
</div>
</div>
<div class="section" id="job-server">
<span id="job-server"></span><h2>Job Server<a class="headerlink" href="#job-server" title="Permalink to this headline"></a></h2>
<ul>
<li><p class="first">RESTful API</p>
<p>The Job Server provides a RESTful HTTP API for receiving the trainer package and displaying
information about PaddlePaddle jobs.</p>
<ul class="simple">
<li><code class="docutils literal"><span class="pre">POST</span> <span class="pre">/v1/package</span></code> receive the trainer package and save it on CephFS</li>
<li><code class="docutils literal"><span class="pre">POST</span> <span class="pre">/v1/trainer/job</span></code> submit a trainer job</li>
<li><code class="docutils literal"><span class="pre">GET</span> <span class="pre">/v1/jobs/</span></code> list all jobs</li>
<li><code class="docutils literal"><span class="pre">GET</span> <span class="pre">/v1/jobs/&lt;job-name&gt;</span></code> get the status of a job</li>
<li><code class="docutils literal"><span class="pre">DELETE</span> <span class="pre">/v1/jobs/&lt;job-name&gt;</span></code> delete a job</li>
<li><code class="docutils literal"><span class="pre">GET</span> <span class="pre">/v1/version</span></code> get the Job Server version</li>
</ul>
</li>
<li><p class="first">Build Runtime Docker Image on Kubernetes</p>
<p><code class="docutils literal"><span class="pre">paddle.job.dist_train</span></code> uploads the trainer package to the Job Server, which saves it on the distributed filesystem and then starts a job to build the runtime Docker image that Kubernetes schedules to run during training.</p>
<p>Building the runtime Docker image on the Job Server has several benefits:</p>
<ul class="simple">
<li>On Paddle Cloud, users run the trainer code in a Jupyter Notebook, which is a Kubernetes Pod. To execute <code class="docutils literal"><span class="pre">docker</span> <span class="pre">build</span></code> in the Pod, we would have to mount the host&#8217;s <code class="docutils literal"><span class="pre">docker.sock</span></code> into the Pod, so the user&#8217;s code would connect to the host&#8217;s Docker Engine directly, which is not safe.</li>
<li>Users only need to upload the training package files; they do not need to install Docker Engine or a Docker registry as dependencies.</li>
<li>If we switch to another container image type, such as rkt, users do not need to care about it.</li>
</ul>
</li>
<li><p class="first">Deploy Parameter Server, Trainer and Master Processes</p>
<p><code class="docutils literal"><span class="pre">POST</span> <span class="pre">/v1/trainer/job</span></code> receives the distributed training parameters and deploys the job as follows:</p>
<ul class="simple">
<li>Deploy the PaddlePaddle Parameter Server processes as a Kubernetes ReplicaSet.</li>
<li>Deploy the PaddlePaddle Trainer processes as a Kubernetes Job.</li>
<li>Deploy the PaddlePaddle Master processes as a Kubernetes ReplicaSet.</li>
</ul>
</li>
</ul>
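<p>For orientation, the RESTful endpoints listed in this section can be collected into a small routing table on the client side. The sketch below only builds the HTTP method and URL for each call and sends nothing; the base address, the action names, and the helper <code class="docutils literal"><span class="pre">build_request</span></code> are hypothetical, while the paths are copied from the endpoint list in this section.</p>

```python
from urllib.parse import quote, urljoin

# Hypothetical address; the document does not say where the Job Server listens.
BASE_URL = "http://job-server.example.com"

# Action names are illustrative; methods and paths come from the endpoint list.
ROUTES = {
    "upload_package": ("POST", "/v1/package"),
    "submit_job": ("POST", "/v1/trainer/job"),
    "list_jobs": ("GET", "/v1/jobs/"),
    "job_status": ("GET", "/v1/jobs/{job_name}"),
    "delete_job": ("DELETE", "/v1/jobs/{job_name}"),
    "version": ("GET", "/v1/version"),
}


def build_request(action, job_name=None):
    """Return the (method, url) pair for one Job Server action."""
    method, template = ROUTES[action]
    if "{job_name}" in template:
        if job_name is None:
            raise ValueError("%s needs a job name" % action)
        # Percent-encode the name so it is safe as a single path segment.
        path = template.format(job_name=quote(job_name, safe=""))
    else:
        path = template
    return method, urljoin(BASE_URL, path)


status_request = build_request("job_status", "paddle-cloud")
list_request = build_request("list_jobs")
print(status_request)
print(list_request)
```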
</div>
</div>


           </div>
          </div>
          <footer>
  

  <hr/>

  <div role="contentinfo">
    <p>
        &copy; Copyright 2016, PaddlePaddle developers.

    </p>
  </div>
  Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>. 

</footer>

        </div>
      </div>

    </section>

  </div>
  


  

    <script type="text/javascript">
        var DOCUMENTATION_OPTIONS = {
            URL_ROOT:'../../',
            VERSION:'',
            COLLAPSE_INDEX:false,
            FILE_SUFFIX:'.html',
            HAS_SOURCE:  true
        };
    </script>
      <script type="text/javascript" src="../../_static/jquery.js"></script>
      <script type="text/javascript" src="../../_static/underscore.js"></script>
      <script type="text/javascript" src="../../_static/doctools.js"></script>
      <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
  

  
  
    <script type="text/javascript" src="../../_static/js/theme.js"></script>
  
  
  <script type="text/javascript">
      jQuery(function () {
          SphinxRtdTheme.StickyNav.enable();
      });
  </script>
   

</body>
</html>