<spanid="design-doc-save-model"></span><h1>Design Doc: Save Model<aclass="headerlink"href="#design-doc-save-model"title="Permalink to this headline">¶</a></h1>
<divclass="section"id="overview">
<spanid="overview"></span><h2>Overview<aclass="headerlink"href="#overview"title="Permalink to this headline">¶</a></h2>
<p>The model is the output of the training process. There are two
ways a user can obtain a model:</p>
<ulclass="simple">
<li>Save model triggered by user code: user code asks PaddlePaddle to
save a model.</li>
<li>Convert model from the checkpoint: the model is converted from the
pservers’ periodic checkpoint. In this way, the user can cancel a
job at any time, and still have a relatively fresh model (we
<spanid="trainer-saving-model-vs-pservers-saving-model"></span><h3>Trainer Saving Model vs. Pservers Saving Model<aclass="headerlink"href="#trainer-saving-model-vs-pservers-saving-model"title="Permalink to this headline">¶</a></h3>
<p>Both trainers and pservers have access to the model, so the model can
be saved from either a trainer or the pservers. We need to decide where the model
<spanid="dense-update-vs-sparse-update"></span><h4>Dense Update vs. Sparse Update<aclass="headerlink"href="#dense-update-vs-sparse-update"title="Permalink to this headline">¶</a></h4>
<p>There are two types of model update methods: dense update and sparse
update (when the model parameter is configured to be sparse).</p>
<ul>
<li><pclass="first">Dense update</p>
<p>Every trainer has its own full copy of the model. Every model
update will update the entire model.</p>
</li>
<li><pclass="first">Sparse update</p>
<p>The training input is sparse, and the trainer does not have the
entire model. It only downloads the sub-model related to the input.
When updating the model, only the sub-model related to
the training input is updated.</p>
</li>
</ul>
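<p>To make the distinction concrete, here is a minimal sketch using a toy
model stored as a dictionary of rows. The function names, the row layout, and
the learning rate are assumptions for illustration only and are not part of
PaddlePaddle.</p>

```python
# Hypothetical toy model: row id -> list of weights.
# None of these names come from PaddlePaddle.

def dense_update(model, gradients, lr):
    """Apply a gradient to every row, so the whole model is touched."""
    for row_id, grad in gradients.items():
        model[row_id] = [w - lr * g for w, g in zip(model[row_id], grad)]
    return set(gradients)


def sparse_update(model, gradients, input_ids, lr):
    """Apply gradients only to the rows referenced by the training input."""
    touched = set()
    for row_id in input_ids:
        grad = gradients[row_id]
        model[row_id] = [w - lr * g for w, g in zip(model[row_id], grad)]
        touched.add(row_id)
    return touched


model = {0: [1.0, 1.0], 1: [2.0, 2.0], 2: [3.0, 3.0]}
grads = {0: [0.5, 0.5], 1: [0.5, 0.5], 2: [0.5, 0.5]}

# Dense: every row changes.
dense_touched = dense_update(dict(model), grads, lr=1.0)

# Sparse: only row 1 appears in the input, so only row 1 changes.
sparse_model = dict(model)
sparse_touched = sparse_update(sparse_model, grads, input_ids=[1], lr=1.0)
```

<p>In the sparse case a trainer only ever sees the rows it touched, which is
why it cannot save the full model without downloading the rest first.</p>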
</div>
<divclass="section"id="pservers-saving-model">
<spanid="pservers-saving-model"></span><h4>Pservers Saving Model<aclass="headerlink"href="#pservers-saving-model"title="Permalink to this headline">¶</a></h4>
<p>The benefit of letting the pservers save the model is that they have the
entire model at all times. However, since the pservers are on different nodes,
a merging process is required to merge the model shards into a single
model. This in turn requires the pservers to write their shards to a distributed
filesystem, making the checkpoint shards visible to the merge program.</p>
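<p>The merge step could look roughly like the sketch below. It assumes each
shard is a mapping from parameter name to values and that every parameter
lives on exactly one pserver; both are simplifying assumptions, and
<code>merge_shards</code> is a hypothetical helper, not an existing
PaddlePaddle function.</p>

```python
def merge_shards(shards):
    """Merge per-pserver shards (parameter name -> values) into one model.

    Raises ValueError if two shards claim the same parameter, since this
    sketch assumes each parameter lives on exactly one pserver.
    """
    merged = {}
    for shard in shards:
        for name, values in shard.items():
            if name in merged:
                raise ValueError(
                    "parameter %r appears in more than one shard" % name
                )
            merged[name] = values
    return merged


# Shards as they might be read back from a shared filesystem.
shard_a = {"fc1.w": [0.1, 0.2]}
shard_b = {"fc1.b": [0.0], "fc2.w": [0.3]}
full_model = merge_shards([shard_a, shard_b])
```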
</div>
<divclass="section"id="trainer-saving-model">
<spanid="trainer-saving-model"></span><h4>Trainer Saving Model<aclass="headerlink"href="#trainer-saving-model"title="Permalink to this headline">¶</a></h4>
<p>The benefit of letting one trainer save the model is that it does not
require a distributed filesystem. It also reuses the same save-model
logic used when training locally, with one exception: when doing sparse update, the
trainer needs to download the entire model during the saving process.</p>
</div>
<divclass="section"id="conclusion">
<spanid="conclusion"></span><h4>Conclusion<aclass="headerlink"href="#conclusion"title="Permalink to this headline">¶</a></h4>
<p>Given that letting the trainer save the model does not require a distributed
filesystem, and is an intuitive extension of the trainer saving the model when
training locally, we decide to let the trainer save the model when doing
<spanid="convert-model-from-checkpoint"></span><h3>Convert Model from Checkpoint<aclass="headerlink"href="#convert-model-from-checkpoint"title="Permalink to this headline">¶</a></h3>
<p>TODO</p>
</div>
</div>
<divclass="section"id="timeline">
<spanid="timeline"></span><h2>Timeline<aclass="headerlink"href="#timeline"title="Permalink to this headline">¶</a></h2>
<p>We will first implement saving the model from the trainer. Converting the
latest snapshot to a model is a TODO for the future.</p>
</div>
<divclass="section"id="trainer-save-model">
<spanid="trainer-save-model"></span><h2>Trainer Save Model<aclass="headerlink"href="#trainer-save-model"title="Permalink to this headline">¶</a></h2>
<divclass="section"id="trainer-election">
<spanid="trainer-election"></span><h3>Trainer Election<aclass="headerlink"href="#trainer-election"title="Permalink to this headline">¶</a></h3>
<p>One trainer will be elected as the one to save the model. When using
etcd, the trainer ID is a randomly generated UUID, and we will utilize etcd to
elect one trainer. When not using etcd, unique trainer IDs will be
given by the administrator, and the trainer whose ID is “0” is elected to
save the model.</p>
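<p>The two election modes can be sketched as follows. The in-memory store
below is only a stand-in for the one property the election needs from etcd (an
atomic create-if-absent); it is not the etcd client API, and the key name
<code>save-model-leader</code> is a hypothetical choice.</p>

```python
import threading
import uuid


class InMemoryStore:
    """Stand-in for a key-value store with an atomic create-if-absent.

    This is NOT the etcd API; it only mimics the property the election needs.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def put_if_absent(self, key, value):
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True


def elect_with_store(store, trainer_id, key="save-model-leader"):
    """The first trainer to create the key is elected (hypothetical key)."""
    return store.put_if_absent(key, trainer_id)


def elect_without_store(trainer_id):
    """Without etcd, the trainer whose assigned ID is "0" saves the model."""
    return trainer_id == "0"


store = InMemoryStore()
trainer_ids = [str(uuid.uuid4()) for _ in range(3)]
winners = [t for t in trainer_ids if elect_with_store(store, t)]
static_winners = [t for t in ["0", "1", "2"] if elect_without_store(t)]
```

<p>In both modes exactly one trainer in this sketch ends up elected. A real
deployment would also need to handle the elected trainer failing, which this
sketch does not cover.</p>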
</div>
<divclass="section"id="model-save-path">
<spanid="model-save-path"></span><h3>Model Save Path<aclass="headerlink"href="#model-save-path"title="Permalink to this headline">¶</a></h3>
<p>Each trainer will be given the directory to save the model. The
elected trainer will save the model to
<codeclass="docutils literal"><spanclass="pre">given-directory/trainerID</span></code>. Since the trainer ID is unique, this
prevents concurrent saves to the same file if multiple trainers
are elected to save the model when a split-brain problem happens.</p>
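<p>A minimal sketch of the path scheme is shown below. The helper name
<code>model_save_dir</code>, the trainer IDs, and the file name
<code>model.bin</code> are assumptions for illustration; the design only
specifies the <code>given-directory/trainerID</code> layout.</p>

```python
import os
import tempfile


def model_save_dir(given_directory, trainer_id):
    """Return given-directory/trainerID, creating it if needed."""
    path = os.path.join(given_directory, trainer_id)
    os.makedirs(path, exist_ok=True)
    return path


base = tempfile.mkdtemp()

# Two trainers that both believe they are elected (split-brain case).
path_a = model_save_dir(base, "trainer-a")
path_b = model_save_dir(base, "trainer-b")

for path, payload in ((path_a, b"model from a"), (path_b, b"model from b")):
    # "model.bin" is a hypothetical file name.
    with open(os.path.join(path, "model.bin"), "wb") as f:
        f.write(payload)
```

<p>Because each trainer writes under its own directory, neither save
overwrites the other even when both run at the same time.</p>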
<spanid="what-happens-when-model-is-saving"></span><h3>What Happens When Model Is Saving<aclass="headerlink"href="#what-happens-when-model-is-saving"title="Permalink to this headline">¶</a></h3>
<p>Saving a model takes some time, so we need to define what happens
while the model is being saved.</p>
<p>When doing dense update, the trainer uses the local model. Pservers
do not need to pause model updates.</p>
<p>When doing sparse update, the trainer needs to download the entire
model while saving. To get the most accurate model, the model update
needs to be paused before the download starts and resumed after the
download finishes. Otherwise, the trainer gets a model that is
“polluted”: some parts of the model are old and some parts are
new.</p>
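<p>The effect of pausing can be illustrated with a small simulation. The
<code>ToyPserver</code> class, its version counters, and the
<code>updates_paused</code> context manager are all hypothetical; they only
model the ordering of events, not any real pserver interface.</p>

```python
import contextlib


class ToyPserver:
    """Hypothetical pserver holding named parameter blocks with a version."""

    def __init__(self, blocks):
        self.blocks = dict.fromkeys(blocks, 0)  # block name -> version
        self.paused = False

    def apply_update(self):
        """Bump every block's version unless updates are paused."""
        if self.paused:
            return False
        for name in self.blocks:
            self.blocks[name] += 1
        return True


@contextlib.contextmanager
def updates_paused(pserver):
    """Pause updates for the duration of the block, then resume."""
    pserver.paused = True
    try:
        yield
    finally:
        pserver.paused = False


def download(pserver, update_between_blocks):
    """Download block by block; optionally attempt an update mid-download."""
    snapshot = {}
    for i, name in enumerate(sorted(pserver.blocks)):
        snapshot[name] = pserver.blocks[name]
        if update_between_blocks and i == 0:
            pserver.apply_update()
    return snapshot


# Without pausing, an update lands between the two block downloads.
unpaused = download(ToyPserver(["a", "b"]), update_between_blocks=True)

# With pausing, the same update attempt is rejected during the download.
server = ToyPserver(["a", "b"])
with updates_paused(server):
    paused = download(server, update_between_blocks=True)
```

<p>In the unpaused run the snapshot mixes block versions, which is the
“polluted” case; in the paused run every block comes from the same
version.</p>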
<p>It’s unclear whether the “polluted” model will be inferior, due to the
stochastic nature of deep learning, and pausing the model update will
add more complexity to the system. Since supporting sparse update is a
TODO item, we defer the evaluation of whether to pause the model update or not
Built with <ahref="http://sphinx-doc.org/">Sphinx</a> using a <ahref="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <ahref="https://readthedocs.org">Read the Docs</a>.