# Analysis of large model distributed training in Paddle
***NOTE: These are only notes on how we implemented this scheme in V1, not a new design.***
## What is it
We often encounter cases where the (sparse) embedding layer parameters are too large to fit in the trainer's memory during training. To handle this, we store them on several parameter servers and fetch them row by row instead of fetching all of the parameters at once.
## How to use
Specify command-line arguments like `--loadsave_parameters_in_pserver=true --ports_num_for_sparse=1 --use_old_updater=1` when starting the paddle trainer, and add something like `--ports_num_for_sparse=1 --pserver_num_threads=5` when starting the pserver processes.
Accordingly, configure your embedding layers like the sketch below.
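The following is a minimal sketch, assuming the V1 Python trainer-config API (`embedding_layer` and `ParamAttr` with `sparse_update=True`); the layer name, dictionary size, and embedding size are placeholders for illustration:

```python
from paddle.trainer_config_helpers import *

# A toy data layer feeding word ids into the embedding.
data1 = data_layer(name="word_ids", size=100000)

# Marking the parameter with sparse_update=True asks the trainer to keep the
# embedding table on the parameter servers and prefetch only the rows needed
# by the current batch, instead of holding the full table in trainer memory.
emb1 = embedding_layer(
    input=data1,
    size=256,
    param_attr=ParamAttr(sparse_update=True))
```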
<liclass="toctree-l2"><aclass="reference internal"href="../../getstarted/build_and_install/index_en.html">Install and Build</a><ul>
<liclass="toctree-l3"><aclass="reference internal"href="../../getstarted/build_and_install/docker_install_en.html">PaddlePaddle in Docker Containers</a></li>
<liclass="toctree-l2"><aclass="reference internal"href="../../howto/usage/k8s/k8s_en.html">Paddle On Kubernetes</a></li>
<liclass="toctree-l2"><aclass="reference internal"href="../../howto/usage/k8s/k8s_aws_en.html">Distributed PaddlePaddle Training on AWS with Kubernetes</a></li>
<liclass="toctree-l2"><aclass="reference internal"href="../../howto/dev/new_layer_en.html">Write New Layers</a></li>
<spanid="alalysis-of-large-model-distributed-training-in-paddle"></span><h1>Alalysis of large model distributed training in Paddle<aclass="headerlink"href="#alalysis-of-large-model-distributed-training-in-paddle"title="Permalink to this headline">¶</a></h1>
<p><strong><em>NOTE: This is only some note for how we implemeted this scheme in V1, not a new design.</em></strong></p>
<divclass="section"id="what-is-it">
<spanid="what-is-it"></span><h2>What is it<aclass="headerlink"href="#what-is-it"title="Permalink to this headline">¶</a></h2>
<p>We often encounter cases that the embedding layer parameters(sparse) are so large that we can not store it in the trainer’s memory when training. So we need to put them to several servers, and fetch them row by row instead of fetch all of the parameters.</p>
</div>
<divclass="section"id="how-to-use">
<spanid="how-to-use"></span><h2>How to use<aclass="headerlink"href="#how-to-use"title="Permalink to this headline">¶</a></h2>
<p>Specify command-line argument like <codeclass="docutils literal"><spanclass="pre">--loadsave_parameters_in_pserver=true</span><spanclass="pre">--ports_num_for_sparse=1</span><spanclass="pre">--use_old_updater=1</span></code> when starting the paddle trainer. And also add something like <codeclass="docutils literal"><spanclass="pre">--ports_num_for_sparse=1</span><spanclass="pre">--pserver_num_threads=5</span></code> when starting pserver processes.</p>
<p>Accrodingly, configure your embedding layers like:</p>
<spanid="implementation-details"></span><h2>Implementation details<aclass="headerlink"href="#implementation-details"title="Permalink to this headline">¶</a></h2>
<p><codeclass="docutils literal"><spanclass="pre">MAT_SPARSE_ROW_PREFETCH</span></code> is what we use when configured to fetch only row of matrix when training.</p>
<p>Calling <codeclass="docutils literal"><spanclass="pre">parameterClient_->getParameterSparse</span></code> will do remote call to pserver’s <codeclass="docutils literal"><spanclass="pre">getParameterSparse</span></code>:</p>
<p><codeclass="docutils literal"><spanclass="pre">getParameterConfig(block).dims(1)</span></code> returns the width of the current “parameter block”(a shard of parameter object),
then <codeclass="docutils literal"><spanclass="pre">getParameterSparse</span></code> remote call returns only one row of data to the client.</p>
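To make the row-by-row fetch concrete, here is a small, self-contained Python sketch of the idea; it is a toy illustration under made-up names, not the actual Paddle C++ implementation. Parameter blocks are shards of the embedding table, each block knows its width, and a prefetch step pulls only the rows needed by the current batch:

```python
import numpy as np

class ParameterBlock:
    """A toy 'parameter block': one shard of a large embedding table."""

    def __init__(self, num_rows, width, seed=0):
        rng = np.random.RandomState(seed)
        self.width = width                      # analogous to the block width dims(1)
        self.rows = rng.randn(num_rows, width)  # the shard's rows, living on the "pserver"

    def get_rows(self, row_ids):
        # Return only the requested rows: one small array instead of the whole shard.
        return self.rows[row_ids]

def prefetch(blocks, rows_per_block, word_ids):
    """Fetch only the embedding rows needed by the current batch.

    Each word id maps to (block index, local row index), mimicking how a
    sharded table is addressed; only those rows are transferred.
    """
    fetched = {}
    for wid in set(word_ids):
        block_id, local_row = divmod(wid, rows_per_block)
        fetched[wid] = blocks[block_id].get_rows([local_row])[0]
    return fetched

# Example: a 1000-row embedding table sharded into 4 blocks of 250 rows each.
blocks = [ParameterBlock(250, width=8, seed=i) for i in range(4)]
batch_word_ids = [3, 750, 3, 42]
rows = prefetch(blocks, rows_per_block=250, word_ids=batch_word_ids)
print({wid: vec.shape for wid, vec in rows.items()})
```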