# Design Doc: Concurrent Programming with Fluid

With PaddlePaddle Fluid, users describe a program rather than a model. The program is a [`ProgramDesc`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/framework.proto) protobuf message. TensorFlow/MXNet/Caffe2 applications generate protobuf messages too, but their messages represent the model, a graph of operators, rather than the program that trains or uses the model.
Many know that when we program TensorFlow, we can specify the device on which each operator runs. This allows us to create a concurrent/parallel AI application. An interesting question is **how does a `ProgramDesc` represent a concurrent program?**
The answer lies in the fact that a `ProgramDesc` is similar to an abstract syntax tree (AST) that describes a program. So users can write a concurrent program in Fluid just as they would with any concurrent programming language, e.g., [Go](https://golang.org).
## An Analogy
The following table compares concepts in Fluid and Go:

| Go | Fluid |
|----|-------|
| user-defined functions | layers |
| control flow and built-in functions | intrinsics/operators |
| goroutines, channels | class `ThreadPool` |
| runtime | class `Executor` |
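To make the analogy concrete, here is a minimal plain-Go program (no Fluid involved) showing the goroutine-and-channel style that the analogy draws on; the function names are ours, chosen only for illustration:

```go
package main

import "fmt"

// produce sends the squares of 0..n-1 down a channel and closes it.
func produce(n int, out chan<- int) {
	for i := 0; i < n; i++ {
		out <- i * i
	}
	close(out)
}

// sum drains the channel and returns the total.
func sum(in <-chan int) int {
	total := 0
	for v := range in {
		total += v
	}
	return total
}

func main() {
	ch := make(chan int)
	go produce(4, ch)    // a goroutine and a channel: the concepts Fluid models with ThreadPool
	fmt.Println(sum(ch)) // prints 14 (0+1+4+9)
}
```

In Go, concurrency is expressed directly in the program text; the rest of this document shows how Fluid expresses the same idea as data, inside a `ProgramDesc`.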
## An Example Concurrent Program

To review all the above concepts in an example, let us take a simple program and write its distributed version.
Suppose that we want to parallelize a naive Fluid program (written in Go and calling Fluid's Go binding) that multiplies two tensors.
```go
import "fluid"

func paddlepaddle() {
	X = fluid.read(...)
	W = fluid.Tensor(...)
	Y = fluid.mult(X, W)
}
```
Please be aware that Fluid's Go binding provides the default `main` function, which calls the `paddlepaddle` function; the latter, in this case, is defined in the above program and creates the following `ProgramDesc` message.
```protobuf
message ProgramDesc {
  block[0] = Block {
    vars = [X, W, Y],
    ops = [
      read(output = X)
      assign(input = ..., output = W)
      mult(input = {X, W}, output = Y)
    ],
  }
}
```
Then, the default `main` function calls `fluid.run()`, which creates an instance of the [`class Executor`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/executor.h) and calls `Executor.Run(block[0])`, where `block[0]` is the first and only block defined in the above `ProgramDesc` message.
The default `main` function is defined as follows:
```go
func main() {
	paddlepaddle()
	fluid.run()
}
```
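Conceptually, `fluid.run` hands `block[0]` to an executor that runs each op in order on the current thread. A minimal sketch in plain Go follows; all names here (`Op`, `Block`, `Executor`, the toy ops, the `map`-based scope) are ours for illustration, not Fluid's actual C++ API:

```go
package main

import "fmt"

// Op is our stand-in for a Fluid operator: anything with a Run method.
type Op interface{ Run(scope map[string]int) }

// Block holds an ordered list of ops, like ProgramDesc's Block.ops.
type Block struct{ Ops []Op }

// Executor runs a block's ops sequentially on the current thread,
// mirroring the behavior described for Fluid's Executor.Run.
type Executor struct{}

func (Executor) Run(b *Block, scope map[string]int) {
	for _, op := range b.Ops {
		op.Run(scope)
	}
}

// assign writes a constant into the scope; mult multiplies two scope vars.
type assign struct {
	out string
	val int
}

func (a assign) Run(s map[string]int) { s[a.out] = a.val }

type mult struct{ x, w, out string }

func (m mult) Run(s map[string]int) { s[m.out] = s[m.x] * s[m.w] }

func main() {
	block := &Block{Ops: []Op{
		assign{"X", 6}, assign{"W", 7}, mult{"X", "W", "Y"},
	}}
	scope := map[string]int{}
	Executor{}.Run(block, scope)
	fmt.Println(scope["Y"]) // prints 42
}
```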
## The Concurrent Version
By parallelizing the above program, we could support a very big tensor X by splitting it into small pieces {x_1, x_2, ...} and sending each piece to a worker process/node for parallel multiplication.
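A hedged sketch of the splitting step, assuming X is a flat slice and ignoring tensor shapes; the function name `split` is ours:

```go
package main

import "fmt"

// split cuts xs into n nearly-equal contiguous pieces (the first
// len(xs)%n pieces get one extra element). It panics if n <= 0.
func split(xs []float64, n int) [][]float64 {
	if n <= 0 {
		panic("split: n must be positive")
	}
	pieces := make([][]float64, 0, n)
	size, rem := len(xs)/n, len(xs)%n
	start := 0
	for i := 0; i < n; i++ {
		end := start + size
		if i < rem {
			end++
		}
		pieces = append(pieces, xs[start:end])
		start = end
	}
	return pieces
}

func main() {
	fmt.Println(split([]float64{1, 2, 3, 4, 5}, 2)) // prints [[1 2 3] [4 5]]
}
```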
In this case, we can write a transpiler that takes a `ProgramDesc` message that represents the above example program and outputs two `ProgramDesc` messages, one to run on the master process/node and the other on the worker processes/nodes.
### The Master Program
The master program could look like the following:
```protobuf
message ProgramDesc {
  block[0] = Block {
    vars = [X, L, Y],
    ops = [
      read(output = X)
      kube_get_workers_addrs(output = L)
      Y = tensor_array(len(L))
      parallel_for(input = X, output = Y,
                   attrs = {L, block_id(1)}) # referring to block 1
    ]
  }

  block[1] = Block {
    parent = 0,
    vars = [x, y, index],
    ops = [
      slice(input = [X, index], output = x) # index is initialized by parallel_for
      send(input = x, attrs = L[index])
      recv(outputs = y, attrs = L[index])
      assign(input = y, output = Y[index])
    ]
  }
}
```
The equivalent Fluid program (calling the Go binding) is:
```go
func main() { //// block 0
	X = fluid.read(...)
	L = fluid.k8s.get_worker_addrs()
	Y = fluid.tensor_array(len(L))
	fluid.parallel_for(X, L,
		func(index int) { //// block 1
			x = X[index]
			fluid.send(L[index], x)
			y = fluid.recv(L[index])
			Y[index] = y
		})
}
```
An explanation of the above program:
- `fluid.k8s` is a package that provides access to the Kubernetes API.
- `fluid.k8s.get_worker_addrs` returns the list of IP addresses and ports of all pods of the current job except the current one (the master pod).
- `fluid.tensor_array` creates a [tensor array](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/lod_tensor_array.h). `fluid.parallel_for` creates a `ParallelFor` intrinsic, which, when executed,
  1. creates `len(L)` scopes, one for each concurrent run of the sub-block (block 1 in this case), and initializes a variable named "index" in each scope to an integer value in the range `[0, len(L)-1]`, and
  2. creates `len(L)` threads by calling into the `ThreadPool` singleton; each thread
     1. creates an Executor instance, and
     2. calls `Executor.Run(block)`, where `block` is block 1 as explained above.

Please be aware that block 1 is a sub-block of block 0, so ops in block 1 can refer to variables defined in block 0.
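The steps above can be sketched in plain Go. This is our own illustrative model, not Fluid's implementation: each "scope" is a map seeded with `index`, and goroutines stand in for `ThreadPool` threads:

```go
package main

import (
	"fmt"
	"sync"
)

// parallelFor mimics the ParallelFor intrinsic described above: it creates
// n scopes, initializes "index" in each, and runs the sub-block (here a
// closure) once per scope on its own goroutine.
func parallelFor(n int, subBlock func(scope map[string]int)) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		scope := map[string]int{"index": i} // one scope per concurrent run
		wg.Add(1)
		go func(s map[string]int) {
			defer wg.Done()
			subBlock(s)
		}(scope)
	}
	wg.Wait()
}

func main() {
	X := []int{10, 20, 30}
	Y := make([]int, len(X)) // each goroutine writes its own index, so no lock needed
	parallelFor(len(X), func(scope map[string]int) {
		i := scope["index"]
		Y[i] = X[i] * 2 // stands in for the send/recv round trip to a worker
	})
	fmt.Println(Y) // prints [20 40 60]
}
```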
### The Worker Program
The worker program looks like the following:
```go
func main() {
	W = fluid.Tensor(...)
	fluid.listen_and_do(
		fluid.k8s.self_addr(),
		func(input Tensor) {
			output = fluid.mult(input, W)
		})
}
```
```
where
- `fluid.listen_and_do` creates a `ListenAndDo` intrinsic, which, when executed,
  1. listens on the current pod's IP address, as returned by `fluid.k8s.self_addr()`, and
  2. once a connection is established,
     1. creates a scope with two parameters, "input" and "output",
     2. reads a [Fluid variable](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/variable.h) and saves it into "input", and
     3. creates an Executor instance and calls `Executor.Run(block)`, where the block is generated by running the lambda specified as the second parameter of `fluid.listen_and_do`.
## Summary
From the above example, we see that:
1. Fluid enables the imperative programming paradigm by:
   1. letting users describe a program, but not a model (a sequence of layers, or a graph of operators), and
   2. calling the `fluid.run` function that runs the program implicitly.
2. The program is described as a `ProgramDesc` protobuf message.
3. Function `Executor.Run` takes a block, instead of a `ProgramDesc`, as its parameter.
4. `fluid.run` calls `Executor.Run` to run the first block in the `ProgramDesc` message.
5. `Executor.Run`'s implementation is extremely simple -- it doesn't plan the execution or create threads; instead, it runs on the current thread and executes intrinsics'/operators' `Run` methods sequentially, in the order they appear in the `Block.ops` array.
6. Intrinsics'/operators' `Run` methods might create threads. For example, the `ListenAndDo` operator creates a thread to handle each incoming request.
7. Threads are not necessarily OS threads; instead, they could be [green threads](https://en.wikipedia.org/wiki/Green_threads) managed by `ThreadPool`. Multiple green threads might run on the same OS thread. An example of green threads is Go's [goroutines](https://tour.golang.org/concurrency/1).
With PaddlePaddle Fluid, users describe a program other than a model. The program is a [`ProgramDesc`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/framework.proto) protobuf message. TensorFlow/MxNet/Caffe2 applications generate protobuf messages too, but their protobuf messages represent the model, a graph of operators, but not the program that trains/uses the model.
Many know that when we program TensorFlow, we can specify the device on which each operator runs. This allows us to create a concurrent/parallel AI application. An interesting questions is **how does a `ProgramDesc` represents a concurrent program?**
The answer relies on the fact that a `ProgramDesc` is similar to an abstract syntax tree (AST) that describes a program. So users just program a concurrent program that they do with any concurrent programming language, e.g., [Go](https://golang.org).
## An Analogy
The following table compares concepts in Fluid and Go
To review all above concepts in an example, let us take a simple program and writes its distributed version.
Suppose that we want to parallelize a naive Fluid program (written in Go and calling Fluid's Go binding) that multiplies two tensors.
```go
import "fluid"
func paddlepaddle() {
X = fluid.read(...)
W = fluid.Tensor(...)
Y = fluid.mult(X, W)
}
```
Please be aware that the Fluid's Go binding provides the default `main` function, which calls the `paddlepaddle` function, which, in this case, is defined in above program and creates the following `ProgramDesc` message.
```protobuf
message ProgramDesc {
block[0] = Block {
vars = [X, W, Y],
ops = [
read(output = X)
assign(input = ..., output = W)
mult(input = {X, W}, output = Y)
],
}
}
```
Then, the default `main` function calls `fluid.run()`, which creates an instance of the [`class Executor`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/executor.h) and calls `Executor.Run(block[0])`, where `block[0]` is the first and only block defined in above `ProgramDesc` message.
The default `main` function is defined as follows:
```go
func main() {
paddlepaddle()
fluid.run()
}
```
## The Concurrent Version
By parallelizing the above program, we could support very big tensor X by splitting into small pieces {x_1, x_2, ...} and sent each piece to worker process/node for parallel multiplication.
In this case, we can write a transpiler that takes a `ProgramDesc` message that represents the above example program and outputs two `ProgramDesc` messages, one for running on the master process/node, and the other one for worker processes/nodes.
### The Master Program
The master program could look like the following:
```protobuf
message ProgramDesc {
block[0] = Block {
vars = [X, L, Y],
ops = [
read(output = X)
kube_get_workers_addrs(output = L)
Y = tensor_array(len(L))
parallel_for(input = X, output = Y,
attrs = {L, block_id(1)}) # referring to block 1
]
}
block[1] = Block {
parent = 0,
vars = [x, y, index],
ops = [
slice(input = [X, index], output = x) # index is initialized by parallel_for
send(input = x, attrs = L[index])
recv(outputs = y, attrs = L[index])
assign(input = y, output = Y[index])
]
}
}
```
The equivalent Fluid program (calling the Go binding) is:
```go
func main() { //// block 0
X = fluid.read(...)
L = fluid.k8s.get_worker_addrs()
Y = fluid.tensor_array(len(L))
fluid.parallel_for(X, L,
func(index int) { //// block 1
x = X[index]
fluid.send(L[index], x)
y = fluid.recv(L[index])
Y[index] = y
})
}
```
An explanation of the above program:
- `fluid.k8s` is a package that provides access to Kubernetes API.
- `fluid.k8s.get_worker_addrs` returns the list of IP and ports of all pods of the current job except for the current one (the master pod).
- `fluid.tensor_array` creates a [tensor array](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/lod_tensor_array.h). `fluid.parallel_for` creates a `ParallelFor` intrinsic, which, when executed,
1. creates `len(L)` scopes, each for the concurrent running of the sub-block (block 1 in this case), and initializes a variable named "index" in the scope to an integer value in the range `[0, len(L)-1]`, and
2. creates `len(L)` threads by calling into the `ThreadPool` singleton, each thread
1. creates an Executor instance, and
2. calls `Executor.Run(block)`, where `block` is block 1 as explained above.
1. Please be aware that block 1 is a sub-block of block 0, so ops in block 1 could refer to variables defined in block 0.
### The Worker Program
The worker program looks like
```go
func main() {
W = Tensor(...)
x = fluid.listen_and_do(
fluid.k8s.self_addr(),
func(input Tensor) {
output = fluid.mult(input, W)
})
}
```
where
- `fluid.listen_and_do` creates a `ListenAndDo` intrinsic, which, when executed,
1. listens on the current pod's IP address, as returned by `fliud.k8s.self_addr()`,
2. once a connection is established,
1. creates a scope of two parameters, "input" and "output",
2. reads a [Fluid variable](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/variable.h) and saves it into "input",
3. creates an Executor instance and calls `Executor.Run(block)`, where the block is generated by running the lambda specified as the second parameter of `fluid.listen_and_do`.
## Summarization
From the above example, we see that:
1. Fluid enables the imperative programming paradigm by:
1. letting users describe a program, but not a model (a sequence of layers, or a graph of operators), and
2. call the `fluid.run` function that runs the program implicitly.
1. The program is described as a `ProgramDesc` protobuf message.
2. Function `Executor.Run` takes a block, instead of a `ProgramDesc`, as its parameter.
3. `fluid.run` calls `Executor.Run` to run the first block in the `ProgramDesc` message.
4. `Executor.Run`'s implementation is extremely simple -- it doesn't plan the execution nor create threads; instead, it runs on the current thread and execute intrinsics/operators' `Run` method sequentially as they appear in the `Block.ops` array.
5. Intrinsics/operators' `Run` method might create threads. For example, the `ListenAndDo` operator creates a thread to handle each incoming request.
6. Threads are not necessarily OS thread; instead, they could be [green threads](https://en.wikipedia.org/wiki/Green_threads) managed by ThreadPool. Multiple green threads might run on the same OS thread. An example green threads is Go's [goroutines](https://tour.golang.org/concurrency/1).
<spanid="design-doc-concurrent-programming-with-fluid"></span><h1>Design Doc: Concurrent Programming with Fluid<aclass="headerlink"href="#design-doc-concurrent-programming-with-fluid"title="永久链接至标题">¶</a></h1>
<p>With PaddlePaddle Fluid, users describe a program other than a model. The program is a <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/framework.proto"><codeclass="docutils literal"><spanclass="pre">ProgramDesc</span></code></a> protobuf message. TensorFlow/MxNet/Caffe2 applications generate protobuf messages too, but their protobuf messages represent the model, a graph of operators, but not the program that trains/uses the model.</p>
<p>Many know that when we program TensorFlow, we can specify the device on which each operator runs. This allows us to create a concurrent/parallel AI application. An interesting questions is <strong>how does a <codeclass="docutils literal"><spanclass="pre">ProgramDesc</span></code> represents a concurrent program?</strong></p>
<p>The answer relies on the fact that a <codeclass="docutils literal"><spanclass="pre">ProgramDesc</span></code> is similar to an abstract syntax tree (AST) that describes a program. So users just program a concurrent program that they do with any concurrent programming language, e.g., <aclass="reference external"href="https://golang.org">Go</a>.</p>
<spanid="an-example-concurrent-program"></span><h2>An Example Concurrent Program<aclass="headerlink"href="#an-example-concurrent-program"title="永久链接至标题">¶</a></h2>
<p>To review all above concepts in an example, let us take a simple program and writes its distributed version.</p>
<p>Suppose that we want to parallelize a naive Fluid program (written in Go and calling Fluid’s Go binding) that multiplies two tensors.</p>
<p>Please be aware that the Fluid’s Go binding provides the default <codeclass="docutils literal"><spanclass="pre">main</span></code> function, which calls the <codeclass="docutils literal"><spanclass="pre">paddlepaddle</span></code> function, which, in this case, is defined in above program and creates the following <codeclass="docutils literal"><spanclass="pre">ProgramDesc</span></code> message.</p>
<p>Then, the default <codeclass="docutils literal"><spanclass="pre">main</span></code> function calls <codeclass="docutils literal"><spanclass="pre">fluid.run()</span></code>, which creates an instance of the <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/executor.h"><codeclass="docutils literal"><spanclass="pre">class</span><spanclass="pre">Executor</span></code></a> and calls <codeclass="docutils literal"><spanclass="pre">Executor.Run(block[0])</span></code>, where <codeclass="docutils literal"><spanclass="pre">block[0]</span></code> is the first and only block defined in above <codeclass="docutils literal"><spanclass="pre">ProgramDesc</span></code> message.</p>
<p>The default <codeclass="docutils literal"><spanclass="pre">main</span></code> function is defined as follows:</p>
<p>By parallelizing the above program, we could support very big tensor X by splitting into small pieces {x_1, x_2, ...} and sent each piece to worker process/node for parallel multiplication.</p>
<p>In this case, we can write a transpiler that takes a <codeclass="docutils literal"><spanclass="pre">ProgramDesc</span></code> message that represents the above example program and outputs two <codeclass="docutils literal"><spanclass="pre">ProgramDesc</span></code> messages, one for running on the master process/node, and the other one for worker processes/nodes.</p>
<li><codeclass="docutils literal"><spanclass="pre">fluid.k8s</span></code> is a package that provides access to Kubernetes API.</li>
<li><codeclass="docutils literal"><spanclass="pre">fluid.k8s.get_worker_addrs</span></code> returns the list of IP and ports of all pods of the current job except for the current one (the master pod).</li>
<li><codeclass="docutils literal"><spanclass="pre">fluid.tensor_array</span></code> creates a <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/lod_tensor_array.h">tensor array</a>. <codeclass="docutils literal"><spanclass="pre">fluid.parallel_for</span></code> creates a <codeclass="docutils literal"><spanclass="pre">ParallelFor</span></code> intrinsic, which, when executed,<ol>
<li>creates <codeclass="docutils literal"><spanclass="pre">len(L)</span></code> scopes, each for the concurrent running of the sub-block (block 1 in this case), and initializes a variable named “index” in the scope to an integer value in the range <codeclass="docutils literal"><spanclass="pre">[0,</span><spanclass="pre">len(L)-1]</span></code>, and</li>
<li>creates <codeclass="docutils literal"><spanclass="pre">len(L)</span></code> threads by calling into the <codeclass="docutils literal"><spanclass="pre">ThreadPool</span></code> singleton, each thread<ol>
<li>creates an Executor instance, and</li>
<li>calls <codeclass="docutils literal"><spanclass="pre">Executor.Run(block)</span></code>, where <codeclass="docutils literal"><spanclass="pre">block</span></code> is block 1 as explained above.</li>
</ol>
</li>
</ol>
</li>
</ul>
<olclass="simple">
<li>Please be aware that block 1 is a sub-block of block 0, so ops in block 1 could refer to variables defined in block 0.</li>
<li><codeclass="docutils literal"><spanclass="pre">fluid.listen_and_do</span></code> creates a <codeclass="docutils literal"><spanclass="pre">ListenAndDo</span></code> intrinsic, which, when executed,<ol>
<li>listens on the current pod’s IP address, as returned by <codeclass="docutils literal"><spanclass="pre">fliud.k8s.self_addr()</span></code>,</li>
<li>once a connection is established,<ol>
<li>creates a scope of two parameters, “input” and “output”,</li>
<li>reads a <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/variable.h">Fluid variable</a> and saves it into “input”,</li>
<li>creates an Executor instance and calls <codeclass="docutils literal"><spanclass="pre">Executor.Run(block)</span></code>, where the block is generated by running the lambda specified as the second parameter of <codeclass="docutils literal"><spanclass="pre">fluid.listen_and_do</span></code>.</li>
<li>Fluid enables the imperative programming paradigm by:<ol>
<li>letting users describe a program, but not a model (a sequence of layers, or a graph of operators), and</li>
<li>calling the <code class="docutils literal"><span class="pre">fluid.run</span></code> function that runs the program implicitly.</li>
</ol>
</li>
<li>The program is described as a <code class="docutils literal"><span class="pre">ProgramDesc</span></code> protobuf message.</li>
<li>Function <code class="docutils literal"><span class="pre">Executor.Run</span></code> takes a block, instead of a <code class="docutils literal"><span class="pre">ProgramDesc</span></code>, as its parameter.</li>
<li><code class="docutils literal"><span class="pre">fluid.run</span></code> calls <code class="docutils literal"><span class="pre">Executor.Run</span></code> to run the first block in the <code class="docutils literal"><span class="pre">ProgramDesc</span></code> message.</li>
<li><code class="docutils literal"><span class="pre">Executor.Run</span></code>&#8217;s implementation is extremely simple: it neither plans the execution nor creates threads; instead, it runs on the current thread and executes the <code class="docutils literal"><span class="pre">Run</span></code> method of each intrinsic/operator sequentially, in the order they appear in the <code class="docutils literal"><span class="pre">Block.ops</span></code> array.</li>
<li>Intrinsics&#8217;/operators&#8217; <code class="docutils literal"><span class="pre">Run</span></code> methods might create threads. For example, the <code class="docutils literal"><span class="pre">ListenAndDo</span></code> operator creates a thread to handle each incoming request.</li>
<li>Threads are not necessarily OS threads; they could be <a class="reference external" href="https://en.wikipedia.org/wiki/Green_threads">green threads</a> managed by the ThreadPool, with multiple green threads running on the same OS thread. Go&#8217;s <a class="reference external" href="https://tour.golang.org/concurrency/1">goroutines</a> are one example of green threads.</li>
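<p>Points 3&#8211;6 above can be condensed into a short Go sketch. The <code class="docutils literal"><span class="pre">Op</span></code> and <code class="docutils literal"><span class="pre">Block</span></code> types below are hypothetical simplifications of the C++ <code class="docutils literal"><span class="pre">Executor</span></code>, shown only to make the sequential, single-threaded execution model concrete:</p>

```go
package main

import "fmt"

// Op stands in for an intrinsic/operator; its Run function may itself
// spawn threads, but the Executor never does. (Hypothetical sketch.)
type Op struct {
	Name string
	Run  func(scope map[string]float64)
}

// Block mirrors the ops array of one block in a ProgramDesc message.
type Block struct{ Ops []Op }

// Run executes ops on the current thread, in the order they appear in
// Block.Ops -- no execution planning, no thread creation.
func (b Block) Run(scope map[string]float64) {
	for _, op := range b.Ops {
		op.Run(scope)
	}
}

func main() {
	// The three ops of block 0 from the multiplication example.
	scope := map[string]float64{}
	block := Block{Ops: []Op{
		{"read", func(s map[string]float64) { s["X"] = 3 }},
		{"assign", func(s map[string]float64) { s["W"] = 2 }},
		{"mult", func(s map[string]float64) { s["Y"] = s["X"] * s["W"] }},
	}}
	block.Run(scope)
	fmt.Println(scope["Y"]) // 6
}
```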
<span id="towards-a-deep-learning-language-and-the-compiler"></span><h2>Towards a Deep Learning Language and the Compiler<a class="headerlink" href="#towards-a-deep-learning-language-and-the-compiler" title="Permalink to this headline">¶</a></h2>
<p>We can change the <code class="docutils literal"><span class="pre">if-then-else</span></code> and loop structures in the above Fluid example programs a little bit to turn Fluid into a new programming language, different from Python.</p>
<p>Even if we do not invent a new language, as long as we get the <code class="docutils literal"><span class="pre">ProgramDesc</span></code> message filled in, we can write a transpiler that translates each invocation of an operator into a C++ call to that operator&#8217;s kernel function. For example, a transpiler that weaves in the CUDA kernels outputs an NVIDIA-friendly C++ program, which can be built using <code class="docutils literal"><span class="pre">nvcc</span></code>. Another transpiler could generate MKL-friendly code that should be built using <code class="docutils literal"><span class="pre">icc</span></code> from Intel. More interestingly, we can translate a Fluid program into its distributed version, consisting of two <code class="docutils literal"><span class="pre">ProgramDesc</span></code> messages: one that runs on the trainer process and one for the parameter server. For more details on the last example, the <a class="reference external" href="design/concurrent_programming.md">concurrent programming design</a> document is a good pointer. The following figure explains the proposed two-stage process:</p>