# Design Doc: Concurrent Programming with Fluid

With PaddlePaddle Fluid, users describe a program rather than a model. The program is a [`ProgramDesc`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/framework.proto) protobuf message. TensorFlow/MXNet/Caffe2 applications generate protobuf messages too, but their messages represent the model, a graph of operators, rather than the program that trains or uses the model.
Many know that when we program TensorFlow, we can specify the device on which each operator runs. This allows us to create a concurrent/parallel AI application. An interesting question is **how does a `ProgramDesc` represent a concurrent program?**
The answer lies in the fact that a `ProgramDesc` is similar to an abstract syntax tree (AST) that describes a program. So users can write a concurrent program in Fluid just as they would with any concurrent programming language, e.g., [Go](https://golang.org).
## An Analogy
The following table compares concepts in Fluid and Go:

| Go | Fluid |
|----|-------|
| user-defined functions | layers |
| control flow and built-in functions | intrinsics/operators |
| goroutines, channels | class `ThreadPool` |
| runtime | class `Executor` |
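To make the analogy concrete, here is a minimal plain-Go program (no Fluid involved) showing the goroutine-and-channel style that the analogy draws on; the function names are ours, chosen only for illustration:

```go
package main

import "fmt"

// produce sends the squares of 0..n-1 down a channel and closes it.
func produce(n int, out chan<- int) {
	for i := 0; i < n; i++ {
		out <- i * i
	}
	close(out)
}

// sum drains the channel and returns the total.
func sum(in <-chan int) int {
	total := 0
	for v := range in {
		total += v
	}
	return total
}

func main() {
	ch := make(chan int)
	go produce(4, ch)    // a goroutine and a channel: the concepts Fluid models with ThreadPool
	fmt.Println(sum(ch)) // prints 14 (0+1+4+9)
}
```

In Go, concurrency is expressed directly in the program text; the rest of this document shows how Fluid expresses the same idea as data, inside a `ProgramDesc`.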
## An Example Concurrent Program

To review all the above concepts in an example, let us take a simple program and write its distributed version.
Suppose that we want to parallelize a naive Fluid program (written in Go and calling Fluid's Go binding) that multiplies two tensors.
```go
import "fluid"

func paddlepaddle() {
	X = fluid.read(...)
	W = fluid.Tensor(...)
	Y = fluid.mult(X, W)
}
```
Please be aware that Fluid's Go binding provides the default `main` function, which calls the `paddlepaddle` function; the latter, in this case, is defined in the above program and creates the following `ProgramDesc` message.
```protobuf
message ProgramDesc {
  block[0] = Block {
    vars = [X, W, Y],
    ops = [
      read(output = X)
      assign(input = ..., output = W)
      mult(input = {X, W}, output = Y)
    ],
  }
}
```
Then, the default `main` function calls `fluid.run()`, which creates an instance of the [`class Executor`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/executor.h) and calls `Executor.Run(block[0])`, where `block[0]` is the first and only block defined in the above `ProgramDesc` message.
The default `main` function is defined as follows:
```go
func main() {
	paddlepaddle()
	fluid.run()
}
```
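Conceptually, `fluid.run` hands `block[0]` to an executor that runs each op in order on the current thread. A minimal sketch in plain Go follows; all names here (`Op`, `Block`, `Executor`, the toy ops, the `map`-based scope) are ours for illustration, not Fluid's actual C++ API:

```go
package main

import "fmt"

// Op is our stand-in for a Fluid operator: anything with a Run method.
type Op interface{ Run(scope map[string]int) }

// Block holds an ordered list of ops, like ProgramDesc's Block.ops.
type Block struct{ Ops []Op }

// Executor runs a block's ops sequentially on the current thread,
// mirroring the behavior described for Fluid's Executor.Run.
type Executor struct{}

func (Executor) Run(b *Block, scope map[string]int) {
	for _, op := range b.Ops {
		op.Run(scope)
	}
}

// assign writes a constant into the scope; mult multiplies two scope vars.
type assign struct {
	out string
	val int
}

func (a assign) Run(s map[string]int) { s[a.out] = a.val }

type mult struct{ x, w, out string }

func (m mult) Run(s map[string]int) { s[m.out] = s[m.x] * s[m.w] }

func main() {
	block := &Block{Ops: []Op{
		assign{"X", 6}, assign{"W", 7}, mult{"X", "W", "Y"},
	}}
	scope := map[string]int{}
	Executor{}.Run(block, scope)
	fmt.Println(scope["Y"]) // prints 42
}
```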
## The Concurrent Version
By parallelizing the above program, we could support a very big tensor X by splitting it into small pieces {x_1, x_2, ...} and sending each piece to a worker process/node for parallel multiplication.
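A hedged sketch of the splitting step, assuming X is a flat slice and ignoring tensor shapes; the function name `split` is ours:

```go
package main

import "fmt"

// split cuts xs into n nearly-equal contiguous pieces (the first
// len(xs)%n pieces get one extra element). It panics if n <= 0.
func split(xs []float64, n int) [][]float64 {
	if n <= 0 {
		panic("split: n must be positive")
	}
	pieces := make([][]float64, 0, n)
	size, rem := len(xs)/n, len(xs)%n
	start := 0
	for i := 0; i < n; i++ {
		end := start + size
		if i < rem {
			end++
		}
		pieces = append(pieces, xs[start:end])
		start = end
	}
	return pieces
}

func main() {
	fmt.Println(split([]float64{1, 2, 3, 4, 5}, 2)) // prints [[1 2 3] [4 5]]
}
```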
In this case, we can write a transpiler that takes a `ProgramDesc` message that represents the above example program and outputs two `ProgramDesc` messages, one to run on the master process/node and the other on the worker processes/nodes.
### The Master Program
The master program could look like the following:
```protobuf
message ProgramDesc {
  block[0] = Block {
    vars = [X, L, Y],
    ops = [
      read(output = X)
      kube_get_workers_addrs(output = L)
      Y = tensor_array(len(L))
      parallel_for(input = X, output = Y,
                   attrs = {L, block_id(1)}) # referring to block 1
    ]
  }

  block[1] = Block {
    parent = 0,
    vars = [x, y, index],
    ops = [
      slice(input = [X, index], output = x) # index is initialized by parallel_for
      send(input = x, attrs = L[index])
      recv(outputs = y, attrs = L[index])
      assign(input = y, output = Y[index])
    ]
  }
}
```
The equivalent Fluid program (calling the Go binding) is:
```go
func main() { //// block 0
	X = fluid.read(...)
	L = fluid.k8s.get_worker_addrs()
	Y = fluid.tensor_array(len(L))
	fluid.parallel_for(X, L,
		func(index int) { //// block 1
			x = X[index]
			fluid.send(L[index], x)
			y = fluid.recv(L[index])
			Y[index] = y
		})
}
```
An explanation of the above program:
- `fluid.k8s` is a package that provides access to the Kubernetes API.
- `fluid.k8s.get_worker_addrs` returns the list of IP addresses and ports of all pods of the current job except the current one (the master pod).
- `fluid.tensor_array` creates a [tensor array](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/lod_tensor_array.h). `fluid.parallel_for` creates a `ParallelFor` intrinsic, which, when executed,
  1. creates `len(L)` scopes, one for each concurrent run of the sub-block (block 1 in this case), and initializes a variable named "index" in each scope to an integer value in the range `[0, len(L)-1]`, and
  2. creates `len(L)` threads by calling into the `ThreadPool` singleton; each thread
     1. creates an Executor instance, and
     2. calls `Executor.Run(block)`, where `block` is block 1 as explained above.

Please be aware that block 1 is a sub-block of block 0, so ops in block 1 can refer to variables defined in block 0.
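The steps above can be sketched in plain Go. This is our own illustrative model, not Fluid's implementation: each "scope" is a map seeded with `index`, and goroutines stand in for `ThreadPool` threads:

```go
package main

import (
	"fmt"
	"sync"
)

// parallelFor mimics the ParallelFor intrinsic described above: it creates
// n scopes, initializes "index" in each, and runs the sub-block (here a
// closure) once per scope on its own goroutine.
func parallelFor(n int, subBlock func(scope map[string]int)) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		scope := map[string]int{"index": i} // one scope per concurrent run
		wg.Add(1)
		go func(s map[string]int) {
			defer wg.Done()
			subBlock(s)
		}(scope)
	}
	wg.Wait()
}

func main() {
	X := []int{10, 20, 30}
	Y := make([]int, len(X)) // each goroutine writes its own index, so no lock needed
	parallelFor(len(X), func(scope map[string]int) {
		i := scope["index"]
		Y[i] = X[i] * 2 // stands in for the send/recv round trip to a worker
	})
	fmt.Println(Y) // prints [20 40 60]
}
```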
### The Worker Program
The worker program looks like the following:
```go
func main() {
	W = fluid.Tensor(...)
	fluid.listen_and_do(
		fluid.k8s.self_addr(),
		func(input Tensor) {
			output = fluid.mult(input, W)
		})
}
```
```
where
- `fluid.listen_and_do` creates a `ListenAndDo` intrinsic, which, when executed,
  1. listens on the current pod's IP address, as returned by `fluid.k8s.self_addr()`, and
  2. once a connection is established,
     1. creates a scope with two parameters, "input" and "output",
     2. reads a [Fluid variable](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/variable.h) and saves it into "input", and
     3. creates an Executor instance and calls `Executor.Run(block)`, where the block is generated by running the lambda specified as the second parameter of `fluid.listen_and_do`.
## Summary
From the above example, we see that:
1. Fluid enables the imperative programming paradigm by:
   1. letting users describe a program, but not a model (a sequence of layers, or a graph of operators), and
   2. calling the `fluid.run` function that runs the program implicitly.
2. The program is described as a `ProgramDesc` protobuf message.
3. Function `Executor.Run` takes a block, instead of a `ProgramDesc`, as its parameter.
4. `fluid.run` calls `Executor.Run` to run the first block in the `ProgramDesc` message.
5. `Executor.Run`'s implementation is extremely simple -- it doesn't plan the execution or create threads; instead, it runs on the current thread and executes intrinsics'/operators' `Run` methods sequentially, in the order they appear in the `Block.ops` array.
6. Intrinsics'/operators' `Run` methods might create threads. For example, the `ListenAndDo` operator creates a thread to handle each incoming request.
7. Threads are not necessarily OS threads; instead, they could be [green threads](https://en.wikipedia.org/wiki/Green_threads) managed by `ThreadPool`. Multiple green threads might run on the same OS thread. An example of green threads is Go's [goroutines](https://tour.golang.org/concurrency/1).
With PaddlePaddle Fluid, users describe a program other than a model. The program is a [`ProgramDesc`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/framework.proto) protobuf message. TensorFlow/MxNet/Caffe2 applications generate protobuf messages too, but their protobuf messages represent the model, a graph of operators, but not the program that trains/uses the model.
Many know that when we program TensorFlow, we can specify the device on which each operator runs. This allows us to create a concurrent/parallel AI application. An interesting questions is **how does a `ProgramDesc` represents a concurrent program?**
The answer relies on the fact that a `ProgramDesc` is similar to an abstract syntax tree (AST) that describes a program. So users just program a concurrent program that they do with any concurrent programming language, e.g., [Go](https://golang.org).
## An Analogy
The following table compares concepts in Fluid and Go
To review all above concepts in an example, let us take a simple program and writes its distributed version.
Suppose that we want to parallelize a naive Fluid program (written in Go and calling Fluid's Go binding) that multiplies two tensors.
```go
import "fluid"
func paddlepaddle() {
X = fluid.read(...)
W = fluid.Tensor(...)
Y = fluid.mult(X, W)
}
```
Please be aware that the Fluid's Go binding provides the default `main` function, which calls the `paddlepaddle` function, which, in this case, is defined in above program and creates the following `ProgramDesc` message.
```protobuf
message ProgramDesc {
block[0] = Block {
vars = [X, W, Y],
ops = [
read(output = X)
assign(input = ..., output = W)
mult(input = {X, W}, output = Y)
],
}
}
```
Then, the default `main` function calls `fluid.run()`, which creates an instance of the [`class Executor`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/executor.h) and calls `Executor.Run(block[0])`, where `block[0]` is the first and only block defined in above `ProgramDesc` message.
The default `main` function is defined as follows:
```go
func main() {
paddlepaddle()
fluid.run()
}
```
## The Concurrent Version
By parallelizing the above program, we could support very big tensor X by splitting into small pieces {x_1, x_2, ...} and sent each piece to worker process/node for parallel multiplication.
In this case, we can write a transpiler that takes a `ProgramDesc` message that represents the above example program and outputs two `ProgramDesc` messages, one for running on the master process/node, and the other one for worker processes/nodes.
### The Master Program
The master program could look like the following:
```protobuf
message ProgramDesc {
block[0] = Block {
vars = [X, L, Y],
ops = [
read(output = X)
kube_get_workers_addrs(output = L)
Y = tensor_array(len(L))
parallel_for(input = X, output = Y,
attrs = {L, block_id(1)}) # referring to block 1
]
}
block[1] = Block {
parent = 0,
vars = [x, y, index],
ops = [
slice(input = [X, index], output = x) # index is initialized by parallel_for
send(input = x, attrs = L[index])
recv(outputs = y, attrs = L[index])
assign(input = y, output = Y[index])
]
}
}
```
The equivalent Fluid program (calling the Go binding) is:
```go
func main() { //// block 0
X = fluid.read(...)
L = fluid.k8s.get_worker_addrs()
Y = fluid.tensor_array(len(L))
fluid.parallel_for(X, L,
func(index int) { //// block 1
x = X[index]
fluid.send(L[index], x)
y = fluid.recv(L[index])
Y[index] = y
})
}
```
An explanation of the above program:
- `fluid.k8s` is a package that provides access to Kubernetes API.
- `fluid.k8s.get_worker_addrs` returns the list of IP and ports of all pods of the current job except for the current one (the master pod).
- `fluid.tensor_array` creates a [tensor array](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/lod_tensor_array.h). `fluid.parallel_for` creates a `ParallelFor` intrinsic, which, when executed,
1. creates `len(L)` scopes, each for the concurrent running of the sub-block (block 1 in this case), and initializes a variable named "index" in the scope to an integer value in the range `[0, len(L)-1]`, and
2. creates `len(L)` threads by calling into the `ThreadPool` singleton, each thread
1. creates an Executor instance, and
2. calls `Executor.Run(block)`, where `block` is block 1 as explained above.
1. Please be aware that block 1 is a sub-block of block 0, so ops in block 1 could refer to variables defined in block 0.
### The Worker Program
The worker program looks like
```go
func main() {
W = Tensor(...)
x = fluid.listen_and_do(
fluid.k8s.self_addr(),
func(input Tensor) {
output = fluid.mult(input, W)
})
}
```
where
- `fluid.listen_and_do` creates a `ListenAndDo` intrinsic, which, when executed,
1. listens on the current pod's IP address, as returned by `fliud.k8s.self_addr()`,
2. once a connection is established,
1. creates a scope of two parameters, "input" and "output",
2. reads a [Fluid variable](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/variable.h) and saves it into "input",
3. creates an Executor instance and calls `Executor.Run(block)`, where the block is generated by running the lambda specified as the second parameter of `fluid.listen_and_do`.
## Summarization
From the above example, we see that:
1. Fluid enables the imperative programming paradigm by:
1. letting users describe a program, but not a model (a sequence of layers, or a graph of operators), and
2. call the `fluid.run` function that runs the program implicitly.
1. The program is described as a `ProgramDesc` protobuf message.
2. Function `Executor.Run` takes a block, instead of a `ProgramDesc`, as its parameter.
3. `fluid.run` calls `Executor.Run` to run the first block in the `ProgramDesc` message.
4. `Executor.Run`'s implementation is extremely simple -- it doesn't plan the execution nor create threads; instead, it runs on the current thread and execute intrinsics/operators' `Run` method sequentially as they appear in the `Block.ops` array.
5. Intrinsics/operators' `Run` method might create threads. For example, the `ListenAndDo` operator creates a thread to handle each incoming request.
6. Threads are not necessarily OS thread; instead, they could be [green threads](https://en.wikipedia.org/wiki/Green_threads) managed by ThreadPool. Multiple green threads might run on the same OS thread. An example green threads is Go's [goroutines](https://tour.golang.org/concurrency/1).
<spanid="design-doc-concurrent-programming-with-fluid"></span><h1>Design Doc: Concurrent Programming with Fluid<aclass="headerlink"href="#design-doc-concurrent-programming-with-fluid"title="永久链接至标题">¶</a></h1>
<p>With PaddlePaddle Fluid, users describe a program other than a model. The program is a <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/framework.proto"><codeclass="docutils literal"><spanclass="pre">ProgramDesc</span></code></a> protobuf message. TensorFlow/MxNet/Caffe2 applications generate protobuf messages too, but their protobuf messages represent the model, a graph of operators, but not the program that trains/uses the model.</p>
<p>Many know that when we program TensorFlow, we can specify the device on which each operator runs. This allows us to create a concurrent/parallel AI application. An interesting questions is <strong>how does a <codeclass="docutils literal"><spanclass="pre">ProgramDesc</span></code> represents a concurrent program?</strong></p>
<p>The answer relies on the fact that a <codeclass="docutils literal"><spanclass="pre">ProgramDesc</span></code> is similar to an abstract syntax tree (AST) that describes a program. So users just program a concurrent program that they do with any concurrent programming language, e.g., <aclass="reference external"href="https://golang.org">Go</a>.</p>
<spanid="an-example-concurrent-program"></span><h2>An Example Concurrent Program<aclass="headerlink"href="#an-example-concurrent-program"title="永久链接至标题">¶</a></h2>
<p>To review all above concepts in an example, let us take a simple program and writes its distributed version.</p>
<p>Suppose that we want to parallelize a naive Fluid program (written in Go and calling Fluid’s Go binding) that multiplies two tensors.</p>
<p>Please be aware that the Fluid’s Go binding provides the default <codeclass="docutils literal"><spanclass="pre">main</span></code> function, which calls the <codeclass="docutils literal"><spanclass="pre">paddlepaddle</span></code> function, which, in this case, is defined in above program and creates the following <codeclass="docutils literal"><spanclass="pre">ProgramDesc</span></code> message.</p>
<p>Then, the default <codeclass="docutils literal"><spanclass="pre">main</span></code> function calls <codeclass="docutils literal"><spanclass="pre">fluid.run()</span></code>, which creates an instance of the <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/executor.h"><codeclass="docutils literal"><spanclass="pre">class</span><spanclass="pre">Executor</span></code></a> and calls <codeclass="docutils literal"><spanclass="pre">Executor.Run(block[0])</span></code>, where <codeclass="docutils literal"><spanclass="pre">block[0]</span></code> is the first and only block defined in above <codeclass="docutils literal"><spanclass="pre">ProgramDesc</span></code> message.</p>
<p>The default <codeclass="docutils literal"><spanclass="pre">main</span></code> function is defined as follows:</p>
<p>By parallelizing the above program, we could support very big tensor X by splitting into small pieces {x_1, x_2, ...} and sent each piece to worker process/node for parallel multiplication.</p>
<p>In this case, we can write a transpiler that takes a <codeclass="docutils literal"><spanclass="pre">ProgramDesc</span></code> message that represents the above example program and outputs two <codeclass="docutils literal"><spanclass="pre">ProgramDesc</span></code> messages, one for running on the master process/node, and the other one for worker processes/nodes.</p>
<li><codeclass="docutils literal"><spanclass="pre">fluid.k8s</span></code> is a package that provides access to Kubernetes API.</li>
<li><codeclass="docutils literal"><spanclass="pre">fluid.k8s.get_worker_addrs</span></code> returns the list of IP and ports of all pods of the current job except for the current one (the master pod).</li>
<li><codeclass="docutils literal"><spanclass="pre">fluid.tensor_array</span></code> creates a <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/lod_tensor_array.h">tensor array</a>. <codeclass="docutils literal"><spanclass="pre">fluid.parallel_for</span></code> creates a <codeclass="docutils literal"><spanclass="pre">ParallelFor</span></code> intrinsic, which, when executed,<ol>
<li>creates <codeclass="docutils literal"><spanclass="pre">len(L)</span></code> scopes, each for the concurrent running of the sub-block (block 1 in this case), and initializes a variable named “index” in the scope to an integer value in the range <codeclass="docutils literal"><spanclass="pre">[0,</span><spanclass="pre">len(L)-1]</span></code>, and</li>
<li>creates <codeclass="docutils literal"><spanclass="pre">len(L)</span></code> threads by calling into the <codeclass="docutils literal"><spanclass="pre">ThreadPool</span></code> singleton, each thread<ol>
<li>creates an Executor instance, and</li>
<li>calls <codeclass="docutils literal"><spanclass="pre">Executor.Run(block)</span></code>, where <codeclass="docutils literal"><spanclass="pre">block</span></code> is block 1 as explained above.</li>
</ol>
</li>
</ol>
</li>
</ul>
<olclass="simple">
<li>Please be aware that block 1 is a sub-block of block 0, so ops in block 1 could refer to variables defined in block 0.</li>
<li><codeclass="docutils literal"><spanclass="pre">fluid.listen_and_do</span></code> creates a <codeclass="docutils literal"><spanclass="pre">ListenAndDo</span></code> intrinsic, which, when executed,<ol>
<li>listens on the current pod’s IP address, as returned by <codeclass="docutils literal"><spanclass="pre">fliud.k8s.self_addr()</span></code>,</li>
<li>once a connection is established,<ol>
<li>creates a scope of two parameters, “input” and “output”,</li>
<li>reads a <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/variable.h">Fluid variable</a> and saves it into “input”,</li>
<li>creates an Executor instance and calls <codeclass="docutils literal"><spanclass="pre">Executor.Run(block)</span></code>, where the block is generated by running the lambda specified as the second parameter of <codeclass="docutils literal"><spanclass="pre">fluid.listen_and_do</span></code>.</li>
<li>Fluid enables the imperative programming paradigm by:<ol>
<li>letting users describe a program, but not a model (a sequence of layers, or a graph of operators), and</li>
<li>calling the <code class="docutils literal"><span class="pre">fluid.run</span></code> function that runs the program implicitly.</li>
</ol>
</li>
<li>The program is described as a <code class="docutils literal"><span class="pre">ProgramDesc</span></code> protobuf message.</li>
<li>Function <code class="docutils literal"><span class="pre">Executor.Run</span></code> takes a block, instead of a <code class="docutils literal"><span class="pre">ProgramDesc</span></code>, as its parameter.</li>
<li><code class="docutils literal"><span class="pre">fluid.run</span></code> calls <code class="docutils literal"><span class="pre">Executor.Run</span></code> to run the first block in the <code class="docutils literal"><span class="pre">ProgramDesc</span></code> message.</li>
<li><code class="docutils literal"><span class="pre">Executor.Run</span></code>&#8217;s implementation is extremely simple: it neither plans the execution nor creates threads; instead, it runs on the current thread and executes the <code class="docutils literal"><span class="pre">Run</span></code> method of each intrinsic/operator sequentially, in the order they appear in the <code class="docutils literal"><span class="pre">Block.ops</span></code> array.</li>
<li>Intrinsics&#8217;/operators&#8217; <code class="docutils literal"><span class="pre">Run</span></code> methods might create threads. For example, the <code class="docutils literal"><span class="pre">ListenAndDo</span></code> operator creates a thread to handle each incoming request.</li>
<li>Threads are not necessarily OS threads; they could be <a class="reference external" href="https://en.wikipedia.org/wiki/Green_threads">green threads</a> managed by the ThreadPool, with multiple green threads running on the same OS thread. Go&#8217;s <a class="reference external" href="https://tour.golang.org/concurrency/1">goroutines</a> are one example of green threads.</li>
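<p>Points 3&#8211;6 above can be condensed into a short Go sketch. The <code class="docutils literal"><span class="pre">Op</span></code> and <code class="docutils literal"><span class="pre">Block</span></code> types below are hypothetical simplifications of the C++ <code class="docutils literal"><span class="pre">Executor</span></code>, shown only to make the sequential, single-threaded execution model concrete:</p>

```go
package main

import "fmt"

// Op stands in for an intrinsic/operator; its Run function may itself
// spawn threads, but the Executor never does. (Hypothetical sketch.)
type Op struct {
	Name string
	Run  func(scope map[string]float64)
}

// Block mirrors the ops array of one block in a ProgramDesc message.
type Block struct{ Ops []Op }

// Run executes ops on the current thread, in the order they appear in
// Block.Ops -- no execution planning, no thread creation.
func (b Block) Run(scope map[string]float64) {
	for _, op := range b.Ops {
		op.Run(scope)
	}
}

func main() {
	// The three ops of block 0 from the multiplication example.
	scope := map[string]float64{}
	block := Block{Ops: []Op{
		{"read", func(s map[string]float64) { s["X"] = 3 }},
		{"assign", func(s map[string]float64) { s["W"] = 2 }},
		{"mult", func(s map[string]float64) { s["Y"] = s["X"] * s["W"] }},
	}}
	block.Run(scope)
	fmt.Println(scope["Y"]) // 6
}
```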
<span id="towards-a-deep-learning-language-and-the-compiler"></span><h2>Towards a Deep Learning Language and the Compiler<a class="headerlink" href="#towards-a-deep-learning-language-and-the-compiler" title="Permalink to this headline">¶</a></h2>
<p>We can change the <code class="docutils literal"><span class="pre">if-then-else</span></code> and loop structures in the above Fluid example programs a little bit to turn Fluid into a new programming language, different from Python.</p>
<p>Even if we do not invent a new language, as long as we get the <code class="docutils literal"><span class="pre">ProgramDesc</span></code> message filled in, we can write a transpiler that translates each invocation of an operator into a C++ call to that operator&#8217;s kernel function. For example, a transpiler that weaves in the CUDA kernels outputs an NVIDIA-friendly C++ program, which can be built using <code class="docutils literal"><span class="pre">nvcc</span></code>. Another transpiler could generate MKL-friendly code that should be built using <code class="docutils literal"><span class="pre">icc</span></code> from Intel. More interestingly, we can translate a Fluid program into its distributed version, consisting of two <code class="docutils literal"><span class="pre">ProgramDesc</span></code> messages: one that runs on the trainer process and one for the parameter server. For more details on the last example, the <a class="reference external" href="design/concurrent_programming.md">concurrent programming design</a> document is a good pointer. The following figure explains the proposed two-stage process:</p>