提交 7728c534 编写于 作者: T tensor-tang

Merge remote-tracking branch 'upstream/develop' into fluid

Conflicts:
	paddle/platform/place.h
# Design Doc: PaddlePaddle Fluid
## Why Fluid
When Baidu developed PaddlePaddle in 2013, the only well-known open source deep learning system at the time was Caffe. However, when PaddlePaddle was open-sourced in 2016, many other choices were available. There was a challenge -- what is the need for open sourcing yet another deep learning framework?
Fluid is the answer. Fluid is similar to PyTorch and TensorFlow Eager Execution, which describes the "process" of training or inference using the concept of a model. In fact in PyTorch, TensorFlow Eager Execution and Fluid, there is no concept of a model at all. The details are covered in the sections below. Fluid is currently more extreme in the above mentioned idea than PyTorch and Eager Execution, and we are trying to push Fluid towards the directions of a compiler and a new programming language for deep learning.
## The Evolution of Deep Learning Systems
Deep learning infrastructure is one of the fastest evolving technologies. Within four years, there have already been three generations of technologies invented.
| Existed since | model as sequence of layers | model as graph of operators | No model |
|--|--|--|--|
| 2013 | Caffe, Theano, Torch, PaddlePaddle | | |
| 2015 | | TensorFlow, MxNet, Caffe2, ONNX, n-graph | |
| 2016 | | | PyTorch, TensorFlow Eager Execution, PaddlePaddle Fluid |
From the above table, we see that the deep learning technology is evolving towards getting rid of the concept of a model. To understand the reasons behind this direction, a comparison of the *programming paradigms* or the ways to program deep learning applications using these systems, would be helpful. The following section goes over these.
## Deep Learning Programming Paradigms
With the systems listed as the first or second generation, e.g., Caffe or TensorFlow, an AI application training program looks like the following:
```python
x = layer.data("image")
l = layer.data("label")
f = layer.fc(x, W)
s = layer.softmax(f)
c = layer.mse(l, s)
for i in xrange(1000): # train for 1000 iterations
m = read_minibatch()
forward({input=x, data=m}, minimize=c)
backward(...)
print W # print the trained model parameters.
```
The above program includes two parts:
1. The first part describes the model, and
2. The second part describes the training process (or inference process) for the model.
This paradigm has a well-known problem that limits the productivity of programmers. If the programmer made a mistake in configuring the model, the error messages wouldn't show up until the second part is executed and `forward` and `backward` propagations are performed. This makes it difficult for the programmer to debug and locate a mistake that is located blocks away from the actual error prompt.
This problem of being hard to debug and re-iterate fast on a program is the primary reason that programmers, in general, prefer PyTorch over the older systems. Using PyTorch, we would write the above program as following:
```python
W = tensor(...)
for i in xrange(1000): # train for 1000 iterations
m = read_minibatch()
x = m["image"]
l = m["label"]
f = layer.fc(x, W)
s = layer.softmax(f)
c = layer.mse(l, s)
backward()
print W # print the trained model parameters.
```
We can see that the main difference is the moving the model configuration part (the first step) into the training loop. This change would allow the mistakes in model configuration to be reported where they actually appear in the programming block. This change also represents the model better, or its forward pass, by keeping the configuration process in the training loop.
## Describe Arbitrary Models for the Future
Describing the process instead of the model also brings Fluid, the flexibility to define different non-standard models that haven't been invented yet.
As we write out the program for the process, we can write an RNN as a loop, instead of an RNN as a layer or as an operator. A PyTorch example would look like the following:
```python
for i in xrange(1000):
m = read_minibatch()
x = m["sentence"]
for t in xrange x.len():
h[t] = the_step(x[t])
```
With Fluid, the training loop and the RNN in the above program are not really Python loops, but just a "loop structure" provided by Fluid and implemented in C++ as the following:
```python
train_loop = layers.While(cond)
with train_loop.block():
m = read_minibatch()
x = m["sentence"]
rnn = layers.While(...)
with rnn.block():
h[t] = the_step(input[t])
```
An actual Fluid example is described [here](https://github.com/PaddlePaddle/Paddle/blob/a91efdde6910ce92a78e3aa7157412c4c88d9ee8/python/paddle/v2/fluid/tests/test_while_op.py#L36-L44).
From the example, the Fluid programs look very similar to their PyTorch equivalent programs, except that Fluid's loop structure, wrapped with Python's `with` statement, could run much faster than just a Python loop.
We have more examples of the [`if-then-else`](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/if_else_op.md) structure of Fluid.
## Turing Completeness
In computability theory, a system of data-manipulation rules, such as a programming language, is said to be Turing complete if it can be used to simulate any Turing machine. For a programming language, if it provides if-then-else and loop, it is Turing complete. From the above examples, Fluid seems to be Turing complete; however, it is noteworthy to notice that there is a slight difference between the `if-then-else` of Fluid and that of a programming language. The difference being that the former runs both of its branches and splits the input mini-batch into two -- one for the True condition and another for the False condition. This hasn't been researched in depth if this is equivalent to the `if-then-else` in programming languages that makes them Turing-complete. Based on a conversation with [Yuang Yu](https://research.google.com/pubs/104812.html), it seems to be the case but this needs to be looked into in-depth.
## The Execution of a Fluid Program
There are two ways to execute a Fluid program. When a program is executed, it creates a protobuf message [`ProgramDesc`](https://github.com/PaddlePaddle/Paddle/blob/a91efdde6910ce92a78e3aa7157412c4c88d9ee8/paddle/framework/framework.proto#L145) that describes the process and is conceptually like an [abstract syntax tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree).
There is a C++ class [`Executor`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/executor.h), which runs a `ProgramDesc`, similar to how an interpreter runs a Python program.
Fluid is moving towards the direction of a compiler, which is explain in more detail later in this article.
## Backward Compatibility of Fluid
Given all the advantages from the removal of the concept of a *model*, hardware manufacturers might still prefer the existence of the concept of a model, so it would be easier for them to support multiple frameworks all at once and could run a trained model during inference. For example, Nervana, a startup company acquired by Intel, has been working on an XPU that reads the models in the format known as [n-graph](https://github.com/NervanaSystems/ngraph). Similarly, [Movidius](https://www.movidius.com/) is producing a mobile deep learning chip that reads and runs graphs of operators. The well-known [ONNX](https://github.com/onnx/onnx) is also a file format of graphs of operators.
For Fluid, we can write a converter that extracts the parts in the `ProgramDesc` protobuf message, converts them into a graph of operators, and exports the graph into the ONNX or n-graph format.
## Towards a Deep Learning Language and the Compiler
We can change the `if-then-else` and loop structure a little bit in the above Fluid example programs, to make it into a new programming language, different than Python.
Even if we do not invent a new language, as long as we get the `ProgramDesc` message filled in, we can write a transpiler, which translates each invocation to an operator, into a C++ call to a kernel function of that operator. For example, a transpiler that weaves the CUDA kernels outputs an NVIDIA-friendly C++ program, which can be built using `nvcc`. Another transpiler could generate MKL-friendly code that should be built using `icc` from Intel. More interestingly, we can translate a Fluid program into its distributed version of two `ProgramDesc` messages, one for running on the trainer process, and the other one for the parameter server. For more details of the last example, the [concurrent programming design](concurrent_programming.md) document would be a good pointer. The following figure explains the proposed two-stage process:
![](fluid-compiler.png)
# Design Doc: Support new Device/Library # Design Doc: Supporting new Device/Library
## Background ## Background
Deep learning has a high demand for computing resources. New high-performance device and computing library are coming constantly. The deep learning framework has to integrate these high-performance device and computing library flexibly. Deep learning has a high demand for computing resources. New high-performance devices and computing libraries are appearing very frequently. Deep learning frameworks have to integrate these high-performance devices and computing libraries flexibly and efficiently.
On the one hand, hardware and computing library are not usually one-to-one coresponding relations. For example, in Intel CPU, there are Eigen and MKL computing library. And in Nvidia GPU, there are Eigen and cuDNN computing library. We have to implement specific kernels for an operator for each computing library. On one hand, hardware and computing libraries usually do not have a one-to-one correspondence. For example,Intel CPUs support Eigen and MKL computing libraries while Nvidia GPUs support Eigen and cuDNN computing libraries. We have to implement operator specific kernels for each computing library.
On the other hand, users usually do not want to care about the low-level hardware and computing library when writing a neural network configuration. In Fluid, `Layer` is exposed in `Python`, and `Operator` is exposed in `C++`. Both `Layer` and `Operator` are independent on hardwares. On the other hand, users usually do not want to care about the low-level hardware and computing libraries when writing a neural network configuration. In Fluid, `Layer` is exposed in `Python`, and `Operator` is exposed in `C++`. Both `Layer` and `Operator` are hardware independent.
So, how to support a new Device/Library in Fluid becomes a challenge. So, how to support a new Device/Library in Fluid becomes a challenge.
## Basic: Integrate A New Device/Library ## Basic: Integrate A New Device/Library
For a general overview of fluid, please refer to [overview doc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/read_source.md). For a general overview of fluid, please refer to the [overview doc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/read_source.md).
There are mainly there parts we have to consider in integrating a new device/library: There are mainly three parts that we have to consider while integrating a new device/library:
- Place and DeviceContext: indicates the device id and manages hardware resources - Place and DeviceContext: indicates the device id and manages hardware resources
- Memory and Tensor: malloc/free data on certain device - Memory and Tensor: malloc/free data on certain device
- Math Functor and OpKernel: implement computing unit on certain device/library - Math Functor and OpKernel: implement computing unit on certain devices/libraries
### Place and DeviceContext ### Place and DeviceContext
#### Place #### Place
Fluid use class [Place](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/place.h#L55) to represent specific device and computing library. There are inheritance relationships between different kinds of `Place`. Fluid uses class [Place](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/place.h#L55) to represent different devices and computing libraries. There are inheritance relationships between different kinds of `Place`.
``` ```
| CPUPlace --> MKLDNNPlace | CPUPlace --> MKLDNNPlace
...@@ -43,7 +43,7 @@ typedef boost::variant<CUDAPlace, CPUPlace, FPGAPlace> Place; ...@@ -43,7 +43,7 @@ typedef boost::variant<CUDAPlace, CPUPlace, FPGAPlace> Place;
#### DeviceContext #### DeviceContext
Fluid use class [DeviceContext](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/device_context.h#L30) to manage the resources in certain hardware, such as CUDA stream in `CDUADeviceContext`. There are also inheritance relationships between different kinds of `DeviceContext`. Fluid uses class [DeviceContext](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/device_context.h#L30) to manage the resources in different hardwares, such as CUDA stream in `CDUADeviceContext`. There are also inheritance relationships between different kinds of `DeviceContext`.
``` ```
...@@ -52,7 +52,7 @@ DeviceContext ----> CUDADeviceContext --> CUDNNDeviceContext ...@@ -52,7 +52,7 @@ DeviceContext ----> CUDADeviceContext --> CUDNNDeviceContext
\-> FPGADeviceContext \-> FPGADeviceContext
``` ```
A example of Nvidia GPU is as follows: An example of Nvidia GPU is as follows:
- DeviceContext - DeviceContext
...@@ -93,7 +93,7 @@ class CUDNNDeviceContext : public CUDADeviceContext { ...@@ -93,7 +93,7 @@ class CUDNNDeviceContext : public CUDADeviceContext {
#### memory module #### memory module
Fluid provide following [memory interfaces](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/memory/memory.h#L36): Fluid provides the following [memory interfaces](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/memory/memory.h#L36):
``` ```
template <typename Place> template <typename Place>
...@@ -106,12 +106,12 @@ template <typename Place> ...@@ -106,12 +106,12 @@ template <typename Place>
size_t Used(Place place); size_t Used(Place place);
``` ```
To implementing these interfaces, we have to implement MemoryAllocator for specific Device To implementing these interfaces, we have to implement MemoryAllocator for different Devices
#### Tensor #### Tensor
[Tensor](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/tensor.h#L36) holds data with some shape in certain Place. [Tensor](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/tensor.h#L36) holds data with some shape in a specific Place.
```cpp ```cpp
class Tensor { class Tensor {
...@@ -168,7 +168,7 @@ t.mutable_data(place); ...@@ -168,7 +168,7 @@ t.mutable_data(place);
### Math Functor and OpKernel ### Math Functor and OpKernel
Fluid implements computing unit based on different DeviceContext. Some computing unit is shared between operators. These common part will be put in operators/math directory as basic Functors. Fluid implements computing units based on different DeviceContexts. Some computing units are shared between operators. This common part will be put in operators/math directory as basic Functors.
Let's take [MaxOutFunctor](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/math/maxouting.h#L27) as an example: Let's take [MaxOutFunctor](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/math/maxouting.h#L27) as an example:
...@@ -183,7 +183,7 @@ class MaxOutFunctor { ...@@ -183,7 +183,7 @@ class MaxOutFunctor {
}; };
``` ```
CPU implement in .cc file CPU implemention is in .cc file
``` ```
template <typename T> template <typename T>
...@@ -197,7 +197,7 @@ class MaxOutFunctor<platform::CPUDeviceContext, T> { ...@@ -197,7 +197,7 @@ class MaxOutFunctor<platform::CPUDeviceContext, T> {
}; };
``` ```
CUDA implement in .cu file CUDA implemention is in .cu file
``` ```
template <typename T> template <typename T>
...@@ -212,11 +212,11 @@ class MaxOutFunctor<platform::CUDADeviceContext, T> { ...@@ -212,11 +212,11 @@ class MaxOutFunctor<platform::CUDADeviceContext, T> {
``` ```
We get computing handle from concrete DeviceContext, and make compution on tensors. We get computing handle from a concrete DeviceContext, and make compution on tensors.
The implement of `OpKernel` is similar to math functors, the extra thing we need to do is registering the OpKernel to global map. The implemention of `OpKernel` is similar to math functors, the extra thing we need to do is to register the OpKernel in a global map.
Fluid provides different register interface in op_registry.h Fluid provides different register interfaces in op_registry.h
Let's take [Crop](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/crop_op.cc#L134) operator as an example: Let's take [Crop](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/crop_op.cc#L134) operator as an example:
...@@ -240,7 +240,7 @@ REGISTER_OP_CUDA_KERNEL( ...@@ -240,7 +240,7 @@ REGISTER_OP_CUDA_KERNEL(
## Advanced topics: How to switch between different Device/Library ## Advanced topics: How to switch between different Device/Library
Generally, we will impelement OpKernel for all Device/Library of an Operator. We can easily train a Convolutional Neural Network in GPU. However, some OpKernel is not sutibale in a specific Device. For example, crf operator can be only run at CPU, whereas most other operators can be run at GPU. To achieve high performance in such circumstance, we have to switch between different Device/Library. Generally, we will impelement OpKernel for all Device/Library of an Operator. We can easily train a Convolutional Neural Network in GPU. However, some OpKernel is not sutibale on a specific Device. For example, crf operator can only run on CPU, whereas most other operators can run at GPU. To achieve high performance in such circumstance, we have to switch between different Device/Library.
We will discuss how to implement an efficient OpKernel switch policy. We will discuss how to implement an efficient OpKernel switch policy.
......
...@@ -14,7 +14,7 @@ ...@@ -14,7 +14,7 @@
$ export CUDA_SO="$(\ls usr/lib64/libcuda* | xargs -I{} echo '-v {}:{}') $(\ls /usr/lib64/libnvidia* | xargs -I{} echo '-v {}:{}')" $ export CUDA_SO="$(\ls usr/lib64/libcuda* | xargs -I{} echo '-v {}:{}') $(\ls /usr/lib64/libnvidia* | xargs -I{} echo '-v {}:{}')"
$ export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}') $ export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}')
$ docker run ${CUDA_SO} ${DEVICES} -it paddledev/paddlepaddle:latest-gpu $ docker run ${CUDA_SO} ${DEVICES} -it paddlepaddle/paddle:latest-gpu
更多关于Docker的安装与使用, 请参考 `PaddlePaddle Docker 文档 <http://www.paddlepaddle.org/doc_cn/build_and_install/install/docker_install.html>`_ 。 更多关于Docker的安装与使用, 请参考 `PaddlePaddle Docker 文档 <http://www.paddlepaddle.org/doc_cn/build_and_install/install/docker_install.html>`_ 。
......
...@@ -114,7 +114,7 @@ PaddlePaddle Book是为用户和开发者制作的一个交互式的Jupyter Note ...@@ -114,7 +114,7 @@ PaddlePaddle Book是为用户和开发者制作的一个交互式的Jupyter Note
.. code-block:: bash .. code-block:: bash
nvidia-docker run -it -v $PWD:/work paddledev/paddle:latest-gpu /bin/bash nvidia-docker run -it -v $PWD:/work paddlepaddle/paddle:latest-gpu /bin/bash
**注: 如果没有安装nvidia-docker,可以尝试以下的方法,将CUDA库和Linux设备挂载到Docker容器内:** **注: 如果没有安装nvidia-docker,可以尝试以下的方法,将CUDA库和Linux设备挂载到Docker容器内:**
...@@ -122,7 +122,7 @@ PaddlePaddle Book是为用户和开发者制作的一个交互式的Jupyter Note ...@@ -122,7 +122,7 @@ PaddlePaddle Book是为用户和开发者制作的一个交互式的Jupyter Note
export CUDA_SO="$(\ls /usr/lib64/libcuda* | xargs -I{} echo '-v {}:{}') $(\ls /usr/lib64/libnvidia* | xargs -I{} echo '-v {}:{}')" export CUDA_SO="$(\ls /usr/lib64/libcuda* | xargs -I{} echo '-v {}:{}') $(\ls /usr/lib64/libnvidia* | xargs -I{} echo '-v {}:{}')"
export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}') export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}')
docker run ${CUDA_SO} ${DEVICES} -it paddledev/paddle:latest-gpu docker run ${CUDA_SO} ${DEVICES} -it paddlepaddle/paddle:latest-gpu
**关于AVX:** **关于AVX:**
......
...@@ -122,7 +122,7 @@ GPU driver installed before move on. ...@@ -122,7 +122,7 @@ GPU driver installed before move on.
.. code-block:: bash .. code-block:: bash
nvidia-docker run -it -v $PWD:/work paddledev/paddle:latest-gpu /bin/bash nvidia-docker run -it -v $PWD:/work paddlepaddle/paddle:latest-gpu /bin/bash
**NOTE: If you don't have nvidia-docker installed, try the following method to mount CUDA libs and devices into the container.** **NOTE: If you don't have nvidia-docker installed, try the following method to mount CUDA libs and devices into the container.**
...@@ -130,7 +130,7 @@ GPU driver installed before move on. ...@@ -130,7 +130,7 @@ GPU driver installed before move on.
export CUDA_SO="$(\ls /usr/lib64/libcuda* | xargs -I{} echo '-v {}:{}') $(\ls /usr/lib64/libnvidia* | xargs -I{} echo '-v {}:{}')" export CUDA_SO="$(\ls /usr/lib64/libcuda* | xargs -I{} echo '-v {}:{}') $(\ls /usr/lib64/libnvidia* | xargs -I{} echo '-v {}:{}')"
export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}') export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}')
docker run ${CUDA_SO} ${DEVICES} -it paddledev/paddle:latest-gpu docker run ${CUDA_SO} ${DEVICES} -it paddlepaddle/paddle:latest-gpu
**About AVX:** **About AVX:**
......
...@@ -261,8 +261,12 @@ class GemmConvGradKernel : public framework::OpKernel<T> { ...@@ -261,8 +261,12 @@ class GemmConvGradKernel : public framework::OpKernel<T> {
if (input_grad) { if (input_grad) {
input_grad->mutable_data<T>(context.GetPlace()); input_grad->mutable_data<T>(context.GetPlace());
set_zero(dev_ctx, input_grad, static_cast<T>(0));
// if is_expand is false, the operation of set_zero is unnecessary,
// because math::matmul will reset input_grad.
if (is_expand) {
set_zero(dev_ctx, input_grad, static_cast<T>(0));
}
math::Col2VolFunctor<DeviceContext, T> col2vol; math::Col2VolFunctor<DeviceContext, T> col2vol;
math::Col2ImFunctor<math::ColFormat::kCFO, DeviceContext, T> col2im; math::Col2ImFunctor<math::ColFormat::kCFO, DeviceContext, T> col2im;
......
...@@ -225,7 +225,6 @@ class GemmConvTransposeGradKernel : public framework::OpKernel<T> { ...@@ -225,7 +225,6 @@ class GemmConvTransposeGradKernel : public framework::OpKernel<T> {
if (input_grad) { if (input_grad) {
input_grad->mutable_data<T>(context.GetPlace()); input_grad->mutable_data<T>(context.GetPlace());
set_zero(dev_ctx, input_grad, static_cast<T>(0));
} }
if (filter_grad) { // filter size (m, c, k_h, k_w) if (filter_grad) { // filter size (m, c, k_h, k_w)
filter_grad->mutable_data<T>(context.GetPlace()); filter_grad->mutable_data<T>(context.GetPlace());
......
...@@ -273,6 +273,13 @@ void set_constant_with_place<platform::GPUPlace>( ...@@ -273,6 +273,13 @@ void set_constant_with_place<platform::GPUPlace>(
TensorSetConstantGPU(context, tensor, value)); TensorSetConstantGPU(context, tensor, value));
} }
template <>
void set_constant_with_place<platform::CudnnPlace>(
const platform::DeviceContext& context, framework::Tensor* tensor,
float value) {
set_constant_with_place<platform::GPUPlace>(context, tensor, value);
}
template struct RowwiseAdd<platform::CUDADeviceContext, float>; template struct RowwiseAdd<platform::CUDADeviceContext, float>;
template struct RowwiseAdd<platform::CUDADeviceContext, double>; template struct RowwiseAdd<platform::CUDADeviceContext, double>;
template struct ColwiseSum<platform::CUDADeviceContext, float>; template struct ColwiseSum<platform::CUDADeviceContext, float>;
......
...@@ -37,18 +37,23 @@ class ReduceOp : public framework::OperatorWithKernel { ...@@ -37,18 +37,23 @@ class ReduceOp : public framework::OperatorWithKernel {
PADDLE_ENFORCE_LT( PADDLE_ENFORCE_LT(
dim, x_rank, dim, x_rank,
"The dim should be in the range [-rank(input), rank(input))."); "The dim should be in the range [-rank(input), rank(input)).");
bool keep_dim = ctx->Attrs().Get<bool>("keep_dim"); bool reduce_all = ctx->Attrs().Get<bool>("reduce_all");
auto dims_vector = vectorize(x_dims); if (reduce_all) {
if (keep_dim || x_rank == 1) { ctx->SetOutputDim("Out", {1});
dims_vector[dim] = 1;
} else { } else {
dims_vector.erase(dims_vector.begin() + dim); bool keep_dim = ctx->Attrs().Get<bool>("keep_dim");
} auto dims_vector = vectorize(x_dims);
auto out_dims = framework::make_ddim(dims_vector); if (keep_dim || x_rank == 1) {
ctx->SetOutputDim("Out", out_dims); dims_vector[dim] = 1;
if (dim != 0) { } else {
// Only pass LoD when not reducing on the first dim. dims_vector.erase(dims_vector.begin() + dim);
ctx->ShareLoD("X", /*->*/ "Out"); }
auto out_dims = framework::make_ddim(dims_vector);
ctx->SetOutputDim("Out", out_dims);
if (dim != 0) {
// Only pass LoD when not reducing on the first dim.
ctx->ShareLoD("X", /*->*/ "Out");
}
} }
} }
}; };
...@@ -95,11 +100,16 @@ class ReduceOpMaker : public framework::OpProtoAndCheckerMaker { ...@@ -95,11 +100,16 @@ class ReduceOpMaker : public framework::OpProtoAndCheckerMaker {
"(bool, default false) " "(bool, default false) "
"If true, retain the reduced dimension with length 1.") "If true, retain the reduced dimension with length 1.")
.SetDefault(false); .SetDefault(false);
AddAttr<bool>("reduce_all",
"(bool, default false) "
"If true, output a scalar reduced along all dimensions.")
.SetDefault(false);
comment_ = R"DOC( comment_ = R"DOC(
{ReduceOp} Operator. {ReduceOp} Operator.
This operator computes the {reduce} of input tensor along the given dimension. This operator computes the {reduce} of input tensor along the given dimension.
The result tensor has 1 fewer dimension than the input unless keep_dim is true. The result tensor has 1 fewer dimension than the input unless keep_dim is true.
If reduce_all is true, just reduce along all dimensions and output a scalar.
)DOC"; )DOC";
AddComment(comment_); AddComment(comment_);
......
...@@ -26,10 +26,12 @@ using DDim = framework::DDim; ...@@ -26,10 +26,12 @@ using DDim = framework::DDim;
template <typename T, size_t D, int MajorType = Eigen::RowMajor, template <typename T, size_t D, int MajorType = Eigen::RowMajor,
typename IndexType = Eigen::DenseIndex> typename IndexType = Eigen::DenseIndex>
using EigenTensor = framework::EigenTensor<T, D, MajorType, IndexType>; using EigenTensor = framework::EigenTensor<T, D, MajorType, IndexType>;
template <typename T, int MajorType = Eigen::RowMajor, template <typename T, int MajorType = Eigen::RowMajor,
typename IndexType = Eigen::DenseIndex> typename IndexType = Eigen::DenseIndex>
using EigenScalar = framework::EigenScalar<T, MajorType, IndexType>; using EigenScalar = framework::EigenScalar<T, MajorType, IndexType>;
template <typename T, int MajorType = Eigen::RowMajor,
typename IndexType = Eigen::DenseIndex>
using EigenVector = framework::EigenVector<T, MajorType, IndexType>;
struct SumFunctor { struct SumFunctor {
template <typename DeviceContext, typename X, typename Y, typename Dim> template <typename DeviceContext, typename X, typename Y, typename Dim>
...@@ -95,26 +97,41 @@ template <typename DeviceContext, typename T, typename Functor> ...@@ -95,26 +97,41 @@ template <typename DeviceContext, typename T, typename Functor>
class ReduceKernel : public framework::OpKernel<T> { class ReduceKernel : public framework::OpKernel<T> {
public: public:
void Compute(const framework::ExecutionContext& context) const override { void Compute(const framework::ExecutionContext& context) const override {
int rank = context.Input<Tensor>("X")->dims().size(); bool reduce_all = context.Attr<bool>("reduce_all");
switch (rank) { if (reduce_all) {
case 1: // Flatten and reduce 1-D tensor
ReduceCompute<1>(context); auto* input = context.Input<Tensor>("X");
break; auto* output = context.Output<Tensor>("Out");
case 2: output->mutable_data<T>(context.GetPlace());
ReduceCompute<2>(context); auto x = EigenVector<T>::Flatten(*input);
break; auto out = EigenScalar<T>::From(*output);
case 3: auto& place =
ReduceCompute<3>(context); *context.template device_context<DeviceContext>().eigen_device();
break; auto reduce_dim = Eigen::array<int, 1>({{0}});
case 4: Functor functor;
ReduceCompute<4>(context); functor(place, x, out, reduce_dim);
break; } else {
case 5: int rank = context.Input<Tensor>("X")->dims().size();
ReduceCompute<5>(context); switch (rank) {
break; case 1:
case 6: ReduceCompute<1>(context);
ReduceCompute<6>(context); break;
break; case 2:
ReduceCompute<2>(context);
break;
case 3:
ReduceCompute<3>(context);
break;
case 4:
ReduceCompute<4>(context);
break;
case 5:
ReduceCompute<5>(context);
break;
case 6:
ReduceCompute<6>(context);
break;
}
} }
} }
...@@ -157,26 +174,46 @@ template <typename DeviceContext, typename T, typename Functor> ...@@ -157,26 +174,46 @@ template <typename DeviceContext, typename T, typename Functor>
class ReduceGradKernel : public framework::OpKernel<T> { class ReduceGradKernel : public framework::OpKernel<T> {
public: public:
void Compute(const framework::ExecutionContext& context) const override { void Compute(const framework::ExecutionContext& context) const override {
int rank = context.Input<Tensor>("X")->dims().size(); bool reduce_all = context.Attr<bool>("reduce_all");
switch (rank) { if (reduce_all) {
case 1: auto* input0 = context.Input<Tensor>("X");
ReduceGradCompute<1>(context); auto* input1 = context.Input<Tensor>("Out");
break; auto* input2 = context.Input<Tensor>(framework::GradVarName("Out"));
case 2: auto* output = context.Output<Tensor>(framework::GradVarName("X"));
ReduceGradCompute<2>(context); output->mutable_data<T>(context.GetPlace());
break; auto x = EigenVector<T>::Flatten(*input0);
case 3: auto x_reduce = EigenVector<T>::From(*input1);
ReduceGradCompute<3>(context); auto x_reduce_grad = EigenVector<T>::From(*input2);
break; auto x_grad = EigenVector<T>::Flatten(*output);
case 4: auto& place =
ReduceGradCompute<4>(context); *context.template device_context<DeviceContext>().eigen_device();
break; auto broadcast_dim =
case 5: Eigen::array<int, 1>({{static_cast<int>(input0->numel())}});
ReduceGradCompute<5>(context); Functor functor;
break; functor(place, x, x_reduce, x_grad, x_reduce_grad, broadcast_dim,
case 6: broadcast_dim[0]);
ReduceGradCompute<6>(context); } else {
break; int rank = context.Input<Tensor>("X")->dims().size();
switch (rank) {
case 1:
ReduceGradCompute<1>(context);
break;
case 2:
ReduceGradCompute<2>(context);
break;
case 3:
ReduceGradCompute<3>(context);
break;
case 4:
ReduceGradCompute<4>(context);
break;
case 5:
ReduceGradCompute<5>(context);
break;
case 6:
ReduceGradCompute<6>(context);
break;
}
} }
} }
......
...@@ -125,6 +125,22 @@ cudnnHandle_t CUDADeviceContext::cudnn_handle() const { return cudnn_handle_; } ...@@ -125,6 +125,22 @@ cudnnHandle_t CUDADeviceContext::cudnn_handle() const { return cudnn_handle_; }
cudaStream_t CUDADeviceContext::stream() const { return stream_; } cudaStream_t CUDADeviceContext::stream() const { return stream_; }
CudnnDeviceContext::CudnnDeviceContext(CudnnPlace place)
: CUDADeviceContext(place), place_(place) {
PADDLE_ENFORCE(dynload::cudnnCreate(&cudnn_handle_));
PADDLE_ENFORCE(dynload::cudnnSetStream(cudnn_handle_, stream()));
}
CudnnDeviceContext::~CudnnDeviceContext() {
SetDeviceId(place_.device);
Wait();
PADDLE_ENFORCE(dynload::cudnnDestroy(cudnn_handle_));
}
Place CudnnDeviceContext::GetPlace() const { return CudnnPlace(); }
cudnnHandle_t CudnnDeviceContext::cudnn_handle() const { return cudnn_handle_; }
#endif #endif
} // namespace platform } // namespace platform
......
...@@ -86,6 +86,22 @@ class CUDADeviceContext : public DeviceContext { ...@@ -86,6 +86,22 @@ class CUDADeviceContext : public DeviceContext {
cublasHandle_t cublas_handle_; cublasHandle_t cublas_handle_;
}; };
class CudnnDeviceContext : public CUDADeviceContext {
public:
explicit CudnnDeviceContext(CudnnPlace place);
virtual ~CudnnDeviceContext();
/*! \brief Return place in the device context. */
Place GetPlace() const final;
/*! \brief Return cudnn handle in the device context. */
cudnnHandle_t cudnn_handle() const;
private:
cudnnHandle_t cudnn_handle_;
CudnnPlace place_;
};
#endif #endif
} // namespace platform } // namespace platform
......
...@@ -46,3 +46,19 @@ TEST(Device, CUDADeviceContext) { ...@@ -46,3 +46,19 @@ TEST(Device, CUDADeviceContext) {
delete device_context; delete device_context;
} }
} }
TEST(Device, CudnnDeviceContext) {
using paddle::platform::CudnnDeviceContext;
using paddle::platform::CudnnPlace;
if (paddle::platform::dynload::HasCUDNN()) {
int count = paddle::platform::GetCUDADeviceCount();
for (int i = 0; i < count; ++i) {
CudnnDeviceContext* device_context =
new CudnnDeviceContext(CudnnPlace(i));
cudnnHandle_t cudnn_handle = device_context->cudnn_handle();
ASSERT_NE(nullptr, cudnn_handle);
ASSERT_NE(nullptr, device_context->stream());
delete device_context;
}
}
}
...@@ -51,6 +51,11 @@ struct GPUPlace { ...@@ -51,6 +51,11 @@ struct GPUPlace {
int device; int device;
}; };
struct CudnnPlace : public GPUPlace {
CudnnPlace() : GPUPlace() {}
explicit CudnnPlace(int d) : GPUPlace(d) {}
};
struct IsGPUPlace : public boost::static_visitor<bool> { struct IsGPUPlace : public boost::static_visitor<bool> {
bool operator()(const CPUPlace &) const { return false; } bool operator()(const CPUPlace &) const { return false; }
bool operator()(const MKLDNNPlace &) const { return false; } bool operator()(const MKLDNNPlace &) const { return false; }
...@@ -67,7 +72,7 @@ struct IsMKLDNNPlace : public boost::static_visitor<bool> { ...@@ -67,7 +72,7 @@ struct IsMKLDNNPlace : public boost::static_visitor<bool> {
// should be less equal than 2^(NUM_PLACE_TYPE_LIMIT_IN_BIT) // should be less equal than 2^(NUM_PLACE_TYPE_LIMIT_IN_BIT)
#define NUM_PLACE_TYPE_LIMIT_IN_BIT 4 #define NUM_PLACE_TYPE_LIMIT_IN_BIT 4
typedef boost::variant<GPUPlace, CPUPlace, MKLDNNPlace> Place; typedef boost::variant<CudnnPlace, GPUPlace, CPUPlace> Place;
// static check number of place types is less equal than // static check number of place types is less equal than
// 2^(NUM_PLACE_TYPE_LIMIT_IN_BIT) // 2^(NUM_PLACE_TYPE_LIMIT_IN_BIT)
......
# Build this image: docker build -t mpi . # Build this image: docker build -t mpi .
# #
FROM paddledev/paddle:0.10.0rc3 FROM paddlepaddle/paddle:0.10.0rc3
ENV DEBIAN_FRONTEND noninteractive ENV DEBIAN_FRONTEND noninteractive
......
...@@ -5,4 +5,4 @@ docker run --rm \ ...@@ -5,4 +5,4 @@ docker run --rm \
-e "WITH_AVX=ON" \ -e "WITH_AVX=ON" \
-e "WITH_DOC=ON" \ -e "WITH_DOC=ON" \
-e "WOBOQ=ON" \ -e "WOBOQ=ON" \
${1:-"paddledev/paddle:dev"} ${1:-"paddlepaddle/paddle:latest-dev"}
...@@ -85,5 +85,19 @@ class Test1DReduce(OpTest): ...@@ -85,5 +85,19 @@ class Test1DReduce(OpTest):
self.check_grad(['X'], 'Out') self.check_grad(['X'], 'Out')
class TestReduceAll(OpTest):
def setUp(self):
self.op_type = "reduce_sum"
self.inputs = {'X': np.random.random((5, 6, 2, 10)).astype("float32")}
self.attrs = {'reduce_all': True}
self.outputs = {'Out': self.inputs['X'].sum()}
def test_check_output(self):
self.check_output()
def test_check_grad(self):
self.check_grad(['X'], 'Out')
if __name__ == '__main__': if __name__ == '__main__':
unittest.main() unittest.main()
...@@ -383,19 +383,22 @@ class Parameters(object): ...@@ -383,19 +383,22 @@ class Parameters(object):
params.deserialize(param_name, f) params.deserialize(param_name, f)
return params return params
def init_from_tar(self, f): def init_from_tar(self, f, exclude_params=[]):
""" """
Different from `from_tar`, this interface can be used to Different from `from_tar`, this interface can be used to
init partial network parameters from another saved model. init partial network parameters from another saved model.
:param f: the initialized model file. :param f: the initialized model file.
:type f: tar file :type f: tar file
:param exclude_params: the names of parameters that should
not be initialized from the model file.
:type exclude_params: list of strings
:return: Nothing. :return: Nothing.
""" """
tar_param = Parameters.from_tar(f) tar_param = Parameters.from_tar(f)
for pname in tar_param.names(): for pname in tar_param.names():
if pname in self.names(): if pname in self.names() and pname not in exclude_params:
self.set(pname, tar_param.get(pname)) self.set(pname, tar_param.get(pname))
......
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册