Commit 048254da, authored by guosheng

Merge branch 'develop' of https://github.com/PaddlePaddle/paddle into enhance-exclude-param-init

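# How to Write a New Operator # How to Write a New Operator
- [Concepts](#概念简介) - [Concepts](#概念简介)
- [Implementing the C++ Classes](#实现C++) - [Implementing the C++ Classes](#实现c)
- [Defining the ProtoMaker Class](#定义ProtoMaker类) - [Defining the ProtoMaker Class](#定义protomaker类)
- [Defining the Operator Class](#定义Operator类) - [Defining the Operator Class](#定义operator类)
- [Defining the OpKernel Class](#定义OpKernel类) - [Defining the OpKernel Class](#定义opkernel类)
- [Registering the Operator](#注册Operator) - [Registering the Operator](#注册operator)
- [Compilation](#编译) - [Compilation](#编译)
- [Python Binding](#绑定Python) - [Python Binding](#绑定python)
- [Implementing Unit Tests](#实现单元测试) - [Implementing Unit Tests](#实现单元测试)
- [Testing Forward Operators](#前向Operator单测) - [Testing Forward Operators](#前向operator单测)
- [Testing Backward Operators](#反向Operator单测) - [Testing Backward Operators](#反向operator单测)
- [Compiling and Running](#编译和执行) - [Compiling and Running](#编译和执行)
- [Notes](#注意事项)
## Concepts ## Concepts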
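...@@ -43,7 +44,7 @@ Kernel implementation | Kernels shared by CPU and CUDA are implemented in the `.h` file; otherwise, CPU ...@@ -43,7 +44,7 @@ Kernel implementation | Kernels shared by CPU and CUDA are implemented in the `.h` file; otherwise, CPU
## Implementing the C++ Classes ## Implementing the C++ Classes
### 1. Defining the ProtoMaker Class ### Defining the ProtoMaker Class
Matrix multiplication can be written as $Out = X * Y$, so the operation takes two inputs and produces one output. Matrix multiplication can be written as $Out = X * Y$, so the operation takes two inputs and produces one output.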
...@@ -100,7 +101,7 @@ The equation is: Out = scale*X ...@@ -100,7 +101,7 @@ The equation is: Out = scale*X
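- `AddAttr<AttrType>("scale", "...").SetDefault(1.0);` : adds the `scale` coefficient as an attribute and sets its default value to 1.0. - `AddAttr<AttrType>("scale", "...").SetDefault(1.0);` : adds the `scale` coefficient as an attribute and sets its default value to 1.0.
### 2. Defining the Operator Class ### Defining the Operator Class
The following code defines MulOp: The following code defines MulOp: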
...@@ -149,7 +150,7 @@ MulOp(const std::string &type, const framework::VariableNameMap &inputs, ...@@ -149,7 +150,7 @@ MulOp(const std::string &type, const framework::VariableNameMap &inputs,
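Usually the definitions of the `OpProtoMaker` and `Op` classes are written in the `.cc` file, together with the registration functions introduced below. Usually the definitions of the `OpProtoMaker` and `Op` classes are written in the `.cc` file, together with the registration functions introduced below.
### 3. Defining the OpKernel Class ### Defining the OpKernel Class
`MulKernel` inherits from `framework::OpKernel` and takes the following two template parameters: `MulKernel` inherits from `framework::OpKernel` and takes the following two template parameters: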
...@@ -177,6 +178,7 @@ MulOp(const std::string &type, const framework::VariableNameMap &inputs, ...@@ -177,6 +178,7 @@ MulOp(const std::string &type, const framework::VariableNameMap &inputs,
math::matmul<DeviceContext, T>(*X, false, *Y, false, 1, Z, 0, device_context); math::matmul<DeviceContext, T>(*X, false, *Y, false, 1, Z, 0, device_context);
} }
}; };
```
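Note: **different devices (CPU, CUDA) share one Op definition; whether they also share the same `OpKernel` depends on whether the functions called by `Compute` support different devices.** Note: **different devices (CPU, CUDA) share one Op definition; whether they also share the same `OpKernel` depends on whether the functions called by `Compute` support different devices.**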
...@@ -188,7 +190,7 @@ MulOp(const std::string &type, const framework::VariableNameMap &inputs, ...@@ -188,7 +190,7 @@ MulOp(const std::string &type, const framework::VariableNameMap &inputs,
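This completes the forward Op implementation. Next, the op and its kernel need to be registered in the `.cc` file. This completes the forward Op implementation. Next, the op and its kernel need to be registered in the `.cc` file.
The backward Op class and the backward OpKernel are defined in much the same way as the forward Op, so they are not repeated here. **Note, however, that a backward Op has no `ProtoMaker`.** The backward Op class and the backward OpKernel are defined in much the same way as the forward Op, so they are not repeated here. **Note, however, that a backward Op has no `ProtoMaker`.**
### 4. Registering the Operator ### Registering the Operator
- In the `.cc` file, register the forward and backward Op classes and the CPU kernel. - In the `.cc` file, register the forward and backward Op classes and the CPU kernel.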
...@@ -220,7 +222,7 @@ MulOp(const std::string &type, const framework::VariableNameMap &inputs, ...@@ -220,7 +222,7 @@ MulOp(const std::string &type, const framework::VariableNameMap &inputs,
ops::MulGradKernel<paddle::platform::CUDADeviceContext, float>); ops::MulGradKernel<paddle::platform::CUDADeviceContext, float>);
``` ```
### 5. Compilation ### Compilation
Run the following command to compile: Run the following command to compile:
...@@ -236,6 +238,7 @@ make mul_op ...@@ -236,6 +238,7 @@ make mul_op
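Unit tests cover comparing the forward Op's implementations across devices (CPU, CUDA), comparing the backward Op's implementations across devices (CPU, CUDA), and gradient checks for the backward Op. The following introduces the [unit test for `MulOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/framework/tests/test_mul_op.py). Unit tests cover comparing the forward Op's implementations across devices (CPU, CUDA), comparing the backward Op's implementations across devices (CPU, CUDA), and gradient checks for the backward Op. The following introduces the [unit test for `MulOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/framework/tests/test_mul_op.py).
### Testing Forward Operators
An Op unit test inherits from `OpTest`. The more specific tests are implemented in `TestMulOp`. Testing an Operator requires: An Op unit test inherits from `OpTest`. The more specific tests are implemented in `TestMulOp`. Testing an Operator requires:
...@@ -273,8 +276,7 @@ An Op unit test inherits from `OpTest`. The more specific tests are implemented in `TestMulOp ...@@ -273,8 +276,7 @@ An Op unit test inherits from `OpTest`. The more specific tests are implemented in `TestMulOp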
def test_check_grad_ingore_y(self): def test_check_grad_ingore_y(self):
self.check_grad( self.check_grad(
['X'], 'Out', max_relative_error=0.5, no_grad_set=set('Y')) ['X'], 'Out', max_relative_error=0.5, no_grad_set=set('Y'))
```
```
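The code above first imports the required packages. The important variables set in the `setUp` function are explained below: The code above first imports the required packages. The important variables set in the `setUp` function are explained below:
...@@ -282,6 +284,8 @@ An Op unit test inherits from `OpTest`. The more specific tests are implemented in `TestMulOp ...@@ -282,6 +284,8 @@ An Op unit test inherits from `OpTest`. The more specific tests are implemented in `TestMulOp
- `self.inputs` : defines the inputs, of type `numpy.array`, and initializes them. - `self.inputs` : defines the inputs, of type `numpy.array`, and initializes them.
- `self.outputs` : defines the outputs by performing the same computation as the operator in the Python script and returning the Python-side result. - `self.outputs` : defines the outputs by performing the same computation as the operator in the Python script and returning the Python-side result.
### Testing Backward Operators
In the backward tests: In the backward tests:
- `test_check_grad_normal` calls `check_grad`, which uses numerical methods to check the correctness and stability of the gradients. - `test_check_grad_normal` calls `check_grad`, which uses numerical methods to check the correctness and stability of the gradients.
- The first argument `["X", "Y"]` : specifies that gradients are checked for the input variables `X` and `Y`. - The first argument `["X", "Y"]` : specifies that gradients are checked for the input variables `X` and `Y`.
...@@ -290,7 +294,7 @@ An Op unit test inherits from `OpTest`. The more specific tests are implemented in `TestMulOp ...@@ -290,7 +294,7 @@ An Op unit test inherits from `OpTest`. The more specific tests are implemented in `TestMulOp
- The `test_check_grad_ingore_x` and `test_check_grad_ingore_y` branches test the cases where the gradient of only one input needs to be computed. - The `test_check_grad_ingore_x` and `test_check_grad_ingore_y` branches test the cases where the gradient of only one input needs to be computed.
### Compiling and Running the Unit Tests ### Compiling and Running
New `test_*.py` unit tests added under the `python/paddle/v2/framework/tests` directory are automatically added to the project build. New `test_*.py` unit tests added under the `python/paddle/v2/framework/tests` directory are automatically added to the project build.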
......
# How to write a new operator # How to write a new operator
- [Background](#background) - [Background](#background)
- [Implementing C++ Types](#implementing-c++-types) - [Implementing C++ Types](#implementing-c-types)
- [Defining ProtoMaker](#defining-protoMaker) - [Defining ProtoMaker](#defining-protomaker)
- [Defining Operator](#defining-operator) - [Defining Operator](#defining-operator)
- [Registering Operator](#registering-operator) - [Registering Operator](#registering-operator)
- [Compilation](#compilation) - [Compilation](#compilation)
...@@ -41,7 +41,7 @@ Let's take matrix multiplication operator, [MulOp](https://github.com/PaddlePadd ...@@ -41,7 +41,7 @@ Let's take matrix multiplication operator, [MulOp](https://github.com/PaddlePadd
## Implementing C++ Types ## Implementing C++ Types
### 1. Defining Class ProtoMaker ### Defining ProtoMaker
Matrix Multiplication can be written as $Out = X * Y$, meaning that the operation consists of two inputs and one output. Matrix Multiplication can be written as $Out = X * Y$, meaning that the operation consists of two inputs and one output.
...@@ -98,7 +98,7 @@ There are two changes in this example: ...@@ -98,7 +98,7 @@ There are two changes in this example:
- `AddAttr<AttrType>("scale", "...").SetDefault(1.0);` adds the `scale` constant as an attribute, and sets the default value to 1.0. - `AddAttr<AttrType>("scale", "...").SetDefault(1.0);` adds the `scale` constant as an attribute, and sets the default value to 1.0.
### 2. Defining Operator ### Defining Operator
The following code defines the interface for MulOp: The following code defines the interface for MulOp:
...@@ -147,7 +147,7 @@ MulOp(const std::string &type, const framework::VariableNameMap &inputs, ...@@ -147,7 +147,7 @@ MulOp(const std::string &type, const framework::VariableNameMap &inputs,
Usually `OpProtoMaker` and `Op`'s type definitions are written in `.cc` files, which also include the registration methods introduced later. Usually `OpProtoMaker` and `Op`'s type definitions are written in `.cc` files, which also include the registration methods introduced later.
### 3. Defining OpKernel ### Defining OpKernel
`MulKernel` inherits `framework::OpKernel`, which includes the following templates: `MulKernel` inherits `framework::OpKernel`, which includes the following templates:
...@@ -188,7 +188,7 @@ This concludes the forward implementation of an operator. Next its operation and ...@@ -188,7 +188,7 @@ This concludes the forward implementation of an operator. Next its operation and
The definition of its corresponding backward operator, if applicable, is similar to that of a forward operator. **Note that a backward operator does not include a `ProtoMaker`**. The definition of its corresponding backward operator, if applicable, is similar to that of a forward operator. **Note that a backward operator does not include a `ProtoMaker`**.
### 4. Registering Operator ### Registering Operator
- In `.cc` files, register forward and backward operator classes and the CPU kernel. - In `.cc` files, register forward and backward operator classes and the CPU kernel.
...@@ -220,7 +220,7 @@ The definition of its corresponding backward operator, if applicable, is similar ...@@ -220,7 +220,7 @@ The definition of its corresponding backward operator, if applicable, is similar
ops::MulGradKernel<paddle::platform::CUDADeviceContext, float>); ops::MulGradKernel<paddle::platform::CUDADeviceContext, float>);
``` ```
### 5. Compilation ### Compilation
Run the following commands to compile. Run the following commands to compile.
...@@ -284,8 +284,7 @@ A forward operator unit test inherits `unittest.TestCase` and defines metaclass ...@@ -284,8 +284,7 @@ A forward operator unit test inherits `unittest.TestCase` and defines metaclass
def test_check_grad_ingore_y(self): def test_check_grad_ingore_y(self):
self.check_grad( self.check_grad(
['X'], 'Out', max_relative_error=0.5, no_grad_set=set('Y')) ['X'], 'Out', max_relative_error=0.5, no_grad_set=set('Y'))
```
```
Get its output, and compare it with the forward operator's own output. Get its output, and compare it with the forward operator's own output.
The code above first loads required packages. In addition, we have The code above first loads required packages. In addition, we have
...@@ -294,6 +293,8 @@ The code above first loads required packages. In addition, we have ...@@ -294,6 +293,8 @@ The code above first loads required packages. In addition, we have
- `self.inputs` defines the input, of type `numpy.array`, and initializes it. - `self.inputs` defines the input, of type `numpy.array`, and initializes it.
- `self.outputs` defines the output by performing the same computation as the operator in the Python script and returning that result. - `self.outputs` defines the output by performing the same computation as the operator in the Python script and returning that result.
### Testing Backward Operators
Some key points in checking gradient above include: Some key points in checking gradient above include:
- `test_normal` calls `check_grad` to validate the gradients' correctness and stability through numerical methods. - `test_normal` calls `check_grad` to validate the gradients' correctness and stability through numerical methods.
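
To make "numeric methods" concrete, here is a minimal, self-contained C++ sketch of the general idea behind a numerical gradient check for $Out = X * Y$. It is not the code behind `check_grad`: the helper names (`matmul`, `sum_loss`, `analytic_grad_x`, `numeric_grad_x`, `max_relative_error`), the sum-of-outputs loss, and the step size `delta` are all assumptions chosen for illustration. The only link to the test above is the comparison of an analytic gradient against a finite-difference estimate under a relative-error bound.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // row-major storage (illustrative type)

// out (m x n) = x (m x k) * y (k x n)
Matrix matmul(const Matrix& x, const Matrix& y,
              std::size_t m, std::size_t k, std::size_t n) {
  assert(x.size() == m * k && y.size() == k * n);
  Matrix out(m * n, 0.0);
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t p = 0; p < k; ++p)
      for (std::size_t j = 0; j < n; ++j)
        out[i * n + j] += x[i * k + p] * y[p * n + j];
  return out;
}

// Assumed scalar loss for the check: the sum of all entries of X * Y.
double sum_loss(const Matrix& x, const Matrix& y,
                std::size_t m, std::size_t k, std::size_t n) {
  double total = 0.0;
  for (double v : matmul(x, y, m, k, n)) total += v;
  return total;
}

// Analytic gradient of that loss: d loss / d X[i][p] = sum_j Y[p][j].
Matrix analytic_grad_x(const Matrix& y,
                       std::size_t m, std::size_t k, std::size_t n) {
  Matrix grad(m * k, 0.0);
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t p = 0; p < k; ++p)
      for (std::size_t j = 0; j < n; ++j)
        grad[i * k + p] += y[p * n + j];
  return grad;
}

// Central-difference estimate: perturb one entry of X at a time.
Matrix numeric_grad_x(Matrix x, const Matrix& y, std::size_t m, std::size_t k,
                      std::size_t n, double delta = 1e-3) {
  Matrix grad(x.size());
  for (std::size_t idx = 0; idx < x.size(); ++idx) {
    const double original = x[idx];
    x[idx] = original + delta;
    const double plus = sum_loss(x, y, m, k, n);
    x[idx] = original - delta;
    const double minus = sum_loss(x, y, m, k, n);
    x[idx] = original;  // restore before moving to the next entry
    grad[idx] = (plus - minus) / (2.0 * delta);
  }
  return grad;
}

// Largest element-wise relative difference between the two gradients.
double max_relative_error(const Matrix& analytic, const Matrix& numeric) {
  assert(analytic.size() == numeric.size());
  double worst = 0.0;
  for (std::size_t i = 0; i < analytic.size(); ++i) {
    const double scale =
        std::max({std::fabs(analytic[i]), std::fabs(numeric[i]), 1e-12});
    worst = std::max(worst, std::fabs(analytic[i] - numeric[i]) / scale);
  }
  return worst;
}
```

A caller would compute both gradients for the same `X` and `Y` and fail the test when `max_relative_error` exceeds a chosen bound, which is the role the `max_relative_error` argument plays in the Python test shown earlier.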
......
...@@ -28,6 +28,7 @@ limitations under the License. */ ...@@ -28,6 +28,7 @@ limitations under the License. */
#include "hl_top_k.h" #include "hl_top_k.h"
#include "paddle/utils/Logging.h" #include "paddle/utils/Logging.h"
#include "NEONFunctions.h"
#include "paddle/function/GemmFunctor.h" #include "paddle/function/GemmFunctor.h"
#include "paddle/utils/ThreadLocal.h" #include "paddle/utils/ThreadLocal.h"
...@@ -4165,16 +4166,36 @@ void CpuMatrix::print(std::ostream& os) const { ...@@ -4165,16 +4166,36 @@ void CpuMatrix::print(std::ostream& os) const {
void CpuMatrix::paramReluForward(Matrix& data, Matrix& W) { void CpuMatrix::paramReluForward(Matrix& data, Matrix& W) {
real* input = data.getData(); real* input = data.getData();
real* w = W.getData(); real* w = W.getData();
real* output = data_;
size_t numElements = data.getWidth(); size_t numElements = data.getWidth();
size_t numSamples = data.getHeight(); size_t numSamples = data.getHeight();
size_t paraSize = W.getHeight() * W.getWidth(); size_t paraSize = W.getHeight() * W.getWidth();
CHECK(!(numElements % paraSize)); // this check from ParameterReluLayer::init CHECK(!(numElements % paraSize)); // this check from ParameterReluLayer::init
size_t partial_sum = numElements / paraSize; size_t partial_sum = numElements / paraSize;
if (paraSize == numElements) {
for (size_t n = 0; n < numSamples * numElements; ++n) {
output[n] = input[n] > 0 ? input[n] : input[n] * w[n % numElements];
}
return;
}
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
for (size_t n = 0; n < numSamples; ++n) {
for (size_t i = 0; i < paraSize; i++) {
neon::prelu(
input + i * partial_sum, w[i], output + i * partial_sum, partial_sum);
}
input = input + numElements;
output = output + numElements;
}
#else
for (size_t n = 0, k = 0; n < numSamples; ++n) { for (size_t n = 0, k = 0; n < numSamples; ++n) {
for (size_t i = 0; i < numElements; ++i, ++k) { for (size_t i = 0; i < numElements; ++i, ++k) {
data_[k] = input[k] > 0 ? input[k] : input[k] * w[i / partial_sum]; output[k] = input[k] > 0 ? input[k] : input[k] * w[i / partial_sum];
} }
} }
#endif
} }
void CpuMatrix::paramReluBackwardW(Matrix& oGrad, Matrix& data) { void CpuMatrix::paramReluBackwardW(Matrix& oGrad, Matrix& data) {
......
...@@ -49,6 +49,46 @@ void relu(const float* a, float* b, int len) { ...@@ -49,6 +49,46 @@ void relu(const float* a, float* b, int len) {
} }
} }
// b[i] = a[i] > 0.0f ? a[i] : a[i] * w
void prelu(const float* a, float w, float* b, int len) {
int offset = len % 16;
float32x4_t ma0, ma1, ma2, ma3;
float32x4_t zero = vdupq_n_f32(0.f);
float32x4_t vw = vdupq_n_f32(w);
for (int k = 0; k < len / 16; k++, a += 16, b += 16) {
ma0 = vld1q_f32(a);
ma1 = vld1q_f32(a + 4);
ma2 = vld1q_f32(a + 8);
ma3 = vld1q_f32(a + 12);
uint32x4_t flag0 = vcgtq_f32(ma0, zero);
uint32x4_t flag1 = vcgtq_f32(ma1, zero);
uint32x4_t flag2 = vcgtq_f32(ma2, zero);
uint32x4_t flag3 = vcgtq_f32(ma3, zero);
float32x4_t mul0 = vmulq_f32(ma0, vw);
float32x4_t mul1 = vmulq_f32(ma1, vw);
float32x4_t mul2 = vmulq_f32(ma2, vw);
float32x4_t mul3 = vmulq_f32(ma3, vw);
ma0 = vbslq_f32(flag0, ma0, mul0);
ma1 = vbslq_f32(flag1, ma1, mul1);
ma2 = vbslq_f32(flag2, ma2, mul2);
ma3 = vbslq_f32(flag3, ma3, mul3);
vst1q_f32(b, ma0);
vst1q_f32(b + 4, ma1);
vst1q_f32(b + 8, ma2);
vst1q_f32(b + 12, ma3);
}
for (int i = 0; i < offset; i++) {
b[i] = a[i] > 0.0f ? a[i] : a[i] * w;
}
}
} // namespace neon } // namespace neon
} // namespace paddle } // namespace paddle
......
...@@ -18,6 +18,7 @@ namespace paddle { ...@@ -18,6 +18,7 @@ namespace paddle {
namespace neon { namespace neon {
void relu(const float* a, float* b, int len); void relu(const float* a, float* b, int len);
void prelu(const float* a, float w, float* b, int len);
} // namespace neon } // namespace neon
} // namespace paddle } // namespace paddle
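
For cross-checking the NEON branch added in this commit, a portable scalar reference can be useful. The sketch below is an assumption-laden illustration, not part of the change: `prelu_scalar`, `param_relu_grouped`, and `param_relu_indexed` are hypothetical names, and they operate on `std::vector<float>` instead of Paddle's `Matrix`. It mirrors the two traversal orders in `CpuMatrix::paramReluForward` (per-run calls as in the NEON branch, and per-element indexing with `w[i / partial_sum]` as in the `#else` branch) so that the two can be compared on the same data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar counterpart of neon::prelu: one slope w shared by a contiguous run.
// b[i] = a[i] > 0.0f ? a[i] : a[i] * w
void prelu_scalar(const float* a, float w, float* b, int len) {
  for (int i = 0; i < len; ++i) {
    b[i] = a[i] > 0.0f ? a[i] : a[i] * w;
  }
}

// Grouped traversal (same shape as the NEON branch): every sample holds
// num_elements values, split into w.size() runs of partial_sum values each.
std::vector<float> param_relu_grouped(const std::vector<float>& input,
                                      const std::vector<float>& w,
                                      std::size_t num_samples,
                                      std::size_t num_elements) {
  const std::size_t para_size = w.size();
  assert(para_size > 0 && num_elements % para_size == 0);
  assert(input.size() == num_samples * num_elements);
  const std::size_t partial_sum = num_elements / para_size;
  std::vector<float> output(input.size());
  const float* in = input.data();
  float* out = output.data();
  for (std::size_t n = 0; n < num_samples; ++n) {
    for (std::size_t i = 0; i < para_size; ++i) {
      prelu_scalar(in + i * partial_sum, w[i], out + i * partial_sum,
                   static_cast<int>(partial_sum));
    }
    in += num_elements;
    out += num_elements;
  }
  return output;
}

// Indexed traversal (same shape as the #else branch): the slope for element i
// of a sample is w[i / partial_sum].
std::vector<float> param_relu_indexed(const std::vector<float>& input,
                                      const std::vector<float>& w,
                                      std::size_t num_samples,
                                      std::size_t num_elements) {
  const std::size_t para_size = w.size();
  assert(para_size > 0 && num_elements % para_size == 0);
  assert(input.size() == num_samples * num_elements);
  const std::size_t partial_sum = num_elements / para_size;
  std::vector<float> output(input.size());
  for (std::size_t n = 0, k = 0; n < num_samples; ++n) {
    for (std::size_t i = 0; i < num_elements; ++i, ++k) {
      output[k] = input[k] > 0.0f ? input[k] : input[k] * w[i / partial_sum];
    }
  }
  return output;
}
```

Both functions should return identical vectors for any input whose per-sample width is a multiple of `w.size()`. A test for the NEON path could compare its output against either of them, with run lengths that are not multiples of 16 so the scalar tail loop in `neon::prelu` is exercised as well.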