Commit 832e763e authored by Ray Liu, committed by GitHub

Merge branch 'develop' into develop

@@ -10,6 +10,7 @@ option(LOG_PROFILE "log profile" OFF)
option(CPU "armv7 with neon" ON)
option(GPU_MALI "mali gpu" OFF)
option(GPU_CL "opencl gpu" OFF)
option(FPGA "fpga" OFF)
if(FPGA)
  option(FPGAV1 "fpga v1" ON)
@@ -144,7 +145,7 @@ if(FPGA)
    endforeach()
    file(GLOB_RECURSE _tmp_list src/operators/kernel/fpga/V2/*.h src/fpga/V2/*.h)
    foreach(f ${_tmp_list})
-     list(REMOVE_ITEM PADDLE_MOBILE_CC ${f})
+     list(REMOVE_ITEM PADDLE_MOBILE_H ${f})
    endforeach()
  endif()
  if(FPGAV2)
@@ -156,7 +157,7 @@ if(FPGA)
    endforeach()
    file(GLOB_RECURSE _tmp_list src/operators/kernel/fpga/V1/*.h src/fpga/V1/*.h)
    foreach(f ${_tmp_list})
-     list(REMOVE_ITEM PADDLE_MOBILE_CC ${f})
+     list(REMOVE_ITEM PADDLE_MOBILE_H ${f})
    endforeach()
  endif()
......
@@ -7,11 +7,21 @@
<!--[![Release](https://img.shields.io/github/release/PaddlePaddle/Paddle-Mobile.svg)](https://github.com/PaddlePaddle/Paddle-Mobile/releases)
[![License](https://img.shields.io/badge/license-Apache%202-blue.svg)](LICENSE)-->

+Welcome to the Paddle-Mobile GitHub project. Paddle-Mobile is a PaddlePaddle project and a deep learning framework for embedded platforms.
+
Welcome to the Paddle-Mobile GitHub project. Paddle-Mobile is a project under the PaddlePaddle organization and a deep learning framework dedicated to embedded platforms.

## Features
+- high performance on ARM CPU
+- supports Mali GPU
+- supports Adreno GPU
+- supports GPU Metal on Apple devices
+- supports ZU5, ZU9 and other FPGA development boards
+- supports Raspberry Pi and other arm-linux development boards
+
+## Features
- high performance on ARM CPU
- supports Mali GPU
- supports Adreno GPU
@@ -19,6 +29,7 @@
- supports ZU5, ZU9 and other FPGA development boards
- supports Raspberry Pi and other arm-linux development boards

## Demo
- [ANDROID](https://github.com/xiebaiyuan/paddle-mobile-demo)
@@ -26,6 +37,27 @@
[https://github.com/PaddlePaddle/paddle-mobile/tree/develop/demo](https://github.com/PaddlePaddle/paddle-mobile/tree/develop/demo)

+## Documentation
+
+### Design documentation
+If you want to know more about the design of paddle-mobile, please refer to the link below. There are also many earlier designs and discussions in the [issues](https://github.com/PaddlePaddle/paddle-mobile/issues).
+
+[link to the design documentation](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/doc/design_doc.md)
+
+### Development documentation
+The development documentation mainly covers building, running and related tasks. As a developer, you can use it together with the contributed documents.
+* [iOS](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/doc/development_ios.md)
+* [Android_CPU](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/doc/development_android.md)
+* [Android_GPU](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/doc/development_android_GPU.md)
+* [FPGA](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/doc/development_fpga.md)
+* [ARM_LINUX](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/doc/development_arm_linux.md)
+
+### How to contribute your documents
+- [tutorial on contributing documents](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/CONTRIBUTING.md)
+- The document above covers the main workflow for contributing code. If you hit other problems along the way, please file an [issue](https://github.com/PaddlePaddle/paddle-mobile/issues). We will deal with it as quickly as possible.
+
## Documentation

### Design documentation
@@ -46,6 +78,24 @@
- [contribution documentation](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/CONTRIBUTING.md)
- The document above covers the main workflow for contributing code. If you run into other problems in practice, you can file an [issue](https://github.com/PaddlePaddle/paddle-mobile/issues). We will handle it as soon as we see it.

+## Acquisition of Models
+At present Paddle-Mobile only supports models trained with Paddle Fluid. If your model was trained with a different framework, it has to be converted before it can run.
+
+### 1. Train directly with Paddle Fluid
+This is the most reliable way and is recommended.
+
+### 2. Convert a Caffe model to a Paddle Fluid model
+[https://github.com/PaddlePaddle/models/tree/develop/fluid/image_classification/caffe2fluid](https://github.com/PaddlePaddle/models/tree/develop/fluid/image_classification/caffe2fluid)
+
+### 3. ONNX
+ONNX stands for Open Neural Network Exchange. The project aims to make models fully interchangeable and usable across different neural network frameworks.
+
+Besides directly using Fluid models trained with PaddlePaddle, you can also obtain certain Paddle Fluid models through ONNX conversion.
+
+At present, Baidu is also working on ONNX support. The related conversion project can be found here:
+[https://github.com/PaddlePaddle/paddle-onnx](https://github.com/PaddlePaddle/paddle-onnx)
+
+### 4. Download some test models and test images
+[http://mms-graph.bj.bcebos.com/paddle-mobile%2FmodelsAndImages.zip](http://mms-graph.bj.bcebos.com/paddle-mobile%2FmodelsAndImages.zip)
+
## Obtaining models
At present Paddle-Mobile only supports models trained with Paddle Fluid. If the model you have comes from a different framework, it needs to be converted before it can run.
@@ -64,6 +114,22 @@ ONNX stands for "Open Neural Network Exchange"
### 4. Download some test models and test images
[http://mms-graph.bj.bcebos.com/paddle-mobile%2FmodelsAndImages.zip](http://mms-graph.bj.bcebos.com/paddle-mobile%2FmodelsAndImages.zip)

+<!--## Online results in Simple Search
+The gif below shows the online main-subject detection of the Simple Search app
+
+![ezgif-1-050a733dfb](http://otkwwi4x8.bkt.clouddn.com/2018-07-05-ezgif-1-050a733dfb.gif)-->
+
+## Ask Questions
+You are welcome to raise problems or help solve them; you can post your questions under [Github Issues](https://github.com/PaddlePaddle/paddle-mobile/issues).
+
+## Copyright and License
+Paddle-Mobile is released under the relatively permissive [Apache-2.0 license](LICENSE).
+
+## Old version: Mobile-Deep-Learning
+The original MDL (Mobile-Deep-Learning) project has been moved to [Mobile-Deep-Learning](https://github.com/allonli/mobile-deep-learning)
+
<!--## Online results in Simple Search
The gif below shows the online main-subject detection of the Simple Search app
......
@@ -16,9 +16,9 @@ limitations under the License. */
#ifdef ENABLE_EXCEPTION
#include <stdio.h>
+#include <stdlib.h>
#include <exception>
#include <string>
#endif

namespace paddle_mobile {
......
@@ -22,9 +22,10 @@ const char *G_OP_TYPE_BATCHNORM = "batch_norm";
const char *G_OP_TYPE_BOX_CODER = "box_coder";
const char *G_OP_TYPE_CONCAT = "concat";
const char *G_OP_TYPE_ELEMENTWISE_ADD = "elementwise_add";
+const char *G_OP_TYPE_ELEMENTWISE_SUB = "elementwise_sub";
+const char *G_OP_TYPE_ELEMENTWISE_MUL = "elementwise_mul";
const char *G_OP_TYPE_FILL_CONSTANT = "fill_constant";
const char *G_OP_TYPE_FUSION_CONV_ADD_RELU = "fusion_conv_add_relu";
-const char *G_OP_TYPE_FUSION_CONV_ADD_RELU_INT8 = "fusion_conv_add_relu_int8";
const char *G_OP_TYPE_FUSION_CONV_ADD_PRELU = "fusion_conv_add_prelu";
const char *G_OP_TYPE_FUSION_CONV_ADD_ADD_PRELU = "fusion_conv_add_add_prelu";
const char *G_OP_TYPE_FUSION_CONV_ADD_BN_RELU = "fusion_conv_add_bn_relu";
@@ -32,7 +33,6 @@ const char *G_OP_TYPE_FUSION_CONV_BN_ADD_RELU = "fusion_conv_bn_add_relu";
const char *G_OP_TYPE_FUSION_DWCONV_BN_RELU = "fusion_dwconv_bn_relu";
const char *G_OP_TYPE_FUSION_CONV_BN_RELU = "fusion_conv_bn_relu";
const char *G_OP_TYPE_FC = "fusion_fc";
-const char *G_OP_TYPE_FC_INT8 = "fusion_fc_int8";
const char *G_OP_TYPE_FUSION_CONV_ADD = "fusion_conv_add";
const char *G_OP_TYPE_LRN = "lrn";
const char *G_OP_TYPE_MUL = "mul";
@@ -41,6 +41,7 @@ const char *G_OP_TYPE_POLYGON_BOX_TRANSFORM = "polygon_box_transform";
const char *G_OP_TYPE_POOL2D = "pool2d";
const char *G_OP_TYPE_PRIOR_BOX = "prior_box";
const char *G_OP_TYPE_RELU = "relu";
+const char *G_OP_TYPE_RELU6 = "relu6";
const char *G_OP_TYPE_RESHAPE = "reshape";
const char *G_OP_TYPE_RESHAPE2 = "reshape2";
const char *G_OP_TYPE_SIGMOID = "sigmoid";
@@ -68,14 +69,20 @@ const char *G_OP_TYPE_CRF = "crf_decoding";
const char *G_OP_TYPE_BILINEAR_INTERP = "bilinear_interp";
const char *G_OP_TYPE_FLATTEN = "flatten";
const char *G_OP_TYPE_SHAPE = "shape";
-const char *G_OP_TYPE_ELEMENTWISE_MUL = "elementwise_mul";
const char *G_OP_TYPE_SUM = "sum";
+const char *G_OP_TYPE_TOP_K = "top_k";
+const char *G_OP_TYPE_CAST = "cast";

const char *G_OP_TYPE_QUANTIZE = "quantize";
const char *G_OP_TYPE_DEQUANTIZE = "dequantize";
+const char *G_OP_TYPE_FUSION_DEQUANT_BN = "fusion_dequant_bn";
const char *G_OP_TYPE_FUSION_DEQUANT_ADD_BN = "fusion_dequant_add_bn";
const char *G_OP_TYPE_FUSION_DEQUANT_BN_RELU = "fusion_dequant_bn_relu";
const char *G_OP_TYPE_FUSION_DEQUANT_ADD_BN_RELU = "fusion_dequant_add_bn_relu";
+const char *G_OP_TYPE_FUSION_DEQUANT_ADD_BN_QUANT =
+    "fusion_dequant_add_bn_quant";
+const char *G_OP_TYPE_FUSION_DEQUANT_ADD_BN_RELU_QUANT =
+    "fusion_dequant_add_bn_relu_quant";

const char *G_OP_TYPE_TANH = "tanh";
const char *G_OP_TYPE_FUSION_DECONV_RELU = "fusion_deconv_relu";
@@ -91,10 +98,13 @@ std::unordered_map<
    {G_OP_TYPE_PRELU, {{"X", "Alpha"}, {"Out"}}},
    {G_OP_TYPE_FUSION_CONV_ADD, {{"Input"}, {"Out"}}},
    {G_OP_TYPE_RELU, {{"X"}, {"Out"}}},
+   {G_OP_TYPE_RELU6, {{"X"}, {"Out"}}},
    {G_OP_TYPE_SOFTMAX, {{"X"}, {"Out"}}},
    {G_OP_TYPE_SIGMOID, {{"X"}, {"Out"}}},
    {G_OP_TYPE_MUL, {{"X"}, {"Out"}}},
    {G_OP_TYPE_ELEMENTWISE_ADD, {{"X", "Y"}, {"Out"}}},
+   {G_OP_TYPE_ELEMENTWISE_SUB, {{"X", "Y"}, {"Out"}}},
+   {G_OP_TYPE_ELEMENTWISE_MUL, {{"X", "Y"}, {"Out"}}},
    {G_OP_TYPE_POOL2D, {{"X"}, {"Out"}}},
    {G_OP_TYPE_BATCHNORM, {{"X"}, {"Y"}}},
    {G_OP_TYPE_LRN, {{"X"}, {"Out"}}},
@@ -112,13 +122,11 @@ std::unordered_map<
    {G_OP_TYPE_MULTICLASS_NMS, {{"BBoxes", "Scores"}, {"Out"}}},
    {G_OP_TYPE_POLYGON_BOX_TRANSFORM, {{"Input"}, {"Output"}}},
    {G_OP_TYPE_FC, {{"X", "Y", "Z"}, {"Out"}}},
-   {G_OP_TYPE_FC_INT8, {{"X", "Y", "Z", "Scale"}, {"Out"}}},
    {G_OP_TYPE_RESHAPE, {{"X"}, {"Out"}}},
    {G_OP_TYPE_RESHAPE2, {{"X"}, {"Out", "XShape"}}},
    {G_OP_TYPE_DEPTHWISE_CONV, {{"Input"}, {"Output"}}},
    {G_OP_TYPE_FILL_CONSTANT, {{}, {"Out"}}},
    {G_OP_TYPE_FUSION_CONV_ADD_RELU, {{"Input"}, {"Out"}}},
-   {G_OP_TYPE_FUSION_CONV_ADD_RELU_INT8, {{"Input", "Scale"}, {"Out"}}},
    {G_OP_TYPE_FUSION_CONV_ADD_PRELU, {{"Input"}, {"Out"}}},
    {G_OP_TYPE_FUSION_CONV_ADD_ADD_PRELU, {{"Input"}, {"Out"}}},
    {G_OP_TYPE_IM2SEQUENCE, {{"X"}, {"Out"}}},
@@ -139,12 +147,18 @@ std::unordered_map<
    {G_OP_TYPE_SHAPE, {{"Input"}, {"Out"}}},
    {G_OP_TYPE_CONV_TRANSPOSE, {{"Input"}, {"Output"}}},
    {G_OP_TYPE_SUM, {{"X"}, {"Out"}}},
-   {G_OP_TYPE_ELEMENTWISE_MUL, {{"X", "Y"}, {"Out"}}},
+   {G_OP_TYPE_TOP_K, {{"X"}, {"Out", "Indices"}}},
+   {G_OP_TYPE_CAST, {{"X"}, {"Out"}}},
    {G_OP_TYPE_QUANTIZE, {{"X"}, {"Out", "OutScale"}}},
    {G_OP_TYPE_DEQUANTIZE, {{"X", "Scale"}, {"Out"}}},
-   {G_OP_TYPE_FUSION_DEQUANT_ADD_BN, {{"X", "Scale"}, {"Y"}}},
+   {G_OP_TYPE_FUSION_DEQUANT_BN, {{"X", "Scale"}, {"Out"}}},
+   {G_OP_TYPE_FUSION_DEQUANT_ADD_BN, {{"X", "Scale"}, {"Out"}}},
    {G_OP_TYPE_FUSION_DEQUANT_BN_RELU, {{"X", "Scale"}, {"Out"}}},
    {G_OP_TYPE_FUSION_DEQUANT_ADD_BN_RELU, {{"X", "Scale"}, {"Out"}}},
+   {G_OP_TYPE_FUSION_DEQUANT_ADD_BN_RELU_QUANT,
+    {{"X", "Scale"}, {"Out", "OutScale"}}},
+   {G_OP_TYPE_FUSION_DEQUANT_ADD_BN_QUANT,
+    {{"X", "Scale"}, {"Out", "OutScale"}}},
    {G_OP_TYPE_TANH, {{"X"}, {"Out"}}},
    {G_OP_TYPE_FUSION_DECONV_RELU, {{"Input"}, {"Out"}}},
    {G_OP_TYPE_FUSION_DECONV_ADD, {{"Input"}, {"Out"}}},
......
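The registry above pairs each operator type string with the input and output argument names the framework expects for it. A minimal standalone sketch of that lookup; the real variable name is truncated in this hunk, so `op_io_registry` below is a placeholder:

```cpp
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the registry shown above.
using KeyPair = std::pair<std::vector<std::string>, std::vector<std::string>>;
std::unordered_map<std::string, KeyPair> op_io_registry = {
    {"top_k", {{"X"}, {"Out", "Indices"}}},
    {"fusion_dequant_add_bn_quant", {{"X", "Scale"}, {"Out", "OutScale"}}},
};

int main() {
  // Fusion and matcher code can ask which variables an op reads and writes.
  const KeyPair &io = op_io_registry.at("top_k");
  std::cout << "inputs: " << io.first.size()
            << ", outputs: " << io.second.size() << std::endl;  // 1, 2
  return 0;
}
```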
@@ -87,10 +87,24 @@ enum PMStatus {
};

enum RoundType {
-  ROUND_UNK = 0,
-  ROUND_NEAREST_AWAY_ZERO = 1,
-  ROUND_NEAREST_TOWARDS_ZERO = 2,
-  ROUND_NEAREST_TO_EVEN = 3
+  ROUND_NEAREST_AWAY_ZERO = 0,
+  ROUND_NEAREST_TOWARDS_ZERO = 1,
+  ROUND_NEAREST_TO_EVEN = 2,
};
+
+enum ActivationType {
+  IDENTITY = 0,
+  RELU = 1,
+  RELU6 = 2,
+  PRELU = 3,
+  LEAKY_RELU = 4,
+  TANH = 5,
+  SIGMOID = 6,
+};
+
+enum PoolingType {
+  MAX = 0,
+  AVG = 1,
+};

extern const char *G_OP_TYPE_CONV;
@@ -98,12 +112,12 @@ extern const char *G_OP_TYPE_BATCHNORM;
extern const char *G_OP_TYPE_BOX_CODER;
extern const char *G_OP_TYPE_CONCAT;
extern const char *G_OP_TYPE_ELEMENTWISE_ADD;
+extern const char *G_OP_TYPE_ELEMENTWISE_SUB;
+extern const char *G_OP_TYPE_ELEMENTWISE_MUL;
extern const char *G_OP_TYPE_FUSION_CONV_ADD_RELU;
-extern const char *G_OP_TYPE_FUSION_CONV_ADD_RELU_INT8;
extern const char *G_OP_TYPE_FUSION_CONV_ADD_PRELU;
extern const char *G_OP_TYPE_FUSION_CONV_ADD_ADD_PRELU;
extern const char *G_OP_TYPE_FC;
-extern const char *G_OP_TYPE_FC_INT8;
extern const char *G_OP_TYPE_FUSION_CONV_ADD;
extern const char *G_OP_TYPE_FUSION_CONV_ADD_BN_RELU;
extern const char *G_OP_TYPE_FUSION_CONV_BN_ADD_RELU;
@@ -116,6 +130,7 @@ extern const char *G_OP_TYPE_MULTICLASS_NMS;
extern const char *G_OP_TYPE_POOL2D;
extern const char *G_OP_TYPE_PRIOR_BOX;
extern const char *G_OP_TYPE_RELU;
+extern const char *G_OP_TYPE_RELU6;
extern const char *G_OP_TYPE_RESHAPE;
extern const char *G_OP_TYPE_SIGMOID;
extern const char *G_OP_TYPE_SOFTMAX;
@@ -136,13 +151,17 @@ extern const char *G_OP_TYPE_FUSION_CONV_BN;
extern const char *G_OP_TYPE_CONV_TRANSPOSE;
extern const char *G_OP_TYPE_PRELU;
extern const char *G_OP_TYPE_SUM;
-extern const char *G_OP_TYPE_ELEMENTWISE_MUL;
+extern const char *G_OP_TYPE_TOP_K;
+extern const char *G_OP_TYPE_CAST;

extern const char *G_OP_TYPE_QUANTIZE;
extern const char *G_OP_TYPE_DEQUANTIZE;
+extern const char *G_OP_TYPE_FUSION_DEQUANT_BN;
extern const char *G_OP_TYPE_FUSION_DEQUANT_ADD_BN;
extern const char *G_OP_TYPE_FUSION_DEQUANT_BN_RELU;
extern const char *G_OP_TYPE_FUSION_DEQUANT_ADD_BN_RELU;
+extern const char *G_OP_TYPE_FUSION_DEQUANT_ADD_BN_QUANT;
+extern const char *G_OP_TYPE_FUSION_DEQUANT_ADD_BN_RELU_QUANT;

extern const char *G_OP_TYPE_TANH;
extern const char *G_OP_TYPE_FUSION_DECONV_RELU;
......
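The three RoundType values kept after this change correspond to standard rounding rules, which matter when float values are quantized to int8. A minimal sketch, not taken from the repo, of what each mode does:

```cpp
#include <cmath>
#include <cstdint>
#include <iostream>

enum RoundType {
  ROUND_NEAREST_AWAY_ZERO = 0,
  ROUND_NEAREST_TOWARDS_ZERO = 1,
  ROUND_NEAREST_TO_EVEN = 2,
};

// Hypothetical helper applying one rounding mode to a float value.
int32_t Round(float x, RoundType mode) {
  switch (mode) {
    case ROUND_NEAREST_AWAY_ZERO:
      return static_cast<int32_t>(std::round(x));  // ties away from zero
    case ROUND_NEAREST_TOWARDS_ZERO:
      // nearest integer, ties resolved towards zero
      return static_cast<int32_t>(x >= 0 ? std::ceil(x - 0.5f)
                                         : std::floor(x + 0.5f));
    case ROUND_NEAREST_TO_EVEN:
      // relies on the default FP rounding mode (round-to-nearest-even)
      return static_cast<int32_t>(std::nearbyint(x));
  }
  return 0;
}

int main() {
  std::cout << Round(2.5f, ROUND_NEAREST_AWAY_ZERO) << " "    // 3
            << Round(2.5f, ROUND_NEAREST_TOWARDS_ZERO) << " "  // 2
            << Round(2.5f, ROUND_NEAREST_TO_EVEN) << std::endl;  // 2
  return 0;
}
```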
@@ -24,8 +24,6 @@ namespace fpga {
#define USE_RELU 1
#define USE_BIAS 2

-int get_align_image_cw(int cw) { return align_to_x(cw, IMAGE_ALIGNMENT); }
-
void format_image(framework::Tensor *image_tensor) {
  auto dims = image_tensor->dims();
  auto channel = dims[1], height = dims[2], width = dims[3];
@@ -83,6 +81,13 @@ int get_plit_num(framework::Tensor *filter_tensor) {
  int div_capacity = filter::calc_division_capacity(chw);
  return filter::calc_split_num(num, div_capacity);
}
+
+int get_deconv_plit_num(framework::Tensor *filter_tensor, int stride) {
+  auto dims = filter_tensor->dims();
+  auto chw = dims[1] * dims[2] / stride * dims[3] / stride;
+  auto num = dims[0] * stride;
+  int div_capacity = filter::calc_division_capacity(chw);
+  return filter::calc_split_num(num, div_capacity);
+}

int get_filter_num_per_div(framework::Tensor *filter_tensor, int group_num) {
  auto dims = filter_tensor->dims();
@@ -92,12 +97,17 @@ int get_filter_num_per_div(framework::Tensor *filter_tensor, int group_num) {
  return filter::calc_num_per_div(num, group_num, div_capacity);
}

+int get_deconv_filter_num_per_div(framework::Tensor *filter_tensor,
+                                  int group_num, int stride) {
+  auto dims = filter_tensor->dims();
+  auto chw = dims[1] * dims[2] / stride * dims[3] / stride;
+  auto num = dims[0] * stride;
+  int div_capacity = filter::calc_division_capacity(chw);
+  return filter::calc_num_per_div(num, group_num, div_capacity);
+}
+
int get_aligned_filter_element_num(int chw) {
  return align_to_x(chw, FILTER_ELEMENT_ALIGNMENT);
}

-int get_aligned_filter_num(int num) {
-  return align_to_x(num, FILTER_NUM_ALIGNMENT);
-}
-
void format_filter(framework::Tensor *filter_tensor, float max_value,
@@ -177,46 +187,37 @@ void format_concat_output(framework::Tensor *out, int height, int width,
void expand_conv_arg(ConvArgs *arg) {
  ConvArgs args = *arg;
-  uint64_t filterlen = (uint64_t)args.kernel.width *
-                       (uint64_t)args.kernel.height *
-                       (uint64_t)args.image.channels;
-  filterlen = align_to_x(filterlen, FILTER_ELEMENT_ALIGNMENT);
-  filterlen *= align_to_x((uint64_t)args.filter_num, FILTER_NUM_ALIGNMENT);
-  uint64_t fpga_bias_scale_len =
+
+  auto fpga_bias_scale_len =
      align_to_x(args.filter_num / args.group_num, 8) * args.group_num;
-  uint64_t output_height =
+  auto output_height =
      (args.image.height + args.image.pad_height * 2 - args.kernel.height) /
          args.kernel.stride_h +
      1;
-  uint64_t output_width =
+  auto output_width =
      (args.image.width + args.image.pad_width * 2 - args.kernel.width) /
          args.kernel.stride_w +
      1;
-  uint64_t output_size =
-      output_height * output_width * (uint64_t)args.filter_num;
-  auto filter_per_group = (uint64_t)(args.filter_num / args.group_num);
-  auto channel_per_group = (uint64_t)(args.image.channels / args.group_num);
-  uint64_t image_row_count = ((uint64_t)args.image.width) *
-                             ((uint64_t)args.image.channels);  // without align
-  uint64_t image_amount_per_row = align_to_x(image_row_count, IMAGE_ALIGNMENT);
-  uint64_t image_one_pad_per_row =
-      align_to_x(image_row_count, IMAGE_ALIGNMENT) +
-      ((uint64_t)args.image.pad_width) * ((uint64_t)args.image.channels);
-  uint64_t filter_amount_all =
-      align_to_x(((uint64_t)args.kernel.height) *
-                     ((uint64_t)args.kernel.width) * channel_per_group,
+
+  auto filter_per_group = args.filter_num / args.group_num;
+  auto channel_per_group = args.image.channels / args.group_num;
+
+  auto image_row_count = args.image.width * args.image.channels;
+  auto image_amount_per_row = align_to_x(image_row_count, IMAGE_ALIGNMENT);
+  auto image_one_pad_per_row = align_to_x(image_row_count, IMAGE_ALIGNMENT) +
+                               args.image.pad_width * args.image.channels;
+  auto filter_amount_all =
+      align_to_x(args.kernel.height * args.kernel.width * channel_per_group,
                 FILTER_ELEMENT_ALIGNMENT);
-  uint64_t output_amount_per_row =
-      align_to_x(output_width * ((uint64_t)args.filter_num), IMAGE_ALIGNMENT);
+  auto output_amount_per_row =
+      align_to_x(output_width * args.filter_num, IMAGE_ALIGNMENT);

  // find the opt partition strategy
  uint64_t res_win;
  uint64_t res_fit = 0;
-  for (res_win = 1; res_win <= output_width; res_win = res_win + 1) {
+  for (res_win = 1; res_win <= output_width; res_win++) {
    if ((align_to_x(
            (args.image.channels *
             (args.kernel.width + (res_win - 1) * args.kernel.stride_w)),
@@ -238,48 +239,48 @@ void expand_conv_arg(ConvArgs *arg) {
  }

  res_fit = res_win;
-  uint64_t block_num = (output_width + res_fit - 1) / res_fit;
-  uint64_t block_len = res_fit;
-  uint64_t block_last = output_width - res_fit * (block_num - 1);
+  auto block_num = (output_width + res_fit - 1) / res_fit;
+  auto block_len = res_fit;
+  auto block_last = output_width - res_fit * (block_num - 1);

-  uint64_t res_amount_per_row = output_width * args.filter_num;
-  uint64_t res_amount_per_row_pad = output_amount_per_row - res_amount_per_row;
+  auto res_amount_per_row = output_width * args.filter_num;
+  auto res_amount_per_row_pad = output_amount_per_row - res_amount_per_row;

-  uint64_t image_block_amount_per_row =
-      args.kernel.stride_w * (res_fit)*args.image.channels;
-  uint64_t filter_pad_width_mul_channel =
+  auto image_block_amount_per_row =
+      args.kernel.stride_w * res_fit * args.image.channels;
+  auto filter_pad_width_mul_channel =
      args.image.pad_width * args.image.channels;
-  uint64_t image_amount_per_row_multi_win_first =
+  auto image_amount_per_row_multi_win_first =
      image_amount_per_row * (4 * args.kernel.stride_h - args.image.pad_height);
-  uint64_t image_amount_per_row_multi_win =
+  auto image_amount_per_row_multi_win =
      image_amount_per_row * (4 * args.kernel.stride_h);

-  uint64_t image_block_num = block_num;
-  uint64_t image_block_len =
+  auto image_block_num = block_num;
+  auto image_block_len =
      align_to_x((args.image.channels *
                  (args.kernel.width + (block_len - 1) * args.kernel.stride_w)),
                 IMAGE_ALIGNMENT) /
          16 +
      1;
-  uint64_t image_block_len_last =
+  auto image_block_len_last =
      align_to_x(
          (args.image.channels *
           (args.kernel.width + (block_last - 1) * args.kernel.stride_w)),
          IMAGE_ALIGNMENT) /
          16 +
      1;
-  uint64_t image_win_cnt = block_len;
-  uint64_t image_win_cnt_last = block_last;
-  uint64_t res_row_data_align4_pad = res_amount_per_row_pad / 8;
-  uint64_t prog_full_cnt = 2048 / (filter_amount_all / 16 * 2) - 1;
+  auto image_win_cnt = block_len;
+  auto image_win_cnt_last = block_last;
+  auto res_row_data_align4_pad = res_amount_per_row_pad / 8;
+  auto prog_full_cnt = 2048 / (filter_amount_all / 16 * 2) - 1;
  if (prog_full_cnt == 1023) {
    prog_full_cnt--;
  }
-  uint64_t post_prog_full_cnt =
+  auto post_prog_full_cnt =
      (512 / (align_to_x(args.filter_num, 4) / 4 * 2) > 2)
          ? (512 / (align_to_x(args.filter_num, 4) / 4 * 2) - 2)
          : 0;

-  uint64_t cmd = 0UL | (args.relu_enabled ? USE_RELU : 0) | USE_BIAS;
+  auto cmd = 0UL | (args.relu_enabled ? USE_RELU : 0) | USE_BIAS;

  (*arg).driver.image_address_phy = vaddr_to_paddr(args.image.address);
  (*arg).driver.sb_address_phy = vaddr_to_paddr(args.sb_address);
@@ -449,7 +450,6 @@ void fill_deconv_arg(struct DeconvArgs *arg, framework::Tensor *input,
  arg->sub_conv_num = (uint32_t)stride_h;
  arg->filter_num = (uint32_t)filter->dims()[0];
  int sub_conv_num = arg->sub_conv_num;
-  int sub_stride = 1;
  int sub_pad = deconv_filter::deconv_calc_sub_pad((int)filter->dims()[3],
                                                   padding_w, stride_w);
  int sub_filter_width = deconv_filter::deconv_get_sub_filter_axis(
@@ -464,7 +464,9 @@ void fill_deconv_arg(struct DeconvArgs *arg, framework::Tensor *input,
  arg->sub_output_height = (uint32_t)sub_output_height;
  arg->omit_size = (uint32_t)deconv_filter::deconv_get_omit(
      stride_w, (int)filter->dims()[3], padding_w);
-  arg->conv_args = (ConvArgs *)fpga_malloc(sub_conv_num * sizeof(ConvArgs));
+
+  arg->output.address = out_ptr;
+  arg->output.scale_address = out->scale;

  int sub_channels = (int)input->dims()[1];
  int omit_size = arg->omit_size;
@@ -484,50 +486,160 @@ void fill_deconv_arg(struct DeconvArgs *arg, framework::Tensor *input,
  int align_conv_sub_filter_count =
      align_sub_filter_count * align_sub_filter_num;

+  int split_num =
+      group_num == 1 ? (uint32_t)get_deconv_plit_num(filter, sub_conv_num) : 1;
+
+  arg->split_conv_args =
+      (SplitConvArgs *)fpga_malloc(sub_conv_num * sizeof(SplitConvArgs));
  for (int i = 0; i < sub_conv_num; ++i) {
-    arg->conv_args[i].filter_num = (arg->sub_conv_num) * (arg->filter_num);
-    arg->conv_args[i].group_num = (uint32_t)group_num;
-    arg->conv_args[i].filter_scale_address = filter->scale;
-    arg->conv_args[i].relu_enabled = relu_enabled;
-    arg->conv_args[i].kernel.width = (uint32_t)sub_filter_width;
-    arg->conv_args[i].kernel.height = (uint32_t)sub_filter_width;
-    arg->conv_args[i].kernel.stride_w = 1;
-    arg->conv_args[i].kernel.stride_h = 1;
-    // DeconvParam.conv_args[i].image.address = (void*)ptr_image;
-    arg->conv_args[i].image.scale_address = input->scale;
-    arg->conv_args[i].image.channels = (uint32_t)sub_channels;
-    arg->conv_args[i].image.width = (uint32_t)input->dims()[3];
-    arg->conv_args[i].image.height = (uint32_t)input->dims()[2];
-    arg->conv_args[i].image.pad_width = (uint32_t)sub_pad;
-    arg->conv_args[i].image.pad_height = (uint32_t)sub_pad;
-    arg->conv_args[i].image.address = input_ptr;
-    arg->conv_args[i].sb_address = (void *)bs_ptr;
-
-    auto filter_sub_space =
-        (char *)fpga_malloc(align_conv_sub_filter_count * sizeof(char));
-    fpga_copy(filter_sub_space,
-              (char *)filter_ptr + i * align_conv_sub_filter_count,
-              (size_t)align_conv_sub_filter_count);
-    arg->conv_args[i].filter_address = (void *)(filter_sub_space);
-    fpga_flush(filter_sub_space, (size_t)align_conv_sub_filter_count);
-
-    if (sub_conv_num == 1) {
-      arg->conv_args[i].output.address = out_ptr;
-      arg->conv_args[i].output.scale_address = out->scale;
-    } else {
-      auto ptr_output = (half *)fpga_malloc(conv_output_size * sizeof(half));
-      arg->conv_args[i].output.address = (void *)((half *)ptr_output);
-      auto ptr_output_scale = (float *)fpga_malloc(2 * sizeof(float));
-      arg->conv_args[i].output.scale_address = ptr_output_scale;
-    }
-  }
-
-  arg->output.address = out_ptr;
-  arg->output.scale_address = out->scale;
-  // fpga_free(filter_ptr);
+    arg->split_conv_args[i].filter_num =
+        (arg->sub_conv_num) * (arg->filter_num);
+    arg->split_conv_args[i].group_num = (uint32_t)group_num;
+    arg->split_conv_args[i].split_num = split_num;
+    arg->split_conv_args[i].conv_arg =
+        (ConvArgs *)fpga_malloc(split_num * sizeof(ConvArgs));
+
+    arg->split_conv_args[i].concat_arg.height = sub_output_height;
+    arg->split_conv_args[i].concat_arg.width = sub_output_width;
+    arg->split_conv_args[i].concat_arg.image_num = split_num;
+    arg->split_conv_args[i].concat_arg.images_in =
+        (half **)fpga_malloc(split_num * sizeof(half *));
+    arg->split_conv_args[i].concat_arg.scales_in =
+        (float **)fpga_malloc(split_num * sizeof(float *));
+    arg->split_conv_args[i].concat_arg.channel_num =
+        (uint32_t *)fpga_malloc(split_num * sizeof(uint32_t));
+    // arg->split_conv_args[i].concat_arg.image_out =
+    //     fpga_malloc(conv_output_size * sizeof(half));
+    // arg->split_conv_args[i].concat_arg.scale_out = fpga_malloc(2 *
+    //     sizeof(float));
+  }
+
+  int filter_num_per_div =
+      get_deconv_filter_num_per_div(filter, group_num, stride_w);
+  int element_num = get_aligned_filter_element_num(
+      (int)(sub_channels * sub_filter_width * sub_filter_width));
+
+  int chw = sub_channels * sub_filter_width * sub_filter_width;
+  int division_capacity = filter::calc_division_capacity(chw);
+  int num_per_div_before_alignment =
+      filter::calc_num_per_div(sub_filter_num, group_num, division_capacity);
+  int num_per_div_after_alignment =
+      align_to_x(num_per_div_before_alignment, FILTER_NUM_ALIGNMENT);
+  int div_num = (sub_filter_num + num_per_div_before_alignment - 1) /
+                num_per_div_before_alignment;
+  int residual = sub_filter_num % num_per_div_before_alignment;
+  int num_after_alignment = num_per_div_after_alignment *
+                                ((residual == 0) ? div_num : (div_num - 1)) +
+                            align_to_x(residual, FILTER_NUM_ALIGNMENT);
+  int filter_sub_conv_offset = element_num * num_after_alignment;
+
+  for (int i = 0; i < sub_conv_num; ++i) {
+    if (sub_conv_num == 1) {
+      arg->split_conv_args[i].output.address = arg->output.address;
+      arg->split_conv_args[i].output.scale_address = arg->output.scale_address;
+    } else {
+      auto ptr_output = (half *)fpga_malloc(conv_output_size * sizeof(half));
+      arg->split_conv_args[i].output.address = (void *)((half *)ptr_output);
+      auto ptr_output_scale = (float *)fpga_malloc(2 * sizeof(float));
+      arg->split_conv_args[i].output.scale_address = ptr_output_scale;
+    }
+
+    for (int j = 0; j < split_num; ++j) {
+      arg->split_conv_args[i].conv_arg[j].relu_enabled = relu_enabled;
+      arg->split_conv_args[i].conv_arg[j].group_num = (uint32_t)group_num;
+      arg->split_conv_args[i].conv_arg[j].kernel.width =
+          (uint32_t)sub_filter_width;
+      arg->split_conv_args[i].conv_arg[j].kernel.height =
+          (uint32_t)sub_filter_width;
+      arg->split_conv_args[i].conv_arg[j].kernel.stride_w = 1;
+      arg->split_conv_args[i].conv_arg[j].kernel.stride_h = 1;
+
+      arg->split_conv_args[i].conv_arg[j].image.scale_address = input->scale;
+      arg->split_conv_args[i].conv_arg[j].image.channels =
+          (uint32_t)sub_channels;
+      arg->split_conv_args[i].conv_arg[j].image.width =
+          (uint32_t)input->dims()[3];
+      arg->split_conv_args[i].conv_arg[j].image.height =
+          (uint32_t)input->dims()[2];
+      arg->split_conv_args[i].conv_arg[j].image.pad_width = (uint32_t)sub_pad;
+      arg->split_conv_args[i].conv_arg[j].image.pad_height = (uint32_t)sub_pad;
+      arg->split_conv_args[i].conv_arg[j].image.address = input_ptr;
+
+      arg->split_conv_args[i].conv_arg[j].filter_scale_address = filter->scale;
+      arg->split_conv_args[i].conv_arg[j].filter_num = (uint32_t)(
+          j == split_num - 1
+              ? sub_filter_num - (split_num - 1) * filter_num_per_div  // NOLINT
+              : filter_num_per_div);
+
+      size_t filter_size =
+          element_num *
+          align_to_x(arg->split_conv_args[i].conv_arg[j].filter_num,
+                     FILTER_NUM_ALIGNMENT) *
+          sizeof(int8_t);
+      auto filter_head =
+          &((int8_t *)filter_ptr)[j * element_num * filter_num_per_div +
+                                  i * filter_sub_conv_offset];
+      arg->split_conv_args[i].conv_arg[j].filter_address =
+          fpga_malloc(filter_size);
+      memcpy(arg->split_conv_args[i].conv_arg[j].filter_address, filter_head,
+             filter_size);
+      fpga_flush(arg->split_conv_args[i].conv_arg[j].filter_address,
+                 filter_size);
+
+      {
+        static int test_cnt = 0;
+        signed char result = 0;
+        if (test_cnt <= 1) {
+          std::string filename = "deconv_split_flt" + std::to_string(test_cnt);
+          fpga::savefile<signed char>(
+              filename, arg->split_conv_args[i].conv_arg[j].filter_address,
+              filter_size, result);
+          test_cnt++;
+        }
+      }
+
+      size_t bs_align_num = align_to_x(
+          arg->split_conv_args[i].conv_arg[j].filter_num, BS_NUM_ALIGNMENT);
+      size_t bs_size = 2 * bs_align_num * sizeof(float);
+      auto bs_head = &bs_ptr[j * filter_num_per_div * 2];
+      arg->split_conv_args[i].conv_arg[j].sb_address = fpga_malloc(bs_size);
+      memcpy(arg->split_conv_args[i].conv_arg[j].sb_address, bs_head, bs_size);
+      fpga_flush(arg->split_conv_args[i].conv_arg[j].sb_address, bs_size);
+
+      if (split_num == 1) {
+        arg->split_conv_args[i].conv_arg[j].output.address =
+            arg->split_conv_args[i].output.address;
+        arg->split_conv_args[i].conv_arg[j].output.scale_address =
+            arg->split_conv_args[i].output.scale_address;
+      } else {
+        auto ptr_output = (half *)fpga_malloc(conv_output_size * sizeof(half));
+        arg->split_conv_args[i].conv_arg[j].output.address =
+            (void *)((half *)ptr_output);
+        auto ptr_output_scale = (float *)fpga_malloc(2 * sizeof(float));
+        arg->split_conv_args[i].conv_arg[j].output.scale_address =
+            ptr_output_scale;
+      }
+
+      arg->split_conv_args[i].concat_arg.images_in[j] =
+          (half *)arg->split_conv_args[i].conv_arg[j].output.address;  // NOLINT
+      arg->split_conv_args[i].concat_arg.scales_in[j] =
+          arg->split_conv_args[i].conv_arg[j].output.scale_address;
+      arg->split_conv_args[i].concat_arg.channel_num[j] =
+          arg->split_conv_args[i].conv_arg[j].filter_num;
+
+      expand_conv_arg(&(arg->split_conv_args[i].conv_arg[j]));
+    }
+
+    arg->split_conv_args[i].concat_arg.image_out =
+        arg->split_conv_args[i].output.address;
+    arg->split_conv_args[i].concat_arg.scale_out =
+        arg->split_conv_args[i].output.scale_address;
+  }
+
+  filter->reset_data_ptr(nullptr);
+  fpga_free(bs_ptr);
}  // fill_deconv_arg

}  // namespace fpga
......
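fill_deconv_arg now builds a nested argument layout: one SplitConvArgs per sub-convolution (sub_conv_num of them), each carrying split_num ConvArgs plus a concat descriptor. A hypothetical traversal of that layout, assuming the FPGA structs from the headers above are in scope (the dispatch call itself is a placeholder, not a repo API):

```cpp
// #include "fpga/V1/api.h"  // assumed location of DeconvArgs and friends

void DispatchDeconv(const paddle_mobile::fpga::DeconvArgs &args) {
  for (uint32_t i = 0; i < args.sub_conv_num; ++i) {
    const auto &split = args.split_conv_args[i];
    for (uint32_t j = 0; j < split.split_num; ++j) {
      // Each ConvArgs describes one hardware convolution invocation.
      // ComputeBasicConv(split.conv_arg[j]);  // hypothetical PE call
    }
    // The split outputs are then concatenated according to split.concat_arg.
  }
}
```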
@@ -21,16 +21,19 @@ limitations under the License. */
namespace paddle_mobile {
namespace fpga {

-int get_align_image_cw(int cw);
void format_image(framework::Tensor* image_tensor);
void format_fp16_ofm(framework::Tensor* ofm_tensor);  // only allocate memory
void format_fp32_ofm(framework::Tensor* ofm_tensor);

float filter_find_max(framework::Tensor* filter_tensor);
int get_filter_num_per_div(framework::Tensor* filter_tensor, int group_num);
+int get_deconv_filter_num_per_div(framework::Tensor* filter_tensor,
+                                  int group_num, int stride);
int get_plit_num(framework::Tensor* filter_tensor);
+int get_deconv_plit_num(framework::Tensor* filter_tensor, int stride);
int get_aligned_filter_element_num(int chw);
-int get_aligned_filter_num(int num);
void format_filter(framework::Tensor* filter_tensor, float max_value,
                   int group_num);
void format_fc_filter(framework::Tensor* filter_tensor, float max_value);
......
@@ -40,10 +40,9 @@ inverse kernel weights of each channel for every filter
void deconv_inverse_filter(float** data_in, int num, int channel, int width,
                           int height) {
  float* tmp = *data_in;
-  // float fix_range = 127;  // float scale = fix_range / max;
  int data_size = num * channel * width * height;
  int hw_len = height * width;
-  float* tmp_data = (float*)fpga_malloc(data_size * sizeof(float));
+  auto tmp_data = (float*)fpga_malloc(data_size * sizeof(float));
  for (int i = 0; i < num; ++i) {
    for (int j = 0; j < channel; ++j) {
      for (int k = 0; k < hw_len; ++k) {
@@ -52,7 +51,7 @@ void deconv_inverse_filter(float** data_in, int num, int channel, int width,
      }
    }
  }
-  *data_in = (float*)tmp_data;
+  *data_in = tmp_data;
  fpga_free(tmp);
}
@@ -61,8 +60,7 @@ void deconv_inverse_filter(float** data_in, int num, int channel, int width,
*/
int deconv_calc_sub_pad(int filter_axis, int pad, int stride) {
  if (stride == 0 || ((filter_axis - pad - 1) < 0)) {
-    // error
-    return 0;
+    PADDLE_MOBILE_ENFORCE(false, "Wrong deconv parameters");
  }
  return (filter_axis - pad - 1) / stride;
}
@@ -79,11 +77,8 @@ int deconv_get_sub_out_axis(int image_axis, int sub_pad, int sub_filter_axis) {
position. so the omit rows or columns is (stride - )
*/
int deconv_get_omit(int stride, int filter_width, int pad) {
-  if (((filter_width - pad) <= 0)) {  // ((filter_width-pad) > stride) ||
-    // error
-    return 0;
-  }
-  int idx = 1;
+  PADDLE_MOBILE_ENFORCE(filter_width > pad, "Wrong deconv parameters");
+  int idx;
  bool flag = false;
  for (idx = 1; idx <= stride; ++idx) {
    int j = idx;
@@ -102,10 +97,6 @@ int deconv_get_omit(int stride, int filter_width, int pad) {
  return (stride - idx);
}

-int deconv_get_sub_filter_num(int filter_num, int stride) {
-  return filter_num * stride;
-}
-
void deconv_get_sub_filter(char** data_in, int height, int width,
                           int sub_conv_n, int kernel_num, int channel) {
  char* ptr_tmp = *data_in;
@@ -245,7 +236,6 @@ void deconv_format_filter(float** data_in, int num, int channel, int height,
  char* ptr_space = (char*)fpga_malloc(sub_conv_n * align_offset *
                                       sizeof(char));  // continuous space
  for (int i = 0; i < sub_conv_n; ++i) {
-    int offset = i * origin_offset;
    char* ptr_tmp = (ptr_ptr_data)[i];
    filter::align_element(&ptr_tmp, sub_num, sub_chw);
......
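deconv_calc_sub_pad derives the padding of each sub-convolution as (filter_axis - pad - 1) / stride, and invalid combinations are now rejected with PADDLE_MOBILE_ENFORCE instead of silently returning 0. A small standalone restatement with illustrative numbers:

```cpp
#include <iostream>

// Restatement of the formula above, used only to show the numbers it
// produces for a plausible configuration.
int deconv_calc_sub_pad(int filter_axis, int pad, int stride) {
  return (filter_axis - pad - 1) / stride;
}

int main() {
  // e.g. a 5x5 deconvolution filter with pad 2, split across stride 2
  std::cout << deconv_calc_sub_pad(5, 2, 2) << std::endl;  // 1
  // a setup with filter_axis - pad - 1 < 0 would now trip the enforce check
  return 0;
}
```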
@@ -21,7 +21,6 @@ namespace deconv_filter {
void deconv_inverse_filter(float** data_in, int num, int channel, int width,
                           int height);
int deconv_calc_sub_pad(int filter_axis, int pad, int stride);
-int deconv_get_sub_filter_num(int filter_num, int stride);
int deconv_get_sub_filter_axis(int filter_axis, int stride);
int deconv_get_sub_out_axis(int image_axis, int sub_pad, int sub_filter_axis);
int deconv_get_omit(int stride, int filter_width, int pad);
......
This diff is collapsed.
@@ -153,10 +153,6 @@ int memory_request(struct fpga_memory *memory, size_t size, uint64_t *addr) {
  uint64_t _nr = DIV_ROUND_UP(size, FPGA_PAGE_SIZE);
  unsigned int nr = (unsigned int)_nr;
  int ret = 0;
-  DLOG << size;
-  DLOG << _nr;
-  DLOG << nr;
  uint64_t a_size = FPGA_PAGE_SIZE * nr;
  DLOG << a_size;
@@ -283,7 +279,7 @@ int fpga_memory_add() {
  return 0;
}

-uint64_t vaddr_to_paddr(void *address) {
+uint64_t vaddr_to_paddr_driver(void *address) {
  uint64_t paddr = 0;
  auto iter = g_fpgainfo.fpga_vaddr2paddr_map.find(address);
  if (iter != g_fpgainfo.fpga_vaddr2paddr_map.end()) {
@@ -315,7 +311,7 @@ void *fpga_reg_free(void *ptr) {
    g_fpgainfo.fpga_addr2size_map.erase(iter);
    munmap(ptr, size);
  } else {
-    DLOG << "Invalid pointer";
+    DLOG << "Invalid pointer" << ptr;
  }
}
@@ -347,7 +343,7 @@ void fpga_free_driver(void *ptr) {
    g_fpgainfo.fpga_addr2size_map.erase(iter);
    munmap(ptr, size);

-    p_addr = vaddr_to_paddr(ptr);
+    p_addr = vaddr_to_paddr_driver(ptr);
    pos = (p_addr - g_fpgainfo.memory_info->mem_start) / FPGA_PAGE_SIZE;

    /*clear bitmap*/
@@ -361,7 +357,7 @@ void fpga_free_driver(void *ptr) {
      g_fpgainfo.fpga_vaddr2paddr_map.erase(iter);
    }
  } else {
-    DLOG << "Invalid pointer";
+    DLOG << "Invalid pointer" << ptr;
  }
}
@@ -373,7 +369,7 @@ int fpga_flush_driver(void *address, size_t size) {
  struct MemoryCacheArgs args;
  uint64_t p_addr;

-  p_addr = vaddr_to_paddr(address);
+  p_addr = vaddr_to_paddr_driver(address);
  args.offset = (void *)(p_addr - FPGA_MEM_PHY_ADDR);  // NOLINT
  args.size = size;
@@ -385,7 +381,7 @@ int fpga_invalidate_driver(void *address, size_t size) {
  struct MemoryCacheArgs args;
  uint64_t p_addr;

-  p_addr = vaddr_to_paddr(address);
+  p_addr = vaddr_to_paddr_driver(address);
  args.offset = (void *)(p_addr - FPGA_MEM_PHY_ADDR);  // NOLINT
  args.size = size;
......
@@ -31,8 +31,8 @@ namespace driver {
#define FPGA_REG_PHY_ADDR 0xa0000000
#define FPGA_REG_SIZE 0x1000

-#define FPGA_MEM_PHY_ADDR 0x20000000
-#define FPGA_MEM_SIZE 0x20000000
+#define FPGA_MEM_PHY_ADDR 0x40000000
+#define FPGA_MEM_SIZE 0x80000000

#define FPGA_PAGE_SIZE (16UL * 1024UL)
@@ -122,15 +122,11 @@ void *fpga_malloc_driver(size_t size);
void fpga_free_driver(void *ptr);

-void fpga_copy_driver(void *dest, const void *src, size_t num);
-
int fpga_flush_driver(void *address, size_t size);
int fpga_invalidate_driver(void *address, size_t size);

-/*pe*/
-uint64_t vaddr_to_paddr(void *address);
+uint64_t vaddr_to_paddr_driver(void *address);

int fpga_regpoll(uint64_t reg, uint64_t val, int time);
......
@@ -59,6 +59,9 @@ int close_device() {
void *fpga_malloc(size_t size) {
  static uint64_t counter = 0;
+  if (size <= 0) {
+    size = 1;
+  }
#ifdef PADDLE_MOBILE_ZU5
  auto ptr = driver::fpga_malloc_driver(size);
#else
@@ -115,7 +118,7 @@ int fpga_invalidate(void *address, size_t size) {
}

uint64_t vaddr_to_paddr(void *address) {
#ifdef PADDLE_MOBILE_ZU5
-  return driver::vaddr_to_paddr(address);
+  return driver::vaddr_to_paddr_driver(address);
#else
  return 0;
#endif
......
@@ -37,6 +37,18 @@ enum LayoutType {
  LAYOUT_HWC = 0,
};

+enum ActivationType {
+  NONE = 0,
+  LEAKYRELU = 1,
+  SIGMOID = 2,
+  TANH = 3,
+};
+
+struct ActivationArgs {
+  enum ActivationType activation_type;
+  int16_t leaky_relu_negative_slope;
+};
+
struct KernelArgs {
  uint32_t width;
  uint32_t height;
@@ -58,7 +70,10 @@ struct ImageOutputArgs {
  void* address;         // output result address;
  float* scale_address;  // output scale address;
  uint64_t timer_cnt;    // time counter for FPGA computation
+  struct ActivationArgs
+      activation;  // To select activation and specify (Leaky)Relu parameter.
};

#ifdef PADDLE_MOBILE_FPGA_V1
struct ConvDriverParam {
  uint64_t image_address_phy;
@@ -195,10 +210,14 @@ struct DeconvArgs {
  uint32_t sub_output_width;
  uint32_t sub_output_height;
  struct ImageOutputArgs output;
-  struct ConvArgs* conv_args;
+  struct SplitConvArgs* split_conv_args;
};

-static inline int align_to_x(int num, int x) { return (num + x - 1) / x * x; }
+// static inline int align_to_x(int num, int x) { return (num + x - 1) / x * x;
+// }
+static inline uint32_t align_to_x(int64_t num, int64_t x) {
+  return ((uint32_t)(num + x) - 1) / (uint32_t)x * (uint32_t)x;
+}

int16_t fp32_2_fp16(float fp32_num);
float fp16_2_fp32(int16_t fp16_num);
......
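The rewritten align_to_x still rounds a count up to the next multiple of x, just with the result narrowed to uint32_t. A quick standalone check of that behaviour (the alignment value 16 is only illustrative; it matches how IMAGE_ALIGNMENT is used above but is an assumption here):

```cpp
#include <cstdint>
#include <iostream>

// Restatement of the helper shown above, for illustration only.
static inline uint32_t align_to_x(int64_t num, int64_t x) {
  return ((uint32_t)(num + x) - 1) / (uint32_t)x * (uint32_t)x;
}

int main() {
  std::cout << align_to_x(50, 16) << std::endl;  // 64
  std::cout << align_to_x(64, 16) << std::endl;  // 64 (already aligned)
  std::cout << align_to_x(65, 16) << std::endl;  // 80
  return 0;
}
```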
@@ -28,6 +28,10 @@ extern _PaddleMobile__Framework__Proto__VarType__Type ToDataType(
extern std::type_index ToTypeIndex(
    _PaddleMobile__Framework__Proto__VarType__Type type);

+inline _PaddleMobile__Framework__Proto__VarType__Type ToDataType(int type) {
+  return static_cast<_PaddleMobile__Framework__Proto__VarType__Type>(type);
+}
+
template <typename Visitor>
inline void VisitDataType(_PaddleMobile__Framework__Proto__VarType__Type type,
                          Visitor visitor) {
......
This diff is collapsed.
@@ -17,6 +17,7 @@ limitations under the License. */
#include <map>
#include <memory>
#include <string>
+#include <utility>
#include <vector>
#include "common/types.h"
#include "common/util.h"
@@ -28,41 +29,29 @@ limitations under the License. */
namespace paddle_mobile {
namespace framework {

-template <typename Dtype = CPU, Precision P = Precision::FP32>
+template <typename Device, typename T = float>
class Executor {
 public:
-  typedef typename PrecisionTrait<P>::ptype Ptype;
-  // exector constructor
-  // @param program program converted from proto program in PaddlePaddle
-  // @param use_optimize bool whether use operator fusion to speed up or not
-  // @param loddable bool
-  Executor(const framework::Program<Dtype> program, int batch_size = 1,
-           const bool use_optimize = true, const bool loddable = false);
-
-  // predict with tensor input
-  // @param t input tensor to do prediction
-  // @return predicted tensor
-  std::shared_ptr<framework::Tensor> Predict(const framework::Tensor &t);
-
-  // predict with lod tensor input
-  // @param t input lod tensor to do prediction
-  // @return predicted lod tensor
-  std::shared_ptr<framework::LoDTensor> PredictLod(
-      const framework::LoDTensor &t);
-
-  // predict with vector input and dims
-  // @param input vector whose elements will be formed
-  // @param input lod tensor to do prediction
-  // @param dims vector whose elements will be formed
-  // @param input tensor shape
-  // @return vector which is flatted from predicted tensor
-  std::vector<Ptype> Predict(const std::vector<Ptype> &input,
-                             const std::vector<int64_t> &dims);
+  Executor(const Program<Device> &program, int batch_size = 1,
+           const bool use_optimize = true, const bool lod_mode = false);
+
+  PMStatus Predict(const std::vector<std::pair<std::string, Tensor>> &inputs);
+  PMStatus Predict(
+      const std::vector<std::pair<std::string, LoDTensor>> &inputs);
+
+  std::vector<T> Predict(const std::vector<T> &input,
+                         const std::vector<int64_t> &dims);
+  PMStatus Predict();
+
+  void SetInput(const Tensor &input, const std::string &var_name);
+  void SetInput(const LoDTensor &input, const std::string &var_name);
+
+  std::shared_ptr<LoDTensor> GetOutput(const std::string &var_name);

#ifdef PADDLE_MOBILE_FPGA
-  void InjectVariable(const framework::Tensor &t, std::string var_name);
-  void FeedData(const framework::Tensor &t);
-  std::shared_ptr<framework::Tensor> FetchResult(int id = -1);
+  void InjectVariable(const Tensor &t, std::string var_name);
+  void FeedData(const Tensor &t);
+  std::shared_ptr<Tensor> FetchResult(int id = -1);
  void Predict_From_To(int start = 0, int end = -1);
  void Predict_From(int start);
  void Predict_To(int end);
@@ -70,26 +59,28 @@ class Executor {
 protected:
  Executor() = default;
-  std::shared_ptr<framework::Tensor> Predict(const framework::Tensor &t,
-                                             int block_id);
-  bool varInputMemory(const std::shared_ptr<framework::VarDesc> &var_desc,
-                      framework::Variable *var,
-                      framework::LoDTensor *tensor) const;
+
+  bool varInputMemory(const std::shared_ptr<VarDesc> &var_desc, Variable *var,
+                      LoDTensor *tensor) const;
  void InitMemory();
  void InitCombineMemory();
-  void LoadMemory(void **data,
-                  const std::shared_ptr<framework::VarDesc> var_desc,
-                  framework::LoDTensor *tensor);
+  void LoadMemory(void **data, const std::shared_ptr<VarDesc> var_desc,
+                  LoDTensor *tensor);
#ifdef PADDLE_MOBILE_CL
-  void LoadMemory(const framework::VarDesc var_desc, float *tensorInput,
-                  char **data);
+  void LoadMemory(const VarDesc var_desc, float *tensorInput, char **data);
#endif
-  framework::Program<Dtype> program_;
-  int batch_size_ = 1;
-  std::shared_ptr<framework::ProgramDesc> to_predict_program_;
-  std::map<framework::BlockDesc,
-           std::vector<std::shared_ptr<framework::OperatorBase<Dtype>>>>
-      ops_of_block_;
+
+  int batch_size_;
+  bool use_optimize_;
+  bool lod_mode_;
+  Program<Device> program_;
+  std::shared_ptr<ProgramDesc> program_desc_;
+  typedef std::shared_ptr<OperatorBase<Device>> OperatorBasePtr;
+  std::vector<std::vector<OperatorBasePtr>> ops_of_block_;
+  // operators list
+  std::vector<OperatorBasePtr> ops_list_;
+
#ifdef PADDLE_MOBILE_PROFILE
  struct ProfInfo {
    int tid = 0;
@@ -97,8 +88,6 @@ class Executor {
    uint64_t runEnd = 0UL;
  };
#endif
-  bool use_optimize_ = false;
-  bool loddable_ = false;
};

}  // namespace framework
......
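The reworked Executor above replaces the old Predict/PredictLod pair with name-addressed SetInput/GetOutput calls plus a parameterless Predict(). Below is a minimal usage sketch, not part of the diff: the include paths, the model directory and the "feed"/"fetch" variable names are assumptions for illustration only.

```cpp
// Sketch: exercising the new Executor<Device, T> interface declared above.
#include <memory>
#include "framework/executor.h"  // header locations assumed
#include "framework/loader.h"

namespace pm = paddle_mobile;
namespace fw = paddle_mobile::framework;

void RunExecutorOnce() {
  fw::Loader<pm::CPU, float> loader;
  fw::Executor<pm::CPU, float> executor(
      loader.Load("./mobilenet", /*optimize=*/true),
      /*batch_size=*/1, /*use_optimize=*/true, /*lod_mode=*/false);

  fw::Tensor input;
  input.Resize(fw::make_ddim({1, 3, 224, 224}));
  input.mutable_data<float>();       // fill with preprocessed image data

  executor.SetInput(input, "feed");  // bind the input by variable name
  executor.Predict();                // run the whole program once
  std::shared_ptr<fw::LoDTensor> output = executor.GetOutput("fetch");
}
```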
...@@ -191,6 +191,7 @@ LOAD_OP2(mul, CPU, MALI_GPU); ...@@ -191,6 +191,7 @@ LOAD_OP2(mul, CPU, MALI_GPU);
#endif #endif
#ifdef RELU_OP #ifdef RELU_OP
LOAD_OP2(relu, CPU, MALI_GPU); LOAD_OP2(relu, CPU, MALI_GPU);
LOAD_OP1(relu6, CPU);
#endif #endif
#ifdef IM2SEQUENCE_OP #ifdef IM2SEQUENCE_OP
LOAD_OP1(im2sequence, CPU); LOAD_OP1(im2sequence, CPU);
...@@ -227,12 +228,22 @@ LOAD_FUSION_MATCHER(fusion_conv_bn); ...@@ -227,12 +228,22 @@ LOAD_FUSION_MATCHER(fusion_conv_bn);
#ifdef ELEMENTWISESUB_OP #ifdef ELEMENTWISESUB_OP
LOAD_OP1(elementwise_sub, CPU) LOAD_OP1(elementwise_sub, CPU)
#endif #endif
#ifdef TOP_K_OP
LOAD_OP1(top_k, CPU)
#endif
#ifdef CAST_OP
LOAD_OP1(cast, CPU)
#endif
#ifdef QUANT_OP #ifdef QUANT_OP
LOAD_OP1(quantize, CPU); LOAD_OP1(quantize, CPU);
#endif #endif
#ifdef DEQUANT_OP #ifdef DEQUANT_OP
LOAD_OP1(dequantize, CPU); LOAD_OP1(dequantize, CPU);
#endif #endif
#ifdef FUSION_DEQUANT_BN_OP
LOAD_OP1(fusion_dequant_bn, CPU);
LOAD_FUSION_MATCHER(fusion_dequant_bn);
#endif
#ifdef FUSION_DEQUANT_ADD_BN_OP #ifdef FUSION_DEQUANT_ADD_BN_OP
LOAD_OP1(fusion_dequant_add_bn, CPU); LOAD_OP1(fusion_dequant_add_bn, CPU);
LOAD_FUSION_MATCHER(fusion_dequant_add_bn); LOAD_FUSION_MATCHER(fusion_dequant_add_bn);
...@@ -245,3 +256,11 @@ LOAD_FUSION_MATCHER(fusion_dequant_bn_relu); ...@@ -245,3 +256,11 @@ LOAD_FUSION_MATCHER(fusion_dequant_bn_relu);
LOAD_OP1(fusion_dequant_add_bn_relu, CPU); LOAD_OP1(fusion_dequant_add_bn_relu, CPU);
LOAD_FUSION_MATCHER(fusion_dequant_add_bn_relu); LOAD_FUSION_MATCHER(fusion_dequant_add_bn_relu);
#endif #endif
#ifdef FUSION_DEQUANT_ADD_BN_QUANT_OP
LOAD_OP1(fusion_dequant_add_bn_quant, CPU);
LOAD_FUSION_MATCHER(fusion_dequant_add_bn_quant);
#endif
#ifdef FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP
LOAD_OP1(fusion_dequant_add_bn_relu_quant, CPU);
LOAD_FUSION_MATCHER(fusion_dequant_add_bn_relu_quant);
#endif
...@@ -23,14 +23,8 @@ limitations under the License. */ ...@@ -23,14 +23,8 @@ limitations under the License. */
namespace paddle_mobile { namespace paddle_mobile {
namespace framework { namespace framework {
/** template <typename Device, typename T>
* muteandresize tensor as originProgramDesc and scope in loadParams void Loader<Device, T>::InitMemoryFromProgram(
*
* @param originProgramDesc
* @param scope
*/
template <typename Dtype, Precision P>
void Loader<Dtype, P>::InitMemoryFromProgram(
const std::shared_ptr<ProgramDesc> &originProgramDesc, const std::shared_ptr<ProgramDesc> &originProgramDesc,
const std::shared_ptr<Scope> &scope) { const std::shared_ptr<Scope> &scope) {
for (const auto &block : originProgramDesc.get()->Blocks()) { for (const auto &block : originProgramDesc.get()->Blocks()) {
...@@ -43,8 +37,6 @@ void Loader<Dtype, P>::InitMemoryFromProgram( ...@@ -43,8 +37,6 @@ void Loader<Dtype, P>::InitMemoryFromProgram(
tensor->Resize(make_ddim(dim)); tensor->Resize(make_ddim(dim));
} else { } else {
auto dim = var_desc->Tensor_desc().Dims(); auto dim = var_desc->Tensor_desc().Dims();
// PADDLE_MOBILE_ENFORCE(dim.size() > 0, "dim size is 0");
// dim[0] = 1;
if (dim.size() == 0) { if (dim.size() == 0) {
auto tensor = var->GetMutable<LoDTensor>(); auto tensor = var->GetMutable<LoDTensor>();
framework::DDim dDim = {0}; framework::DDim dDim = {0};
...@@ -60,7 +52,7 @@ void Loader<Dtype, P>::InitMemoryFromProgram( ...@@ -60,7 +52,7 @@ void Loader<Dtype, P>::InitMemoryFromProgram(
} }
} }
} else { } else {
// TODO(codeWorm): some. // TODO(codeWorm)
} }
} }
} }
...@@ -68,7 +60,7 @@ void Loader<Dtype, P>::InitMemoryFromProgram( ...@@ -68,7 +60,7 @@ void Loader<Dtype, P>::InitMemoryFromProgram(
#ifdef PADDLE_MOBILE_CL #ifdef PADDLE_MOBILE_CL
template <> template <>
void Loader<GPU_CL, Precision::FP32>::InitMemoryFromProgram( void Loader<GPU_CL, float>::InitMemoryFromProgram(
const std::shared_ptr<ProgramDesc> &originProgramDesc, const std::shared_ptr<ProgramDesc> &originProgramDesc,
const std::shared_ptr<Scope> &scope) { const std::shared_ptr<Scope> &scope) {
for (const auto &block : originProgramDesc.get()->Blocks()) { for (const auto &block : originProgramDesc.get()->Blocks()) {
...@@ -77,7 +69,6 @@ void Loader<GPU_CL, Precision::FP32>::InitMemoryFromProgram( ...@@ -77,7 +69,6 @@ void Loader<GPU_CL, Precision::FP32>::InitMemoryFromProgram(
if (var_desc->Type() == VARTYPE_TYPE_LOD_TENSOR) { if (var_desc->Type() == VARTYPE_TYPE_LOD_TENSOR) {
if (var_desc->Persistable()) { if (var_desc->Persistable()) {
auto dim = var_desc->Tensor_desc().Dims(); auto dim = var_desc->Tensor_desc().Dims();
// auto tensor = var->GetMutable<LoDTensor>();
auto cl_image = var->GetMutable<framework::CLImage>(); auto cl_image = var->GetMutable<framework::CLImage>();
cl_image->Resize(make_ddim(dim)); cl_image->Resize(make_ddim(dim));
} else { } else {
...@@ -88,14 +79,13 @@ void Loader<GPU_CL, Precision::FP32>::InitMemoryFromProgram( ...@@ -88,14 +79,13 @@ void Loader<GPU_CL, Precision::FP32>::InitMemoryFromProgram(
cl_image->Resize(make_ddim(dim)); cl_image->Resize(make_ddim(dim));
} }
} else { } else {
// TODO(codeWorm): some. // TODO(codeWorm)
} }
} }
} }
} }
template <> template <>
const Program<GPU_CL, Precision::FP32> const Program<GPU_CL, float> Loader<GPU_CL, float>::LoadCombinedMemory(
Loader<GPU_CL, Precision::FP32>::LoadCombinedMemory(
size_t read_size, const uint8_t *buf, size_t combined_params_len, size_t read_size, const uint8_t *buf, size_t combined_params_len,
uint8_t *combined_params_buf, bool optimize, bool quantification) { uint8_t *combined_params_buf, bool optimize, bool quantification) {
bool can_add_split = false; bool can_add_split = false;
...@@ -113,7 +103,7 @@ Loader<GPU_CL, Precision::FP32>::LoadCombinedMemory( ...@@ -113,7 +103,7 @@ Loader<GPU_CL, Precision::FP32>::LoadCombinedMemory(
auto originProgramDesc = std::make_shared<ProgramDesc>(c_program); auto originProgramDesc = std::make_shared<ProgramDesc>(c_program);
Program<GPU_CL, Precision::FP32> program; Program<GPU_CL, float> program;
program.combined = true; program.combined = true;
program.originProgram = originProgramDesc; program.originProgram = originProgramDesc;
program.quantification = quantification; program.quantification = quantification;
...@@ -145,16 +135,16 @@ Loader<GPU_CL, Precision::FP32>::LoadCombinedMemory( ...@@ -145,16 +135,16 @@ Loader<GPU_CL, Precision::FP32>::LoadCombinedMemory(
/** /**
* fusion and print some info * fusion and print some info
* @tparam Dtype * @tparam Device
* @tparam P * @tparam T
* @param optimize * @param optimize
* @param can_add_split * @param can_add_split
* @param program * @param program
* @param originProgramDesc * @param originProgramDesc
*/ */
template <typename Dtype, Precision P> template <typename Device, typename T>
void FusionAndPrintInfos( void FusionAndPrintInfos(
bool optimize, bool can_add_split, Program<Dtype, P> *program, bool optimize, bool can_add_split, Program<Device, T> *program,
const std::shared_ptr<ProgramDesc> &originProgramDesc) { const std::shared_ptr<ProgramDesc> &originProgramDesc) {
if (optimize) { if (optimize) {
ProgramOptimize program_optimize; ProgramOptimize program_optimize;
...@@ -193,22 +183,22 @@ static size_t ReadBuffer(const char *file_name, uint8_t **out) { ...@@ -193,22 +183,22 @@ static size_t ReadBuffer(const char *file_name, uint8_t **out) {
return cur_len; return cur_len;
} }
template <typename Dtype, Precision P> template <typename Device, typename T>
const Program<Dtype, P> Loader<Dtype, P>::Load(const std::string &dirname, const Program<Device, T> Loader<Device, T>::Load(const std::string &dirname,
bool optimize, bool optimize,
bool quantification, bool quantification,
bool can_add_split) { bool can_add_split) {
auto program = this->LoadProgram(dirname + "/__model__", optimize, auto program = this->LoadProgram(dirname + "/__model__", optimize,
quantification, can_add_split); quantification, can_add_split);
program.model_path = dirname; program.model_path = dirname;
return program; return program;
} }
template <typename Dtype, Precision P> template <typename Device, typename T>
const Program<Dtype, P> Loader<Dtype, P>::Load(const std::string &model_path, const Program<Device, T> Loader<Device, T>::Load(const std::string &model_path,
const std::string &para_path, const std::string &para_path,
bool optimize, bool optimize,
bool quantification) { bool quantification) {
auto program = this->LoadProgram(model_path, optimize, quantification); auto program = this->LoadProgram(model_path, optimize, quantification);
program.para_path = para_path; program.para_path = para_path;
...@@ -217,8 +207,8 @@ const Program<Dtype, P> Loader<Dtype, P>::Load(const std::string &model_path, ...@@ -217,8 +207,8 @@ const Program<Dtype, P> Loader<Dtype, P>::Load(const std::string &model_path,
return program; return program;
} }
template <typename Dtype, Precision P> template <typename Device, typename T>
const Program<Dtype, P> Loader<Dtype, P>::LoadProgram( const Program<Device, T> Loader<Device, T>::LoadProgram(
const std::string &model_path, bool optimize, bool quantification, const std::string &model_path, bool optimize, bool quantification,
bool can_add_split) { bool can_add_split) {
std::string model_filename = model_path; std::string model_filename = model_path;
...@@ -237,7 +227,7 @@ const Program<Dtype, P> Loader<Dtype, P>::LoadProgram( ...@@ -237,7 +227,7 @@ const Program<Dtype, P> Loader<Dtype, P>::LoadProgram(
// //
auto originProgramDesc = std::make_shared<ProgramDesc>(c_program); auto originProgramDesc = std::make_shared<ProgramDesc>(c_program);
Program<Dtype, P> program; Program<Device, T> program;
program.originProgram = originProgramDesc; program.originProgram = originProgramDesc;
program.quantification = quantification; program.quantification = quantification;
program.combined_params_len = 0; program.combined_params_len = 0;
...@@ -254,8 +244,8 @@ const Program<Dtype, P> Loader<Dtype, P>::LoadProgram( ...@@ -254,8 +244,8 @@ const Program<Dtype, P> Loader<Dtype, P>::LoadProgram(
return program; return program;
} }
template <typename Dtype, Precision P> template <typename Device, typename T>
const Program<Dtype, P> Loader<Dtype, P>::LoadCombinedMemory( const Program<Device, T> Loader<Device, T>::LoadCombinedMemory(
size_t read_size, const uint8_t *buf, size_t combined_params_len, size_t read_size, const uint8_t *buf, size_t combined_params_len,
uint8_t *combined_params_buf, bool optimize, bool quantification) { uint8_t *combined_params_buf, bool optimize, bool quantification) {
bool can_add_split = false; bool can_add_split = false;
...@@ -273,7 +263,7 @@ const Program<Dtype, P> Loader<Dtype, P>::LoadCombinedMemory( ...@@ -273,7 +263,7 @@ const Program<Dtype, P> Loader<Dtype, P>::LoadCombinedMemory(
auto originProgramDesc = std::make_shared<ProgramDesc>(c_program); auto originProgramDesc = std::make_shared<ProgramDesc>(c_program);
Program<Dtype, P> program; Program<Device, T> program;
program.combined = true; program.combined = true;
program.originProgram = originProgramDesc; program.originProgram = originProgramDesc;
program.quantification = quantification; program.quantification = quantification;
...@@ -289,13 +279,13 @@ const Program<Dtype, P> Loader<Dtype, P>::LoadCombinedMemory( ...@@ -289,13 +279,13 @@ const Program<Dtype, P> Loader<Dtype, P>::LoadCombinedMemory(
return program; return program;
} }
template class Loader<CPU, Precision::FP32>; template class Loader<CPU, float>;
template class Loader<FPGA, Precision::FP32>; template class Loader<FPGA, float>;
template class Loader<GPU_MALI, Precision::FP32>; template class Loader<GPU_MALI, float>;
template class Loader<GPU_CL, Precision::FP32>; template class Loader<GPU_CL, float>;
} // namespace framework } // namespace framework
} // namespace paddle_mobile } // namespace paddle_mobile
...@@ -22,39 +22,39 @@ limitations under the License. */ ...@@ -22,39 +22,39 @@ limitations under the License. */
namespace paddle_mobile { namespace paddle_mobile {
namespace framework { namespace framework {
template <typename Dtype = CPU, Precision P = Precision::FP32> template <typename Device = CPU, typename T = float>
class Loader { class Loader {
public: public:
/* /*
* @b load separate format fluid model * @b load separate format fluid model
* @b load the fluid model in separate-file format * @b load the fluid model stored as separate files
* */ * */
const Program<Dtype, P> Load(const std::string &dirname, const Program<Device, T> Load(const std::string &dirname,
bool optimize = false, bool optimize = false,
bool quantification = false, bool quantification = false,
bool can_add_split = false); bool can_add_split = false);
/* /*
* @b load combined format fluid model * @b load combined format fluid model
* @b load the model in combined format * @b load the fluid model stored as one combined file
* */ * */
const Program<Dtype, P> Load(const std::string &model_path, const Program<Device, T> Load(const std::string &model_path,
const std::string &para_path, const std::string &para_path,
bool optimize = false, bool optimize = false,
bool quantification = false); bool quantification = false);
const Program<Dtype, P> LoadCombinedMemory(size_t model_len, const Program<Device, T> LoadCombinedMemory(size_t model_len,
const uint8_t *model_buf, const uint8_t *model_buf,
size_t combined_params_len, size_t combined_params_len,
uint8_t *combined_params_buf, uint8_t *combined_params_buf,
bool optimize = false, bool optimize = false,
bool quantification = false); bool quantification = false);
private: private:
const Program<Dtype, P> LoadProgram(const std::string &model_path, const Program<Device, T> LoadProgram(const std::string &model_path,
bool optimize = false, bool optimize = false,
bool quantification = false, bool quantification = false,
bool can_add_split = false); bool can_add_split = false);
void InitMemoryFromProgram( void InitMemoryFromProgram(
const std::shared_ptr<ProgramDesc> &originProgramDesc, const std::shared_ptr<ProgramDesc> &originProgramDesc,
......
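For reference, the two public Load overloads above correspond to the two fluid model layouts. A short usage sketch follows; the directory and file paths are placeholders.

```cpp
// Sketch: loading both fluid model layouts with the CPU/float Loader.
void LoadBothLayouts() {
  paddle_mobile::framework::Loader<paddle_mobile::CPU, float> loader;

  // Separate layout: a directory with __model__ plus one file per parameter.
  auto separate = loader.Load("./mobilenet", /*optimize=*/true,
                              /*quantification=*/false);

  // Combined layout: one model file plus one merged parameter file.
  auto combined = loader.Load("./mobilenet-combined/model",
                              "./mobilenet-combined/params",
                              /*optimize=*/true,
                              /*quantification=*/false);
}
```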
...@@ -16,12 +16,12 @@ limitations under the License. */ ...@@ -16,12 +16,12 @@ limitations under the License. */
#include <memory> #include <memory>
#include <string> #include <string>
#include <utility>
#include <vector> #include <vector>
#include "tensor.h" #include "framework/tensor.h"
#include "tensor_util.h" #include "framework/tensor_util.h"
namespace paddle_mobile { namespace paddle_mobile {
namespace framework { namespace framework {
/* /*
...@@ -202,5 +202,29 @@ void SerializeToStream(std::ostream &os, const LoDTensor &tensor); ...@@ -202,5 +202,29 @@ void SerializeToStream(std::ostream &os, const LoDTensor &tensor);
void DeserializeFromStream(std::istream &is, LoDTensor *tensor); void DeserializeFromStream(std::istream &is, LoDTensor *tensor);
#ifdef PADDLE_MOBILE_DEBUG
inline Print &operator<<(Print &printer, const LoDTensor &tensor) {
printer << " dims: " << tensor.dims() << "\n";
int stride = tensor.numel() / 20;
stride = stride > 0 ? stride : 1;
#ifndef PADDLE_MOBILE_FPGA
for (int i = 0; i < tensor.numel(); i += stride) {
if (tensor.type() == typeid(float)) {
printer << tensor.data<float>()[i] << " ";
} else if (tensor.type() == typeid(int32_t)) {
printer << tensor.data<int32_t>()[i] << " ";
} else if (tensor.type() == typeid(int64_t)) {
printer << tensor.data<int64_t>()[i] << " ";
} else if (tensor.type() == typeid(int8_t)) {
printer << static_cast<int>(tensor.data<int8_t>()[i]) << " ";
}
}
#endif // PADDLE_MOBILE_FPGA
return printer;
}
#endif // PADDLE_MOBILE_DEBUG
} // namespace framework } // namespace framework
} // namespace paddle_mobile } // namespace paddle_mobile
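The debug-only operator<< added above prints the dims followed by roughly twenty evenly strided elements, dispatching on float, int32, int64 and int8 element types. A usage sketch, assuming the DLOG stream from common/log.h as the Print sink in PADDLE_MOBILE_DEBUG builds (that pairing is an assumption):

```cpp
// Sketch: dumping a LoDTensor through the new debug operator<<.
#ifdef PADDLE_MOBILE_DEBUG
void DumpLoDTensorExample() {
  paddle_mobile::framework::LoDTensor t;
  t.Resize(paddle_mobile::framework::make_ddim({2, 3}));
  float *p = t.mutable_data<float>();
  for (int i = 0; i < t.numel(); ++i) {
    p[i] = static_cast<float>(i);
  }
  DLOG << "lod tensor: " << t;  // prints " dims: ..." plus a strided sample
}
#endif
```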
...@@ -98,24 +98,6 @@ class OpRegistry { ...@@ -98,24 +98,6 @@ class OpRegistry {
} }
}; };
#define REGISTER_OPERATOR_INT8(op_type, op_class, device_name, device_type) \
template class op_class<device_type, int8_t>; \
template <typename Dtype, typename T> \
class _OpClass_##op_type##_##device_name : public op_class<Dtype, T> { \
public: \
DEFINE_OP_CONSTRUCTOR(_OpClass_##op_type##_##device_name, op_class); \
}; \
static paddle_mobile::framework::OperatorRegistrar< \
device_type, _OpClass_##op_type##_##device_name<device_type, int8_t>> \
__op_registrar_##op_type##_##device_name(#op_type); \
int TouchOpRegistrar_##op_type##_##device_name() { \
__op_registrar_##op_type##_##device_name.Touch(); \
return 0; \
}
#define REGISTER_OPERATOR_CPU_INT8(op_type, op_class) \
REGISTER_OPERATOR_INT8(op_type, op_class, cpu, paddle_mobile::CPU);
#define REGISTER_OPERATOR(op_type, op_class, device_name, device_type) \ #define REGISTER_OPERATOR(op_type, op_class, device_name, device_type) \
template class op_class<device_type, float>; \ template class op_class<device_type, float>; \
template <typename Dtype, typename T> \ template <typename Dtype, typename T> \
......
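With the INT8 registration macros removed, operators register only through the float-typed REGISTER_OPERATOR path. A usage sketch with a placeholder operator follows; my_op and ops::MyOp are hypothetical, and real call sites look like the cast_op registration later in this diff.

```cpp
// Sketch: registering a hypothetical CPU operator via the remaining macro
// family. "my_op" / ops::MyOp are placeholders, not real operators.
namespace ops = paddle_mobile::operators;
#ifdef PADDLE_MOBILE_CPU
REGISTER_OPERATOR_CPU(my_op, ops::MyOp);
#endif
```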
...@@ -220,7 +220,16 @@ void Node::Folder( ...@@ -220,7 +220,16 @@ void Node::Folder(
} }
} else { } else {
for (auto &op_output : this->op_desc_->outputs_) { for (auto &op_output : this->op_desc_->outputs_) {
op_desc->outputs_.emplace(op_output.first, op_output.second); auto output_key = op_output.first;
if (change->find(this->type_) != change->end()) {
const auto change_pairs = (*change)[this->type_];
for (const auto &target : change_pairs) {
if (target.first == output_key) {
output_key = target.second;
}
}
}
op_desc->outputs_.emplace(output_key, op_output.second);
} }
for (auto &output : this->outputs_) { for (auto &output : this->outputs_) {
......
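The Folder change above mirrors on the output side what was already done for inputs: when the fusion matcher's change table names the folded node's type, matching output keys are rewritten before being merged into the fused op_desc. The snippet below only illustrates the assumed shape of that table; the real typedef lives in the fusion matcher headers and may differ.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Illustration only: assumed shape of the `change` table consulted above,
// mapping an op type to (original key, key to use in the fused op) pairs.
void ChangeTableExample() {
  std::map<std::string, std::vector<std::pair<std::string, std::string>>> change;
  change["batch_norm"] = {{"Y", "Out"}};  // hypothetical entry
  // Folding a batch_norm node with this entry emits its "Y" output under the
  // key "Out" in the fused operator's output map.
}
```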
...@@ -14,16 +14,15 @@ limitations under the License. */ ...@@ -14,16 +14,15 @@ limitations under the License. */
#pragma once #pragma once
#include <string>
#include "common/types.h" #include "common/types.h"
#include "framework/program/program_desc.h" #include "framework/program/program_desc.h"
#include "framework/scope.h" #include "framework/scope.h"
#include <string>
namespace paddle_mobile { namespace paddle_mobile {
namespace framework { namespace framework {
template <typename Dtype, Precision P = Precision::FP32> template <typename Device, typename T = float>
class Program { class Program {
public: public:
std::shared_ptr<ProgramDesc> originProgram; std::shared_ptr<ProgramDesc> originProgram;
......
...@@ -26,6 +26,7 @@ limitations under the License. */ ...@@ -26,6 +26,7 @@ limitations under the License. */
namespace paddle_mobile { namespace paddle_mobile {
namespace framework { namespace framework {
class Scope { class Scope {
public: public:
Scope() = default; Scope() = default;
......
...@@ -148,8 +148,8 @@ class Tensor : public TensorBase { ...@@ -148,8 +148,8 @@ class Tensor : public TensorBase {
PADDLE_MOBILE_ENFORCE( PADDLE_MOBILE_ENFORCE(
(std::is_same<T, void>::value || (std::is_same<T, void>::value ||
holder_->type().hash_code() == typeid(T).hash_code()), holder_->type().hash_code() == typeid(T).hash_code()),
"Tensor holds the wrong type, it holds %s", "Tensor holds the wrong type, it holds %s, requested %s",
this->holder_->type().name()); this->holder_->type().name(), typeid(T).name());
return reinterpret_cast<T *>(reinterpret_cast<uintptr_t>(holder_->ptr()) + return reinterpret_cast<T *>(reinterpret_cast<uintptr_t>(holder_->ptr()) +
offset_); offset_);
...@@ -162,7 +162,7 @@ class Tensor : public TensorBase { ...@@ -162,7 +162,7 @@ class Tensor : public TensorBase {
PADDLE_MOBILE_ENFORCE( PADDLE_MOBILE_ENFORCE(
(std::is_same<T, void>::value || (std::is_same<T, void>::value ||
holder_->type().hash_code() == typeid(T).hash_code()), holder_->type().hash_code() == typeid(T).hash_code()),
"Tensor holds the wrong type, it holds %s ,requested:%s", "Tensor holds the wrong type, it holds %s, requested %s",
this->holder_->type().name(), typeid(T).name()); this->holder_->type().name(), typeid(T).name());
return reinterpret_cast<const T *>( return reinterpret_cast<const T *>(
...@@ -226,7 +226,6 @@ inline Print &operator<<(Print &printer, const Tensor &tensor) { ...@@ -226,7 +226,6 @@ inline Print &operator<<(Print &printer, const Tensor &tensor) {
} }
} }
#endif #endif
return printer; return printer;
} }
......
...@@ -18,17 +18,17 @@ ...@@ -18,17 +18,17 @@
namespace paddle_mobile { namespace paddle_mobile {
template <typename Dtype, Precision P> template <typename Device, typename T>
PaddleMobilePredictor<Dtype, P>::PaddleMobilePredictor( PaddleMobilePredictor<Device, T>::PaddleMobilePredictor(
const PaddleMobileConfig &config) { const PaddleMobileConfig &config) {
PADDLE_MOBILE_ENFORCE(Init(config) == true, PADDLE_MOBILE_ENFORCE(Init(config) == true,
"paddle mobile predictor init failed!"); "paddle mobile predictor init failed!");
config_ = config; config_ = config;
} }
template <typename Dtype, Precision P> template <typename Device, typename T>
bool PaddleMobilePredictor<Dtype, P>::Init(const PaddleMobileConfig &config) { bool PaddleMobilePredictor<Device, T>::Init(const PaddleMobileConfig &config) {
paddle_mobile_.reset(new PaddleMobile<Dtype, P>()); paddle_mobile_.reset(new PaddleMobile<Device, T>());
#ifdef PADDLE_MOBILE_CL #ifdef PADDLE_MOBILE_CL
paddle_mobile_->SetCLPath(config.cl_path); paddle_mobile_->SetCLPath(config.cl_path);
#endif #endif
...@@ -52,8 +52,8 @@ bool PaddleMobilePredictor<Dtype, P>::Init(const PaddleMobileConfig &config) { ...@@ -52,8 +52,8 @@ bool PaddleMobilePredictor<Dtype, P>::Init(const PaddleMobileConfig &config) {
paddle_mobile_->SetThreadNum(config.thread_num); paddle_mobile_->SetThreadNum(config.thread_num);
return true; return true;
} }
template <typename Dtype, Precision P> template <typename Device, typename T>
bool PaddleMobilePredictor<Dtype, P>::Run( bool PaddleMobilePredictor<Device, T>::Run(
const std::vector<PaddleTensor> &inputs, const std::vector<PaddleTensor> &inputs,
std::vector<PaddleTensor> *output_data, int batch_size) { std::vector<PaddleTensor> *output_data, int batch_size) {
if (inputs.empty()) { if (inputs.empty()) {
...@@ -78,12 +78,12 @@ bool PaddleMobilePredictor<Dtype, P>::Run( ...@@ -78,12 +78,12 @@ bool PaddleMobilePredictor<Dtype, P>::Run(
framework::Tensor input_tensor; framework::Tensor input_tensor;
input_tensor.Resize(ddim); input_tensor.Resize(ddim);
int input_length = framework::product(ddim); int input_length = framework::product(ddim);
typedef typename PrecisionTrait<P>::ptype PType; auto input_ptr = input_tensor.mutable_data<T>();
auto input_ptr = input_tensor.mutable_data<PType>();
memcpy(input_ptr, static_cast<PType *>(input.data.data()), memcpy(input_ptr, static_cast<T *>(input.data.data()),
input_length * sizeof(PType)); input_length * sizeof(T));
auto output_tensor = paddle_mobile_->Predict(input_tensor); paddle_mobile_->Predict(input_tensor);
auto output_tensor = paddle_mobile_->Fetch();
if (output_data->empty()) { if (output_data->empty()) {
LOG(kLOG_ERROR) << "At least one output should be set with tensors' names."; LOG(kLOG_ERROR) << "At least one output should be set with tensors' names.";
...@@ -99,18 +99,18 @@ bool PaddleMobilePredictor<Dtype, P>::Run( ...@@ -99,18 +99,18 @@ bool PaddleMobilePredictor<Dtype, P>::Run(
output.shape.push_back(static_cast<int>(d)); output.shape.push_back(static_cast<int>(d));
} }
if (output.data.length() < output_length * sizeof(PType)) { if (output.data.length() < output_length * sizeof(T)) {
output.data.Resize(output_length * sizeof(PType)); output.data.Resize(output_length * sizeof(T));
} }
memcpy(output.data.data(), output_tensor->template data<PType>(), memcpy(output.data.data(), output_tensor->template data<T>(),
output_length * sizeof(PType)); output_length * sizeof(T));
return true; return true;
} }
template <typename Dtype, Precision P> template <typename Device, typename T>
PaddleMobilePredictor<Dtype, P>::~PaddleMobilePredictor() { PaddleMobilePredictor<Device, T>::~PaddleMobilePredictor() {
paddle_mobile_->Clear(); paddle_mobile_->Clear();
} }
...@@ -122,13 +122,13 @@ CreatePaddlePredictor<PaddleMobileConfig, PaddleEngineKind::kPaddleMobile>( ...@@ -122,13 +122,13 @@ CreatePaddlePredictor<PaddleMobileConfig, PaddleEngineKind::kPaddleMobile>(
std::unique_ptr<PaddlePredictor> x; std::unique_ptr<PaddlePredictor> x;
if (config.precision == PaddleMobileConfig::FP32) { if (config.precision == PaddleMobileConfig::FP32) {
if (config.device == PaddleMobileConfig::kCPU) { if (config.device == PaddleMobileConfig::kCPU) {
x.reset(new PaddleMobilePredictor<CPU, Precision::FP32>(config)); x.reset(new PaddleMobilePredictor<CPU, float>(config));
} else if (config.device == PaddleMobileConfig::kFPGA) { } else if (config.device == PaddleMobileConfig::kFPGA) {
x.reset(new PaddleMobilePredictor<FPGA, Precision::FP32>(config)); x.reset(new PaddleMobilePredictor<FPGA, float>(config));
} else if (config.device == PaddleMobileConfig::kGPU_MALI) { } else if (config.device == PaddleMobileConfig::kGPU_MALI) {
x.reset(new PaddleMobilePredictor<GPU_MALI, Precision::FP32>(config)); x.reset(new PaddleMobilePredictor<GPU_MALI, float>(config));
} else if (config.device == PaddleMobileConfig::kGPU_CL) { } else if (config.device == PaddleMobileConfig::kGPU_CL) {
x.reset(new PaddleMobilePredictor<GPU_CL, Precision::FP32>(config)); x.reset(new PaddleMobilePredictor<GPU_CL, float>(config));
} else { } else {
LOG(kLOG_ERROR) << "unsupport device type!"; LOG(kLOG_ERROR) << "unsupport device type!";
return nullptr; return nullptr;
......
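For completeness, a sketch of creating a CPU/FP32 predictor through this factory. The precision, device and thread_num fields appear in this diff; model_dir is an assumed field name, and the path is a placeholder.

```cpp
// Sketch: building a CPU predictor via the public CreatePaddlePredictor API.
std::unique_ptr<paddle_mobile::PaddlePredictor> MakeCpuPredictor() {
  paddle_mobile::PaddleMobileConfig config;
  config.precision = paddle_mobile::PaddleMobileConfig::FP32;
  config.device = paddle_mobile::PaddleMobileConfig::kCPU;
  config.thread_num = 2;
  config.model_dir = "./mobilenet";  // assumed config field for the model path

  return paddle_mobile::CreatePaddlePredictor<
      paddle_mobile::PaddleMobileConfig,
      paddle_mobile::PaddleEngineKind::kPaddleMobile>(config);
}
```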
...@@ -29,7 +29,7 @@ limitations under the License. */ ...@@ -29,7 +29,7 @@ limitations under the License. */
namespace paddle_mobile { namespace paddle_mobile {
template <typename Dtype = CPU, Precision P = Precision::FP32> template <typename Device = CPU, typename T = float>
class PaddleMobilePredictor : public PaddlePredictor { class PaddleMobilePredictor : public PaddlePredictor {
public: public:
PaddleMobilePredictor() = delete; PaddleMobilePredictor() = delete;
...@@ -43,7 +43,7 @@ class PaddleMobilePredictor : public PaddlePredictor { ...@@ -43,7 +43,7 @@ class PaddleMobilePredictor : public PaddlePredictor {
~PaddleMobilePredictor() override; ~PaddleMobilePredictor() override;
private: private:
std::unique_ptr<PaddleMobile<Dtype, P>> paddle_mobile_; std::unique_ptr<PaddleMobile<Device, T>> paddle_mobile_;
bool Init(const PaddleMobileConfig& config); bool Init(const PaddleMobileConfig& config);
PaddleMobileConfig config_; PaddleMobileConfig config_;
......
...@@ -59,7 +59,7 @@ ...@@ -59,7 +59,7 @@
@interface PaddleMobileCPU() @interface PaddleMobileCPU()
{ {
paddle_mobile::PaddleMobile<paddle_mobile::CPU, paddle_mobile::Precision::FP32> *pam_; paddle_mobile::PaddleMobile<paddle_mobile::CPU, float> *pam_;
BOOL loaded_; BOOL loaded_;
} }
...@@ -73,7 +73,7 @@ static std::mutex shared_mutex; ...@@ -73,7 +73,7 @@ static std::mutex shared_mutex;
- (instancetype)initWithConfig:(PaddleMobileCPUConfig *)config { - (instancetype)initWithConfig:(PaddleMobileCPUConfig *)config {
if (self = [super init]) { if (self = [super init]) {
pam_ = new paddle_mobile::PaddleMobile<paddle_mobile::CPU, paddle_mobile::Precision::FP32>(); pam_ = new paddle_mobile::PaddleMobile<paddle_mobile::CPU, float>();
_config = config; _config = config;
} }
return self; return self;
...@@ -82,6 +82,7 @@ static std::mutex shared_mutex; ...@@ -82,6 +82,7 @@ static std::mutex shared_mutex;
-(instancetype)init { -(instancetype)init {
if (self = [super init]) { if (self = [super init]) {
_config = [[PaddleMobileCPUConfig alloc] init]; _config = [[PaddleMobileCPUConfig alloc] init];
pam_ = new paddle_mobile::PaddleMobile<paddle_mobile::CPU, float>();
} }
return self; return self;
} }
...@@ -246,7 +247,8 @@ static std::mutex shared_mutex; ...@@ -246,7 +247,8 @@ static std::mutex shared_mutex;
memcpy(input_ptr, input, memcpy(input_ptr, input,
numel * sizeof(float)); numel * sizeof(float));
std::shared_ptr<paddle_mobile::framework::Tensor> output = pam_->Predict(input_tensor); pam_->Predict(input_tensor);
std::shared_ptr<paddle_mobile::framework::Tensor> output = pam_->Fetch();
float *output_pointer = new float[output->numel()]; float *output_pointer = new float[output->numel()];
......
...@@ -16,21 +16,23 @@ limitations under the License. */ ...@@ -16,21 +16,23 @@ limitations under the License. */
#include "paddle_mobile_jni.h" #include "paddle_mobile_jni.h"
#include <cmath> #include <cmath>
#include <string>
#include <vector>
#include "common/log.h" #include "common/log.h"
#include "framework/tensor.h" #include "framework/tensor.h"
#include "io/paddle_mobile.h" #include "io/paddle_mobile.h"
#ifdef ENABLE_EXCEPTION #ifdef ENABLE_EXCEPTION
#include "common/enforce.h" #include "common/enforce.h"
#endif #endif
#ifdef __cplusplus #ifdef __cplusplus
extern "C" { extern "C" {
#endif #endif
namespace paddle_mobile { namespace paddle_mobile {
namespace jni { namespace jni {
using framework::DDim; using framework::DDim;
using framework::Program; using framework::Program;
using framework::Tensor; using framework::Tensor;
...@@ -200,7 +202,8 @@ JNIEXPORT jfloatArray JNICALL Java_com_baidu_paddle_PML_predictImage( ...@@ -200,7 +202,8 @@ JNIEXPORT jfloatArray JNICALL Java_com_baidu_paddle_PML_predictImage(
for (int i = 0; i < length; i++) { for (int i = 0; i < length; i++) {
input_ptr[i] = dataPointer[i]; input_ptr[i] = dataPointer[i];
} }
auto output = getPaddleMobileInstance()->Predict(input); getPaddleMobileInstance()->Predict(input);
auto output = getPaddleMobileInstance()->Fetch();
count = output->numel(); count = output->numel();
result = env->NewFloatArray(count); result = env->NewFloatArray(count);
env->SetFloatArrayRegion(result, 0, count, output->data<float>()); env->SetFloatArrayRegion(result, 0, count, output->data<float>());
...@@ -233,7 +236,8 @@ JNIEXPORT jfloatArray JNICALL Java_com_baidu_paddle_PML_predictImage( ...@@ -233,7 +236,8 @@ JNIEXPORT jfloatArray JNICALL Java_com_baidu_paddle_PML_predictImage(
for (int i = 0; i < length; i++) { for (int i = 0; i < length; i++) {
input_ptr[i] = dataPointer[i]; input_ptr[i] = dataPointer[i];
} }
auto output = getPaddleMobileInstance()->Predict(input); getPaddleMobileInstance()->Predict(input);
auto output = getPaddleMobileInstance()->Fetch();
count = output->numel(); count = output->numel();
result = env->NewFloatArray(count); result = env->NewFloatArray(count);
env->SetFloatArrayRegion(result, 0, count, output->data<float>()); env->SetFloatArrayRegion(result, 0, count, output->data<float>());
...@@ -328,7 +332,8 @@ JNIEXPORT jfloatArray JNICALL Java_com_baidu_paddle_PML_predictYuv( ...@@ -328,7 +332,8 @@ JNIEXPORT jfloatArray JNICALL Java_com_baidu_paddle_PML_predictYuv(
for (int i = 0; i < length; i++) { for (int i = 0; i < length; i++) {
input_ptr[i] = matrix[i]; input_ptr[i] = matrix[i];
} }
auto output = getPaddleMobileInstance()->Predict(input); getPaddleMobileInstance()->Predict(input);
auto output = getPaddleMobileInstance()->Fetch();
count = output->numel(); count = output->numel();
result = env->NewFloatArray(count); result = env->NewFloatArray(count);
env->SetFloatArrayRegion(result, 0, count, output->data<float>()); env->SetFloatArrayRegion(result, 0, count, output->data<float>());
...@@ -363,7 +368,8 @@ JNIEXPORT jfloatArray JNICALL Java_com_baidu_paddle_PML_predictYuv( ...@@ -363,7 +368,8 @@ JNIEXPORT jfloatArray JNICALL Java_com_baidu_paddle_PML_predictYuv(
for (int i = 0; i < length; i++) { for (int i = 0; i < length; i++) {
input_ptr[i] = matrix[i]; input_ptr[i] = matrix[i];
} }
auto output = getPaddleMobileInstance()->Predict(input); getPaddleMobileInstance()->Predict(input);
auto output = getPaddleMobileInstance()->Fetch();
count = output->numel(); count = output->numel();
result = env->NewFloatArray(count); result = env->NewFloatArray(count);
env->SetFloatArrayRegion(result, 0, count, output->data<float>()); env->SetFloatArrayRegion(result, 0, count, output->data<float>());
...@@ -399,7 +405,8 @@ Java_com_baidu_paddle_PML_predictLod(JNIEnv *env, jclass thiz, jlongArray buf) { ...@@ -399,7 +405,8 @@ Java_com_baidu_paddle_PML_predictLod(JNIEnv *env, jclass thiz, jlongArray buf) {
auto *pdata = words.mutable_data<int64_t>(); auto *pdata = words.mutable_data<int64_t>();
size_t n = words.numel() * sizeof(int64_t); size_t n = words.numel() * sizeof(int64_t);
memcpy(pdata, ids.data(), n); memcpy(pdata, ids.data(), n);
auto vec_result = paddle_mobile.PredictLod(words); paddle_mobile.Predict(words);
auto vec_result = paddle_mobile.Fetch();
int count = vec_result->numel(); int count = vec_result->numel();
jlongArray result = NULL; jlongArray result = NULL;
ANDROIDLOGE("predict nlp size %d", count); ANDROIDLOGE("predict nlp size %d", count);
......
...@@ -13,66 +13,68 @@ See the License for the specific language governing permissions and ...@@ -13,66 +13,68 @@ See the License for the specific language governing permissions and
limitations under the License. */ limitations under the License. */
#include "io/paddle_mobile.h" #include "io/paddle_mobile.h"
#include <utility>
#include "common/common.h"
#ifdef PADDLE_MOBILE_CL #ifdef PADDLE_MOBILE_CL
#include <CL/cl.h> #include <CL/cl.h>
#include "framework/cl/cl_tensor.h" #include "framework/cl/cl_tensor.h"
#endif #endif
#include "common/common.h"
#include "operators/math/gemm.h" #include "operators/math/gemm.h"
namespace paddle_mobile { namespace paddle_mobile {
template <typename Dtype, Precision P> template <typename Device, typename T>
void PaddleMobile<Dtype, P>::SetThreadNum(int num) { void PaddleMobile<Device, T>::SetThreadNum(int num) {
#ifdef _OPENMP #ifdef _OPENMP
omp_set_num_threads(num); omp_set_num_threads(num);
#endif #endif
} }
template <typename Dtype, Precision P> template <typename Device, typename T>
bool PaddleMobile<Dtype, P>::Load(const std::string &dirname, bool optimize, PMStatus PaddleMobile<Device, T>::Load(const std::string &dirname,
bool quantification, int batch_size, bool optimize, bool quantification,
bool loddable) { int batch_size, bool loddable) {
if (loader_.get() == nullptr) { if (loader_.get() == nullptr) {
loader_ = std::make_shared<framework::Loader<Dtype, P>>(); loader_ = std::make_shared<framework::Loader<Device, T>>();
} else { } else {
LOG(kLOG_INFO) << "loader inited"; LOG(kLOG_INFO) << "loader inited";
} }
if (executor_.get() == nullptr) { if (executor_.get() == nullptr) {
executor_ = std::make_shared<framework::Executor<Dtype, P>>( executor_ = std::make_shared<framework::Executor<Device, T>>(
loader_->Load(dirname, optimize, quantification), batch_size, optimize, loader_->Load(dirname, optimize, quantification), batch_size, optimize,
loddable); loddable);
} else { } else {
LOG(kLOG_INFO) << "executor inited"; LOG(kLOG_INFO) << "executor inited";
} }
return true; return PMSuccess;
} }
template <typename Dtype, Precision P> template <typename Device, typename T>
bool PaddleMobile<Dtype, P>::Load(const std::string &model_path, PMStatus PaddleMobile<Device, T>::Load(const std::string &model_path,
const std::string &para_path, bool optimize, const std::string &para_path,
bool quantification, int batch_size, bool optimize, bool quantification,
bool loddable) { int batch_size, bool loddable) {
if (loader_.get() == nullptr) { if (loader_.get() == nullptr) {
loader_ = std::make_shared<framework::Loader<Dtype, P>>(); loader_ = std::make_shared<framework::Loader<Device, T>>();
} else { } else {
LOG(kLOG_INFO) << "loader inited"; LOG(kLOG_INFO) << "loader inited";
} }
if (executor_.get() == nullptr) { if (executor_.get() == nullptr) {
executor_ = std::make_shared<framework::Executor<Dtype, P>>( executor_ = std::make_shared<framework::Executor<Device, T>>(
loader_->Load(model_path, para_path, optimize, quantification), loader_->Load(model_path, para_path, optimize, quantification),
batch_size, optimize, loddable); batch_size, optimize, loddable);
} else { } else {
LOG(kLOG_INFO) << "executor inited"; LOG(kLOG_INFO) << "executor inited";
} }
return true; return PMSuccess;
} }
template <typename Dtype, Precision P> template <typename Dtype, typename T>
bool PaddleMobile<Dtype, P>::LoadCombinedMemory(size_t model_len, bool PaddleMobile<Dtype, T>::LoadCombinedMemory(size_t model_len,
const uint8_t *model_buf, const uint8_t *model_buf,
size_t combined_params_len, size_t combined_params_len,
uint8_t *combined_params_buf, uint8_t *combined_params_buf,
...@@ -80,13 +82,12 @@ bool PaddleMobile<Dtype, P>::LoadCombinedMemory(size_t model_len, ...@@ -80,13 +82,12 @@ bool PaddleMobile<Dtype, P>::LoadCombinedMemory(size_t model_len,
bool quantification, int batch_size, bool quantification, int batch_size,
bool loddable) { bool loddable) {
if (loader_.get() == nullptr) { if (loader_.get() == nullptr) {
loader_ = std::make_shared<framework::Loader<Dtype, P>>(); loader_ = std::make_shared<framework::Loader<Device, T>>();
} else { } else {
LOG(kLOG_INFO) << "loader inited"; LOG(kLOG_INFO) << "loader inited";
} }
if (executor_.get() == nullptr) { if (executor_.get() == nullptr) {
executor_ = std::make_shared<framework::Executor<Dtype, P>>( executor_ = std::make_shared<framework::Executor<Device, T>>(
loader_->LoadCombinedMemory(model_len, model_buf, combined_params_len, loader_->LoadCombinedMemory(model_len, model_buf, combined_params_len,
combined_params_buf, optimize, combined_params_buf, optimize,
quantification), quantification),
...@@ -95,38 +96,76 @@ bool PaddleMobile<Dtype, P>::LoadCombinedMemory(size_t model_len, ...@@ -95,38 +96,76 @@ bool PaddleMobile<Dtype, P>::LoadCombinedMemory(size_t model_len,
LOG(kLOG_INFO) << "executor inited"; LOG(kLOG_INFO) << "executor inited";
} }
return true; return PMSuccess;
} }
template <typename Dtype, Precision P>
std::shared_ptr<framework::Tensor> PaddleMobile<Dtype, P>::Predict( template <typename Device, typename T>
const framework::Tensor &t) { PMStatus PaddleMobile<Device, T>::Predict(const framework::Tensor &input) {
return executor_->Predict(t); std::vector<std::pair<std::string, framework::Tensor>> inputs;
inputs.push_back(std::make_pair("feed", input));
return this->Predict(inputs);
}
template <typename Device, typename T>
PMStatus PaddleMobile<Device, T>::Predict(const framework::LoDTensor &input) {
std::vector<std::pair<std::string, framework::LoDTensor>> inputs;
inputs.push_back(std::make_pair("feed", input));
return this->Predict(inputs);
} }
template <typename Dtype, Precision P> template <typename Device, typename T>
std::shared_ptr<framework::Tensor> PaddleMobile<Dtype, P>::PredictLod( PMStatus PaddleMobile<Device, T>::Predict(
const framework::LoDTensor &t) { const std::vector<std::pair<std::string, framework::Tensor>> &inputs) {
return executor_->PredictLod(t); return executor_->Predict(inputs);
} }
template <typename Dtype, Precision P> template <typename Device, typename T>
std::vector<typename PaddleMobile<Dtype, P>::Ptype> PMStatus PaddleMobile<Device, T>::Predict(
PaddleMobile<Dtype, P>::Predict(const std::vector<Ptype> &input, const std::vector<std::pair<std::string, framework::LoDTensor>> &inputs) {
const std::vector<int64_t> &dims) { return executor_->Predict(inputs);
}
template <typename Device, typename T>
std::vector<T> PaddleMobile<Device, T>::Predict(
const std::vector<T> &input, const std::vector<int64_t> &dims) {
return executor_->Predict(input, dims); return executor_->Predict(input, dims);
} }
template <typename Dtype, Precision P> template <typename Device, typename T>
void PaddleMobile<Dtype, P>::Clear() { PMStatus PaddleMobile<Device, T>::Predict() {
return executor_->Predict();
}
template <typename Device, typename T>
void PaddleMobile<Device, T>::Feed(const framework::Tensor &input,
const std::string &var_name) {
executor_->SetInput(input, var_name);
}
template <typename Device, typename T>
void PaddleMobile<Device, T>::Feed(const framework::LoDTensor &input,
const std::string &var_name) {
executor_->SetInput(input, var_name);
}
typedef std::shared_ptr<framework::LoDTensor> LoDTensorPtr;
template <typename Device, typename T>
LoDTensorPtr PaddleMobile<Device, T>::Fetch(const std::string &var_name) {
return executor_->GetOutput(var_name);
}
template <typename Device, typename T>
void PaddleMobile<Device, T>::Clear() {
executor_ = nullptr; executor_ = nullptr;
loader_ = nullptr; loader_ = nullptr;
} }
template <typename Dtype, Precision P>
double PaddleMobile<Dtype, P>::GetPredictTime() {} template <typename Device, typename T>
double PaddleMobile<Device, T>::GetPredictTime() {}
#ifdef PADDLE_MOBILE_CPU #ifdef PADDLE_MOBILE_CPU
template <> template <>
double PaddleMobile<CPU, Precision::FP32>::GetPredictTime() { double PaddleMobile<CPU, float>::GetPredictTime() {
int m = 32; int m = 32;
int n = 224 * 224; int n = 224 * 224;
int k = 27; int k = 27;
...@@ -142,14 +181,13 @@ double PaddleMobile<CPU, Precision::FP32>::GetPredictTime() { ...@@ -142,14 +181,13 @@ double PaddleMobile<CPU, Precision::FP32>::GetPredictTime() {
int t1 = 1; int t1 = 1;
int t2 = 1; int t2 = 1;
for (int i = 0; i < m * k; ++i) { for (int i = 0; i < m * k; ++i) {
unsigned int seed = 100; a[i] = t1 + rand() % t2; // NOLINT
a[i] = t1 + rand_r(&seed) % t2;
} }
for (int i = 0; i < k * n; ++i) { for (int i = 0; i < k * n; ++i) {
unsigned int seed = 200; b[i] = t1 + rand() % t2; // NOLINT
b[i] = t1 + rand_r(&seed) % t2;
} }
paddle_mobile::operators::math::Gemm gemm;
operators::math::Gemm gemm;
auto time1 = paddle_mobile::time(); auto time1 = paddle_mobile::time();
gemm.Sgemm(m, n, k, static_cast<float>(1), a, lda, b, ldb, gemm.Sgemm(m, n, k, static_cast<float>(1), a, lda, b, ldb,
static_cast<float>(0), c, ldc, false, static_cast<float>(0), c, ldc, false,
...@@ -163,57 +201,51 @@ double PaddleMobile<CPU, Precision::FP32>::GetPredictTime() { ...@@ -163,57 +201,51 @@ double PaddleMobile<CPU, Precision::FP32>::GetPredictTime() {
} }
#endif #endif
template <typename Dtype, Precision P>
PaddleMobile<Dtype, P>::~PaddleMobile() {
executor_ = nullptr;
loader_ = nullptr;
}
#ifdef PADDLE_MOBILE_FPGA #ifdef PADDLE_MOBILE_FPGA
template <typename Device, typename T>
template <typename Dtype, Precision P> void PaddleMobile<Device, T>::InjectVariable(const framework::Tensor &t,
void PaddleMobile<Dtype, P>::InjectVariable(const framework::Tensor &t, std::string var_name) {
std::string var_name) {
executor_->InjectVariable(t, var_name); executor_->InjectVariable(t, var_name);
} }
template <typename Dtype, Precision P> template <typename Device, typename T>
void PaddleMobile<Dtype, P>::FeedData(const framework::Tensor &t) { void PaddleMobile<Device, T>::FeedData(const framework::Tensor &t) {
executor_->FeedData(t); executor_->FeedData(t);
} }
template <typename Dtype, Precision P> template <typename Device, typename T>
std::shared_ptr<framework::Tensor> PaddleMobile<Dtype, P>::FetchResult(int id) { std::shared_ptr<framework::Tensor> PaddleMobile<Device, T>::FetchResult(
int id) {
return executor_->FetchResult(id); return executor_->FetchResult(id);
} }
template <typename Dtype, Precision P> template <typename Device, typename T>
void PaddleMobile<Dtype, P>::Predict_From_To(int start, int end) { void PaddleMobile<Device, T>::Predict_From_To(int start, int end) {
executor_->Predict_From_To(start, end); executor_->Predict_From_To(start, end);
} }
template <typename Dtype, Precision P> template <typename Device, typename T>
void PaddleMobile<Dtype, P>::Predict_From(int start) { void PaddleMobile<Device, T>::Predict_From(int start) {
executor_->Predict_From(start); executor_->Predict_From(start);
} }
template <typename Dtype, Precision P> template <typename Device, typename T>
void PaddleMobile<Dtype, P>::Predict_To(int end) { void PaddleMobile<Device, T>::Predict_To(int end) {
executor_->Predict_To(end); executor_->Predict_To(end);
} }
#endif #endif
#ifdef PADDLE_MOBILE_CL #ifdef PADDLE_MOBILE_CL
static std::mutex lc; static std::mutex lc;
template <typename Dtype, Precision P> template <typename Device, typename T>
void PaddleMobile<Dtype, P>::SetCLPath(std::string path) { void PaddleMobile<Device, T>::SetCLPath(std::string path) {
std::lock_guard<std::mutex> lock(lc); std::lock_guard<std::mutex> lock(lc);
if (framework::CLEngine::Instance()->GetCLPath() == "") { if (framework::CLEngine::Instance()->GetCLPath() == "") {
framework::CLEngine::Instance()->setClPath(path); framework::CLEngine::Instance()->setClPath(path);
} }
} }
template <> template <>
double PaddleMobile<GPU_CL, Precision::FP32>::GetPredictTime() { double PaddleMobile<GPU_CL, float>::GetPredictTime() {
cl_int status; cl_int status;
cl_uint nPlatform; cl_uint nPlatform;
clGetPlatformIDs(0, NULL, &nPlatform); clGetPlatformIDs(0, NULL, &nPlatform);
...@@ -411,8 +443,8 @@ double PaddleMobile<GPU_CL, Precision::FP32>::GetPredictTime() { ...@@ -411,8 +443,8 @@ double PaddleMobile<GPU_CL, Precision::FP32>::GetPredictTime() {
return -1; return -1;
} }
} }
template <typename Dtype, Precision P> template <typename Device, typename T>
int PaddleMobile<Dtype, P>::readText( int PaddleMobile<Device, T>::readText(
const char *kernelPath, const char *kernelPath,
char **pcode) { // read the text file into pcode, return the string length char **pcode) { // read the text file into pcode, return the string length
FILE *fp; FILE *fp;
...@@ -441,13 +473,11 @@ int PaddleMobile<Dtype, P>::readText( ...@@ -441,13 +473,11 @@ int PaddleMobile<Dtype, P>::readText(
fclose(fp); fclose(fp);
return size + 1; return size + 1;
} }
#endif #endif
template class PaddleMobile<CPU, Precision::FP32>; template class PaddleMobile<CPU, float>;
template class PaddleMobile<FPGA, Precision::FP32>; template class PaddleMobile<FPGA, float>;
template class PaddleMobile<GPU_MALI, Precision::FP32>; template class PaddleMobile<GPU_MALI, float>;
template class PaddleMobile<GPU_CL, float>;
template class PaddleMobile<GPU_CL, Precision::FP32>;
} // namespace paddle_mobile } // namespace paddle_mobile
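Taken together, these changes replace the old `auto output = Predict(input)` style with an explicit run-then-fetch flow: Predict(Tensor) binds the "feed" variable internally, and Fetch() reads "fetch". A minimal end-to-end sketch on CPU follows; the model path is a placeholder and PMSuccess is assumed to be the success value of the PMStatus type used above.

```cpp
// Sketch: the new load / predict / fetch flow on CPU introduced by this diff.
int RunMobileNetOnce() {
  paddle_mobile::PaddleMobile<paddle_mobile::CPU, float> engine;
  engine.SetThreadNum(4);
  if (engine.Load("./mobilenet", /*optimize=*/true) !=
      paddle_mobile::PMSuccess) {
    return -1;
  }

  paddle_mobile::framework::Tensor input;
  input.Resize(paddle_mobile::framework::make_ddim({1, 3, 224, 224}));
  input.mutable_data<float>();   // fill with preprocessed image data

  engine.Predict(input);         // feeds the "feed" variable and runs
  auto output = engine.Fetch();  // std::shared_ptr<LoDTensor> for "fetch"
  return output->numel() > 0 ? 0 : -1;
}
```

The iOS and JNI wrappers earlier in this diff were updated to the same two-step Predict-then-Fetch pattern.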
...@@ -16,6 +16,7 @@ limitations under the License. */ ...@@ -16,6 +16,7 @@ limitations under the License. */
#include <memory> #include <memory>
#include <string> #include <string>
#include <utility>
#include <vector> #include <vector>
#ifdef _OPENMP #ifdef _OPENMP
#include <omp.h> #include <omp.h>
...@@ -32,44 +33,53 @@ limitations under the License. */ ...@@ -32,44 +33,53 @@ limitations under the License. */
namespace paddle_mobile { namespace paddle_mobile {
template <typename Dtype = CPU, Precision P = Precision::FP32> template <typename Device, typename T = float>
class PaddleMobile { class PaddleMobile {
typedef typename PrecisionTrait<P>::ptype Ptype;
public: public:
PaddleMobile() { PaddleMobile() {
#ifndef PADDLE_MOBILE_CL #ifndef PADDLE_MOBILE_CL
bool is_gpu = std::is_same<DeviceType<kGPU_CL>, Dtype>::value; bool is_gpu = std::is_same<DeviceType<kGPU_CL>, Device>::value;
PADDLE_MOBILE_ENFORCE(!is_gpu, PADDLE_MOBILE_ENFORCE(!is_gpu, "Please recompile with GPU_CL turned on");
"Not Enable GPU in CmakeList but run gpu codes ");
#endif #endif
} }
bool Load(const std::string &dirname, bool optimize = false, ~PaddleMobile() {}
bool quantification = false, int batch_size = 1,
bool loddable = false); PMStatus Load(const std::string &dirname, const bool optimize = false,
const bool quantification = false, const int batch_size = 1,
const bool lod = false);
PMStatus Load(const std::string &model_path, const std::string &para_path,
const bool optimize = false, const bool quantification = false,
const int batch_size = 1, const bool lod = false);
PMStatus Predict(const framework::Tensor &input);
PMStatus Predict(const framework::LoDTensor &input);
bool Load(const std::string &model_path, const std::string &para_path, PMStatus Predict(
bool optimize = false, bool quantification = false, const std::vector<std::pair<std::string, framework::Tensor>> &inputs);
int batch_size = 1, bool loddable = false); PMStatus Predict(
const std::vector<std::pair<std::string, framework::LoDTensor>> &inputs);
std::shared_ptr<framework::Tensor> Predict(const framework::Tensor &t); std::vector<T> Predict(const std::vector<T> &input,
const std::vector<int64_t> &dims);
PMStatus Predict();
std::shared_ptr<framework::Tensor> PredictLod(const framework::LoDTensor &t); void Feed(const framework::LoDTensor &input, const std::string &var_name);
void Feed(const framework::Tensor &input, const std::string &var_name);
std::vector<Ptype> Predict(const std::vector<Ptype> &input, typedef std::shared_ptr<framework::LoDTensor> LoDTensorPtr;
const std::vector<int64_t> &dims); LoDTensorPtr Fetch(const std::string &var_name);
LoDTensorPtr Fetch() { return Fetch("fetch"); }
bool LoadCombinedMemory(size_t model_len, const uint8_t *model_buf, bool LoadCombinedMemory(size_t model_len, const uint8_t *model_buf,
size_t combined_params_len, size_t combined_params_len,
uint8_t *combined_params_buf, bool optimize = false, bool quantification = false, uint8_t *combined_params_buf, bool optimize = false, bool quantification = false,
int batch_size = 1, bool loddable = false); int batch_size = 1, bool loddable = false);
void SetThreadNum(int num); void SetThreadNum(int count);
void Clear(); void Clear();
double GetPredictTime(); double GetPredictTime();
~PaddleMobile();
#ifdef PADDLE_MOBILE_FPGA #ifdef PADDLE_MOBILE_FPGA
void InjectVariable(const framework::Tensor &t, std::string var_name); void InjectVariable(const framework::Tensor &t, std::string var_name);
void FeedData(const framework::Tensor &t); void FeedData(const framework::Tensor &t);
...@@ -80,15 +90,15 @@ class PaddleMobile { ...@@ -80,15 +90,15 @@ class PaddleMobile {
#endif #endif
#ifdef PADDLE_MOBILE_CL #ifdef PADDLE_MOBILE_CL
public: public: // NOLINT
void SetCLPath(std::string cl_path); void SetCLPath(std::string cl_path);
int readText(const char *kernelPath, int readText(const char *kernelPath,
char **pcode); // read the text file into pcode, return the string length char **pcode); // read the text file into pcode, return the string length
#endif #endif
private: private:
std::shared_ptr<framework::Loader<Dtype, P>> loader_; std::shared_ptr<framework::Loader<Device, T>> loader_;
std::shared_ptr<framework::Executor<Dtype, P>> executor_; std::shared_ptr<framework::Executor<Device, T>> executor_;
}; };
} // namespace paddle_mobile } // namespace paddle_mobile
...@@ -14,10 +14,12 @@ limitations under the License. */ ...@@ -14,10 +14,12 @@ limitations under the License. */
#include "io/paddle_test_inference_api.h" #include "io/paddle_test_inference_api.h"
#include "io/paddle_mobile.h" #include "io/paddle_mobile.h"
namespace paddle_mobile { namespace paddle_mobile {
template <typename Dtype, Precision P>
double PaddleTester<Dtype, P>::CaculatePredictTime(std::string *cl_path) { template <typename Device, typename T>
PaddleMobile<Dtype, P> paddle_mobile; double PaddleTester<Device, T>::CaculatePredictTime(std::string *cl_path) {
PaddleMobile<Device, T> paddle_mobile;
#ifdef PADDLE_MOBILE_CL #ifdef PADDLE_MOBILE_CL
if (cl_path) { if (cl_path) {
paddle_mobile.SetCLPath(*cl_path); paddle_mobile.SetCLPath(*cl_path);
...@@ -26,10 +28,10 @@ double PaddleTester<Dtype, P>::CaculatePredictTime(std::string *cl_path) { ...@@ -26,10 +28,10 @@ double PaddleTester<Dtype, P>::CaculatePredictTime(std::string *cl_path) {
#endif #endif
return paddle_mobile.GetPredictTime(); return paddle_mobile.GetPredictTime();
} }
template class PaddleTester<CPU, Precision::FP32>; template class PaddleTester<CPU, float>;
template class PaddleTester<FPGA, Precision::FP32>; template class PaddleTester<FPGA, float>;
template class PaddleTester<GPU_MALI, Precision::FP32>; template class PaddleTester<GPU_MALI, float>;
template class PaddleTester<GPU_CL, Precision::FP32>; template class PaddleTester<GPU_CL, float>;
} // namespace paddle_mobile } // namespace paddle_mobile
...@@ -20,10 +20,13 @@ limitations under the License. */ ...@@ -20,10 +20,13 @@ limitations under the License. */
*/ */
#pragma once #pragma once
#include "common/types.h" #include "common/types.h"
#include "string" #include "string"
namespace paddle_mobile { namespace paddle_mobile {
template <typename Dtype, Precision P = Precision::FP32>
template <typename Device, typename T = float>
class PaddleTester { class PaddleTester {
public: public:
double CaculatePredictTime(std::string *cl_path = nullptr); double CaculatePredictTime(std::string *cl_path = nullptr);
......
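Usage of the tester stays a one-liner; only the template arguments changed. A sketch follows (on GPU_CL builds the OpenCL kernel path can be passed in instead of nullptr).

```cpp
// Sketch: rough GEMM-based timing through PaddleTester on CPU.
double MeasureCpuGemmTime() {
  paddle_mobile::PaddleTester<paddle_mobile::CPU, float> tester;
  return tester.CaculatePredictTime();  // spelling follows the header above
}
```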
...@@ -12,26 +12,25 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. ...@@ -12,26 +12,25 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and See the License for the specific language governing permissions and
limitations under the License. */ limitations under the License. */
#pragma once #ifdef CAST_OP
#ifdef FUSION_DEQUANT_ADD_BN_OP #include "operators/cast_op.h"
#include "framework/operator.h"
#include "operators/op_param.h"
namespace paddle_mobile { namespace paddle_mobile {
namespace operators { namespace operators {
template <typename DeviceType, typename T> template <typename DeviceType, typename T>
class FusionDequantAddBNKernel void CastOp<DeviceType, T>::InferShape() const {
: public framework::OpKernelBase<DeviceType, const auto &dims = this->param_.input_->dims();
FusionDequantAddBNParam<DeviceType>> { this->param_.output_->Resize(dims);
public: }
void Compute(const FusionDequantAddBNParam<DeviceType> &param);
bool Init(FusionDequantAddBNParam<DeviceType> *param);
};
} // namespace operators } // namespace operators
} // namespace paddle_mobile } // namespace paddle_mobile
namespace ops = paddle_mobile::operators;
#ifdef PADDLE_MOBILE_CPU
REGISTER_OPERATOR_CPU(cast, ops::CastOp);
#endif #endif
#endif // CAST_OP
...@@ -12,39 +12,34 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. ...@@ -12,39 +12,34 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and See the License for the specific language governing permissions and
limitations under the License. */ limitations under the License. */
#ifdef FUSION_FC_INT8_OP #ifdef CAST_OP
#pragma once #pragma once
#include <string> #include <string>
#include <vector>
#include "framework/operator.h" #include "framework/operator.h"
#include "framework/program/program-optimize/fusion_op_register.h" #include "operators/kernel/kernels.h"
#include "operators/kernel/fusion_fc_kernel.h"
#include "operators/op_param.h" #include "operators/op_param.h"
namespace paddle_mobile { namespace paddle_mobile {
namespace operators { namespace operators {
template <typename DeviceType, typename T> template <typename DeviceType, typename T>
class FusionFcInt8Op class CastOp : public framework::OperatorWithKernel<
: public framework::OperatorWithKernel<DeviceType, DeviceType, CastParam<DeviceType>,
FusionFcParam<DeviceType>, operators::CastKernel<DeviceType, T>> {
FusionFcKernel<DeviceType, T>> {
public: public:
FusionFcInt8Op(const std::string &type, const VariableNameMap &inputs, CastOp(const std::string &type, const VariableNameMap &inputs,
const VariableNameMap &outputs, const VariableNameMap &outputs, const framework::AttributeMap &attrs,
const framework::AttributeMap &attrs, std::shared_ptr<framework::Scope> scope)
std::shared_ptr<framework::Scope> scope) : framework::OperatorWithKernel<DeviceType, CastParam<DeviceType>,
: framework::OperatorWithKernel<DeviceType, FusionFcParam<DeviceType>, operators::CastKernel<DeviceType, T>>(
FusionFcKernel<DeviceType, T>>(
type, inputs, outputs, attrs, scope) {} type, inputs, outputs, attrs, scope) {}
// inference output shape
void InferShape() const override; void InferShape() const override;
}; };
} // namespace operators } // namespace operators
} // namespace paddle_mobile } // namespace paddle_mobile
#endif // FUSION_FC_INT8_OP #endif // CAST_OP
...@@ -33,4 +33,4 @@ namespace ops = paddle_mobile::operators; ...@@ -33,4 +33,4 @@ namespace ops = paddle_mobile::operators;
REGISTER_OPERATOR_CPU(dequantize, ops::DequantizeOp); REGISTER_OPERATOR_CPU(dequantize, ops::DequantizeOp);
#endif #endif
#endif #endif // DEQUANT_OP
...@@ -44,4 +44,4 @@ class DequantizeOp ...@@ -44,4 +44,4 @@ class DequantizeOp
} // namespace operators } // namespace operators
} // namespace paddle_mobile } // namespace paddle_mobile
#endif #endif // DEQUANT_OP
...@@ -25,12 +25,11 @@ limitations under the License. */ ...@@ -25,12 +25,11 @@ limitations under the License. */
namespace paddle_mobile { namespace paddle_mobile {
namespace operators { namespace operators {
using std::string;
template <typename DeviceType, typename T> template <typename DeviceType, typename T>
class FillConstantOp : public framework::OperatorBase<DeviceType> { class FillConstantOp : public framework::OperatorBase<DeviceType> {
public: public:
FillConstantOp(const string &type, const VariableNameMap &inputs, FillConstantOp(const std::string &type, const VariableNameMap &inputs,
const VariableNameMap &outputs, const VariableNameMap &outputs,
const framework::AttributeMap attrs, const framework::AttributeMap attrs,
std::shared_ptr<framework::Scope> scope) std::shared_ptr<framework::Scope> scope)
...@@ -58,7 +57,7 @@ class FillConstantOp : public framework::OperatorBase<DeviceType> { ...@@ -58,7 +57,7 @@ class FillConstantOp : public framework::OperatorBase<DeviceType> {
tensor->Resize(framework::make_ddim(param_.Shape())); tensor->Resize(framework::make_ddim(param_.Shape()));
tensor->mutable_data(framework::ToTypeIndex(data_type)); tensor->mutable_data(framework::ToTypeIndex(data_type));
math::set_constant(tensor, value); math::SetConstant(tensor, value);
} }
void Init() {} void Init() {}
......
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef FUSION_CONVADDRELU_INT8_OP
#include "operators/fusion_conv_add_relu_int8_op.h"
#include <vector>
#include "operators/math/conv_func.h"
namespace paddle_mobile {
namespace operators {
template <typename Dtype, typename T>
void FusionConvAddReluInt8Op<Dtype, T>::InferShape() const {
auto in_dims = this->param_.Input()->dims();
auto filter_dims = this->param_.Filter()->dims();
const std::vector<int> &strides = this->param_.Strides();
std::vector<int> paddings = this->param_.Paddings();
int groups = this->param_.Groups();
std::vector<int> dilations = this->param_.Dilations();
PADDLE_MOBILE_ENFORCE((in_dims.size() == filter_dims.size() &&
dilations.size() == paddings.size() &&
paddings.size() == strides.size()),
"ConvParam is not suitable");
std::vector<int64_t> output_shape({in_dims[0], filter_dims[0]});
for (size_t i = 0; i < strides.size(); ++i) {
output_shape.push_back(
math::ConvOutputSize(in_dims[i + 2], filter_dims[i + 2], dilations[i],
paddings[i], strides[i]));
}
framework::DDim ddim = framework::make_ddim(output_shape);
this->param_.Output()->Resize(ddim);
}
} // namespace operators
} // namespace paddle_mobile
namespace ops = paddle_mobile::operators;
#ifdef PADDLE_MOBILE_CPU
REGISTER_OPERATOR_CPU_INT8(fusion_conv_add_relu_int8,
ops::FusionConvAddReluInt8Op);
#endif
#endif // FUSION_CONVADDRELU_INT8_OP
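For reference, the output shape computed by InferShape() above comes from math::ConvOutputSize, whose definition is not part of this diff. Under the usual convolution arithmetic it is expected to evaluate (an assumption stated here for readability, not a quote of conv_func.h):

$$\text{out} = \left\lfloor \frac{\text{in} + 2p - d\,(k-1) - 1}{s} \right\rfloor + 1$$

with kernel size k, padding p, dilation d and stride s, applied independently to each spatial dimension.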
...@@ -20,7 +20,7 @@ limitations under the License. */ ...@@ -20,7 +20,7 @@ limitations under the License. */
#include <vector> #include <vector>
#include "framework/operator.h" #include "framework/operator.h"
#include "framework/program/program-optimize/fusion_op_register.h" #include "framework/program/program-optimize/fusion_op_register.h"
#include "operators/kernel/dequant_add_bn_kernel.h" #include "operators/kernel/dequant_bn_kernel.h"
#include "operators/op_param.h" #include "operators/op_param.h"
namespace paddle_mobile { namespace paddle_mobile {
...@@ -43,7 +43,8 @@ class FusionDequantAddBNMatcher : public framework::FusionOpMatcher { ...@@ -43,7 +43,8 @@ class FusionDequantAddBNMatcher : public framework::FusionOpMatcher {
{{"Scale", "BNScale"}, {{"Scale", "BNScale"},
{"Mean", "BNMean"}, {"Mean", "BNMean"},
{"Bias", "BNBias"}, {"Bias", "BNBias"},
{"Variance", "BNVariance"}}}}, {"Variance", "BNVariance"},
{"Y", "Out"}}}},
removed_nodes); removed_nodes);
} }
......
...@@ -20,7 +20,7 @@ limitations under the License. */ ...@@ -20,7 +20,7 @@ limitations under the License. */
#include <vector> #include <vector>
#include "framework/operator.h" #include "framework/operator.h"
#include "framework/program/program-optimize/fusion_op_register.h" #include "framework/program/program-optimize/fusion_op_register.h"
#include "operators/kernel/dequant_bn_relu_kernel.h" #include "operators/kernel/dequant_bn_kernel.h"
#include "operators/op_param.h" #include "operators/op_param.h"
namespace paddle_mobile { namespace paddle_mobile {
...@@ -44,7 +44,8 @@ class FusionDequantAddBNReluMatcher : public framework::FusionOpMatcher { ...@@ -44,7 +44,8 @@ class FusionDequantAddBNReluMatcher : public framework::FusionOpMatcher {
{{"Scale", "BNScale"}, {{"Scale", "BNScale"},
{"Mean", "BNMean"}, {"Mean", "BNMean"},
{"Bias", "BNBias"}, {"Bias", "BNBias"},
{"Variance", "BNVariance"}}}}, {"Variance", "BNVariance"},
{"Y", "Out"}}}},
removed_nodes); removed_nodes);
} }
...@@ -54,7 +55,7 @@ class FusionDequantAddBNReluMatcher : public framework::FusionOpMatcher { ...@@ -54,7 +55,7 @@ class FusionDequantAddBNReluMatcher : public framework::FusionOpMatcher {
template <typename DeviceType, typename T> template <typename DeviceType, typename T>
class FusionDequantAddBNReluOp class FusionDequantAddBNReluOp
: public framework::OperatorWithKernel< : public framework::OperatorWithKernel<
DeviceType, FusionDequantAddBNReluParam<DeviceType>, DeviceType, FusionDequantAddBNParam<DeviceType>,
operators::FusionDequantAddBNReluKernel<DeviceType, T>> { operators::FusionDequantAddBNReluKernel<DeviceType, T>> {
public: public:
FusionDequantAddBNReluOp(const std::string &type, FusionDequantAddBNReluOp(const std::string &type,
...@@ -63,7 +64,7 @@ class FusionDequantAddBNReluOp ...@@ -63,7 +64,7 @@ class FusionDequantAddBNReluOp
const framework::AttributeMap &attrs, const framework::AttributeMap &attrs,
std::shared_ptr<framework::Scope> scope) std::shared_ptr<framework::Scope> scope)
: framework::OperatorWithKernel< : framework::OperatorWithKernel<
DeviceType, FusionDequantAddBNReluParam<DeviceType>, DeviceType, FusionDequantAddBNParam<DeviceType>,
operators::FusionDequantAddBNReluKernel<DeviceType, T>>( operators::FusionDequantAddBNReluKernel<DeviceType, T>>(
type, inputs, outputs, attrs, scope) {} type, inputs, outputs, attrs, scope) {}
// inference output shape // inference output shape
......
...@@ -12,50 +12,51 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. ...@@ -12,50 +12,51 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and See the License for the specific language governing permissions and
limitations under the License. */ limitations under the License. */
#ifdef FUSION_FC_INT8_OP #include "operators/fusion_dequant_add_bn_relu_quant_op.h"
#include "operators/fusion_fc_int8_op.h"
#ifdef FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP
namespace paddle_mobile { namespace paddle_mobile {
namespace operators { namespace operators {
template <typename Dtype, typename T> template <typename Dtype, typename T>
void FusionFcInt8Op<Dtype, T>::InferShape() const { void FusionDequantAddBNReluQuantOp<Dtype, T>::InferShape() const {
auto x_dims = this->param_.InputX()->dims(); const auto& input_dims = this->param_.input_->dims();
auto y_dims = this->param_.InputY()->dims(); this->param_.output_->Resize(input_dims);
int x_num_col_dims = this->param_.XNumColDims(); }
int y_num_col_dims = this->param_.YNumColDims();
assert(x_dims.size() > x_num_col_dims);
assert(y_dims.size() > y_num_col_dims);
/// (1,2,3,4) , x_num_col_dims = 2 -> (2,12)
auto x_mat_dims = framework::flatten_to_2d(x_dims, x_num_col_dims);
auto y_mat_dims = framework::flatten_to_2d(y_dims, y_num_col_dims);
assert(x_mat_dims[1] == y_mat_dims[0]); } // namespace operators
} // namespace paddle_mobile
std::vector<int64_t> output_dims; namespace ops = paddle_mobile::operators;
output_dims.reserve( REGISTER_FUSION_MATCHER(fusion_dequant_add_bn_relu_quant,
static_cast<size_t>(x_num_col_dims + y_dims.size() - y_num_col_dims)); ops::FusionDequantAddBNReluQuantMatcher);
for (int i = 0; i < x_num_col_dims; ++i) { #ifdef PADDLE_MOBILE_CPU
output_dims.push_back(x_dims[i]); REGISTER_OPERATOR_CPU(fusion_dequant_add_bn_relu_quant,
} ops::FusionDequantAddBNReluQuantOp);
#endif
#endif // FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP
for (int i = y_num_col_dims; i < y_dims.size(); ++i) { #ifdef FUSION_DEQUANT_ADD_BN_QUANT_OP
output_dims.push_back(y_dims[i]); namespace paddle_mobile {
} namespace operators {
framework::DDim ddim = framework::make_ddim(output_dims); template <typename Dtype, typename T>
this->param_.Out()->Resize(ddim); void FusionDequantAddBNQuantOp<Dtype, T>::InferShape() const {
const auto& input_dims = this->param_.input_->dims();
this->param_.output_->Resize(input_dims);
} }
} // namespace operators } // namespace operators
} // namespace paddle_mobile } // namespace paddle_mobile
namespace ops = paddle_mobile::operators; namespace ops = paddle_mobile::operators;
REGISTER_FUSION_MATCHER(fusion_dequant_add_bn_quant,
ops::FusionDequantAddBNQuantMatcher);
#ifdef PADDLE_MOBILE_CPU #ifdef PADDLE_MOBILE_CPU
REGISTER_OPERATOR_CPU_INT8(fusion_fc_int8, ops::FusionFcInt8Op); REGISTER_OPERATOR_CPU(fusion_dequant_add_bn_quant,
ops::FusionDequantAddBNQuantOp);
#endif #endif
#endif // FUSION_FC_INT8_OP
#endif // FUSION_DEQUANT_ADD_BN_QUANT_OP
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
#include <string>
#include <vector>
#include "framework/operator.h"
#include "framework/program/program-optimize/fusion_op_register.h"
#include "operators/kernel/dequant_bn_kernel.h"
#include "operators/op_param.h"
namespace paddle_mobile {
namespace operators {
#ifdef FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP
class FusionDequantAddBNReluQuantMatcher : public framework::FusionOpMatcher {
public:
FusionDequantAddBNReluQuantMatcher() {
node_ = framework::Node(G_OP_TYPE_DEQUANTIZE);
node_ > std::make_shared<framework::Node>(G_OP_TYPE_ELEMENTWISE_ADD) >
std::make_shared<framework::Node>(G_OP_TYPE_BATCHNORM) >
std::make_shared<framework::Node>(G_OP_TYPE_RELU) >
std::make_shared<framework::Node>(G_OP_TYPE_QUANTIZE);
}
void FolderNodes(
framework::Node *node,
std::vector<std::shared_ptr<framework::Node>> *removed_nodes) {
node->Folder(node_.Depth(), Type(),
{{G_OP_TYPE_ELEMENTWISE_ADD, {{"Y", "Y"}}},
{G_OP_TYPE_BATCHNORM,
{{"Scale", "BNScale"},
{"Mean", "BNMean"},
{"Bias", "BNBias"},
{"Variance", "BNVariance"},
{"Y", "Out"}}}},
removed_nodes);
}
std::string Type() { return G_OP_TYPE_FUSION_DEQUANT_ADD_BN_RELU_QUANT; }
};
template <typename DeviceType, typename T>
class FusionDequantAddBNReluQuantOp
: public framework::OperatorWithKernel<
DeviceType, FusionDequantAddBNReluQuantParam<DeviceType>,
operators::FusionDequantAddBNReluQuantKernel<DeviceType, T>> {
public:
FusionDequantAddBNReluQuantOp(const std::string &type,
const VariableNameMap &inputs,
const VariableNameMap &outputs,
const framework::AttributeMap &attrs,
std::shared_ptr<framework::Scope> scope)
: framework::OperatorWithKernel<
DeviceType, FusionDequantAddBNReluQuantParam<DeviceType>,
operators::FusionDequantAddBNReluQuantKernel<DeviceType, T>>(
type, inputs, outputs, attrs, scope) {}
// inference output shape
void InferShape() const override;
};
#endif // FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP
#ifdef FUSION_DEQUANT_ADD_BN_QUANT_OP
class FusionDequantAddBNQuantMatcher : public framework::FusionOpMatcher {
public:
FusionDequantAddBNQuantMatcher() {
node_ = framework::Node(G_OP_TYPE_DEQUANTIZE);
node_ > std::make_shared<framework::Node>(G_OP_TYPE_ELEMENTWISE_ADD) >
std::make_shared<framework::Node>(G_OP_TYPE_BATCHNORM) >
std::make_shared<framework::Node>(G_OP_TYPE_QUANTIZE);
}
void FolderNodes(
framework::Node *node,
std::vector<std::shared_ptr<framework::Node>> *removed_nodes) {
node->Folder(node_.Depth(), Type(),
{{G_OP_TYPE_ELEMENTWISE_ADD, {{"Y", "Y"}}},
{G_OP_TYPE_BATCHNORM,
{{"Scale", "BNScale"},
{"Mean", "BNMean"},
{"Bias", "BNBias"},
{"Variance", "BNVariance"},
{"Y", "Out"}}}},
removed_nodes);
}
std::string Type() { return G_OP_TYPE_FUSION_DEQUANT_ADD_BN_QUANT; }
};
template <typename DeviceType, typename T>
class FusionDequantAddBNQuantOp
: public framework::OperatorWithKernel<
DeviceType, FusionDequantAddBNQuantParam<DeviceType>,
operators::FusionDequantAddBNQuantKernel<DeviceType, T>> {
public:
FusionDequantAddBNQuantOp(const std::string &type,
const VariableNameMap &inputs,
const VariableNameMap &outputs,
const framework::AttributeMap &attrs,
std::shared_ptr<framework::Scope> scope)
: framework::OperatorWithKernel<
DeviceType, FusionDequantAddBNQuantParam<DeviceType>,
operators::FusionDequantAddBNQuantKernel<DeviceType, T>>(
type, inputs, outputs, attrs, scope) {}
// inference output shape
void InferShape() const override;
};
#endif // FUSION_DEQUANT_ADD_BN_QUANT_OP
} // namespace operators
} // namespace paddle_mobile
...@@ -12,28 +12,43 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. ...@@ -12,28 +12,43 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and See the License for the specific language governing permissions and
limitations under the License. */ limitations under the License. */
#ifdef FUSION_DEQUANT_BN_RELU_OP #include "operators/fusion_dequant_bn_op.h"
#include "operators/fusion_dequant_bn_relu_op.h"
namespace paddle_mobile { namespace paddle_mobile {
namespace operators { namespace operators {
#ifdef FUSION_DEQUANT_BN_OP
template <typename Dtype, typename T>
void FusionDequantBNOp<Dtype, T>::InferShape() const {
const auto& input_dims = this->param_.input_->dims();
this->param_.output_->Resize(input_dims);
}
#endif // FUSION_DEQUANT_BN_OP
#ifdef FUSION_DEQUANT_BN_RELU_OP
template <typename Dtype, typename T> template <typename Dtype, typename T>
void FusionDequantBNReluOp<Dtype, T>::InferShape() const { void FusionDequantBNReluOp<Dtype, T>::InferShape() const {
const auto& input_dims = this->param_.input_->dims(); const auto& input_dims = this->param_.input_->dims();
this->param_.output_->Resize(input_dims); this->param_.output_->Resize(input_dims);
} }
#endif // FUSION_DEQUANT_BN_RELU_OP
} // namespace operators } // namespace operators
} // namespace paddle_mobile } // namespace paddle_mobile
namespace ops = paddle_mobile::operators; namespace ops = paddle_mobile::operators;
#ifdef FUSION_DEQUANT_BN_OP
REGISTER_FUSION_MATCHER(fusion_dequant_bn, ops::FusionDequantBNMatcher);
#ifdef PADDLE_MOBILE_CPU
REGISTER_OPERATOR_CPU(fusion_dequant_bn, ops::FusionDequantBNOp);
#endif // PADDLE_MOBILE_CPU
#endif // FUSION_DEQUANT_BN_OP
#ifdef FUSION_DEQUANT_BN_RELU_OP
REGISTER_FUSION_MATCHER(fusion_dequant_bn_relu, REGISTER_FUSION_MATCHER(fusion_dequant_bn_relu,
ops::FusionDequantBNReluMatcher); ops::FusionDequantBNReluMatcher);
#ifdef PADDLE_MOBILE_CPU #ifdef PADDLE_MOBILE_CPU
REGISTER_OPERATOR_CPU(fusion_dequant_bn_relu, ops::FusionDequantBNReluOp); REGISTER_OPERATOR_CPU(fusion_dequant_bn_relu, ops::FusionDequantBNReluOp);
#endif #endif // PADDLE_MOBILE_CPU
#endif // FUSION_DEQUANT_BN_RELU_OP
#endif
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
#include <string>
#include <vector>
#include "framework/operator.h"
#include "framework/program/program-optimize/fusion_op_register.h"
#include "operators/kernel/dequant_bn_kernel.h"
#include "operators/op_param.h"
namespace paddle_mobile {
namespace operators {
#if defined(FUSION_DEQUANT_BN_OP) || defined(FUSION_DEQUANT_BN_RELU_OP)
class FusionDequantBNMatcher : public framework::FusionOpMatcher {
public:
FusionDequantBNMatcher() {
node_ = framework::Node(G_OP_TYPE_DEQUANTIZE);
node_ > std::make_shared<framework::Node>(G_OP_TYPE_BATCHNORM);
}
virtual void FolderNodes(
framework::Node *node,
std::vector<std::shared_ptr<framework::Node>> *removed_nodes) {
node->Folder(node_.Depth(), Type(),
{{G_OP_TYPE_BATCHNORM,
{{"Scale", "BNScale"},
{"Mean", "BNMean"},
{"Bias", "BNBias"},
{"Variance", "BNVariance"},
{"Y", "Out"}}}},
removed_nodes);
}
std::string Type() override { return G_OP_TYPE_FUSION_DEQUANT_BN; }
};
#endif // FUSION_DEQUANT_BN_OP || FUSION_DEQUANT_BN_RELU_OP
#ifdef FUSION_DEQUANT_BN_OP
template <typename DeviceType, typename T>
class FusionDequantBNOp : public framework::OperatorWithKernel<
DeviceType, FusionDequantBNParam<DeviceType>,
operators::FusionDequantBNKernel<DeviceType, T>> {
public:
FusionDequantBNOp(const std::string &type, const VariableNameMap &inputs,
const VariableNameMap &outputs,
const framework::AttributeMap &attrs,
std::shared_ptr<framework::Scope> scope)
: framework::OperatorWithKernel<
DeviceType, FusionDequantBNParam<DeviceType>,
operators::FusionDequantBNKernel<DeviceType, T>>(
type, inputs, outputs, attrs, scope) {}
// inference output shape
void InferShape() const override;
};
#endif // FUSION_DEQUANT_BN_OP
#ifdef FUSION_DEQUANT_BN_RELU_OP
class FusionDequantBNReluMatcher : public FusionDequantBNMatcher {
public:
FusionDequantBNReluMatcher() : FusionDequantBNMatcher() {
node_ > std::make_shared<framework::Node>(G_OP_TYPE_RELU);
}
virtual std::string Type() { return G_OP_TYPE_FUSION_DEQUANT_BN_RELU; }
};
template <typename DeviceType, typename T>
class FusionDequantBNReluOp
: public framework::OperatorWithKernel<
DeviceType, FusionDequantBNParam<DeviceType>,
operators::FusionDequantBNReluKernel<DeviceType, T>> {
public:
FusionDequantBNReluOp(const std::string &type, const VariableNameMap &inputs,
const VariableNameMap &outputs,
const framework::AttributeMap &attrs,
std::shared_ptr<framework::Scope> scope)
: framework::OperatorWithKernel<
DeviceType, FusionDequantBNParam<DeviceType>,
operators::FusionDequantBNReluKernel<DeviceType, T>>(
type, inputs, outputs, attrs, scope) {}
void InferShape() const override;
};
#endif // FUSION_DEQUANT_BN_RELU_OP
} // namespace operators
} // namespace paddle_mobile
...@@ -42,7 +42,8 @@ class FusionDequantBNReluMatcher : public framework::FusionOpMatcher { ...@@ -42,7 +42,8 @@ class FusionDequantBNReluMatcher : public framework::FusionOpMatcher {
{{"Scale", "BNScale"}, {{"Scale", "BNScale"},
{"Mean", "BNMean"}, {"Mean", "BNMean"},
{"Bias", "BNBias"}, {"Bias", "BNBias"},
{"Variance", "BNVariance"}}}}, {"Variance", "BNVariance"},
{"Y", "Out"}}}},
removed_nodes); removed_nodes);
} }
......
...@@ -14,19 +14,15 @@ limitations under the License. */ ...@@ -14,19 +14,15 @@ limitations under the License. */
#ifdef GRU_OP #ifdef GRU_OP
#include "operators/gru_op.h"
#include <vector> #include <vector>
#include "common/enforce.h" #include "common/enforce.h"
#include "operators/gru_op.h"
namespace paddle_mobile { namespace paddle_mobile {
namespace operators { namespace operators {
template <typename Dtype, typename T> template <typename Dtype, typename T>
void GruOp<Dtype, T>::InferShape() const { void GruOp<Dtype, T>::InferShape() const {
auto lod_size = this->param_.InputInput()->lod().size();
PADDLE_MOBILE_ENFORCE((lod_size == 1),
"Current LoD only supports one dimension.");
auto input_dims = this->param_.InputInput()->dims(); auto input_dims = this->param_.InputInput()->dims();
auto weight_dims = this->param_.InputWeight()->dims(); auto weight_dims = this->param_.InputWeight()->dims();
int input_size = input_dims[1]; int input_size = input_dims[1];
......
...@@ -15,6 +15,7 @@ limitations under the License. */ ...@@ -15,6 +15,7 @@ limitations under the License. */
#ifdef IM2SEQUENCE_OP #ifdef IM2SEQUENCE_OP
#include "operators/im2sequence_op.h" #include "operators/im2sequence_op.h"
#include <vector>
namespace paddle_mobile { namespace paddle_mobile {
namespace operators { namespace operators {
...@@ -29,20 +30,16 @@ int Im2SequenceOutputSize(int input_size, int kernel, int padding_1, ...@@ -29,20 +30,16 @@ int Im2SequenceOutputSize(int input_size, int kernel, int padding_1,
template <typename Dtype, typename T> template <typename Dtype, typename T>
void Im2SequenceOp<Dtype, T>::InferShape() const { void Im2SequenceOp<Dtype, T>::InferShape() const {
auto in_x_dims = this->param_.Input()->dims(); auto in_x_dims = this->param_.Input()->dims();
const std::vector<int> &kernels = this->param_.Kernels(); const std::vector<int> &kernels = this->param_.Kernels();
const std::vector<int> &strides = this->param_.Strides(); const std::vector<int> &strides = this->param_.Strides();
std::vector<int> paddings = this->param_.Paddings(); std::vector<int> paddings = this->param_.Paddings();
std::vector<int64_t> output_shape({in_x_dims[0], in_x_dims[1]}); std::vector<int64_t> output_shape({in_x_dims[0], in_x_dims[1]});
for (size_t i = 0; i < strides.size(); ++i) { for (size_t i = 0; i < strides.size(); ++i) {
output_shape.push_back(Im2SequenceOutputSize(in_x_dims[i + 2], kernels[i], output_shape.push_back(Im2SequenceOutputSize(in_x_dims[i + 2], kernels[i],
paddings[i], paddings[i + 2], paddings[i], paddings[i + 2],
strides[i])); strides[i]));
} }
framework::DDim ddim = framework::make_ddim(output_shape); framework::DDim ddim = framework::make_ddim(output_shape);
this->param_.Output()->Resize(ddim); this->param_.Output()->Resize(ddim);
} }
...@@ -54,9 +51,5 @@ namespace ops = paddle_mobile::operators; ...@@ -54,9 +51,5 @@ namespace ops = paddle_mobile::operators;
#ifdef PADDLE_MOBILE_CPU #ifdef PADDLE_MOBILE_CPU
REGISTER_OPERATOR_CPU(im2sequence, ops::Im2SequenceOp); REGISTER_OPERATOR_CPU(im2sequence, ops::Im2SequenceOp);
#endif #endif
#ifdef PADDLE_MOBILE_MALI_GPU
#endif
#ifdef PADDLE_MOBILE_FPGA
#endif
#endif #endif // IM2SEQUENCE_OP
...@@ -12,39 +12,63 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. ...@@ -12,39 +12,63 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and See the License for the specific language governing permissions and
limitations under the License. */ limitations under the License. */
#ifdef POOL_OP #ifdef CAST_OP
#pragma once
#ifdef _OPENMP
#include <omp.h>
#endif
#include <algorithm> #include <algorithm>
#include <vector> #include <vector>
#include "framework/tensor.h" #include "framework/data_type.h"
#if __ARM_NEON #include "operators/kernel/kernels.h"
#include <arm_neon.h>
#endif // __ARM_NEON
namespace paddle_mobile { namespace paddle_mobile {
namespace operators { namespace operators {
namespace math {
void Pool3x3Avgs1p1(const framework::Tensor *input, framework::Tensor *output); template <typename InT>
void Pool3x3Maxs1p1(const framework::Tensor *input, framework::Tensor *output); struct CastOutOpFunctor {
void Pool3x3Max(std::vector<int> strides, std::vector<int> paddings, const framework::Tensor* in_;
const framework::Tensor *input, framework::Tensor *output); framework::Tensor* out_;
CastOutOpFunctor(const framework::Tensor* in, framework::Tensor* out)
void Pool3x3Avg(std::vector<int> strides, std::vector<int> paddings, : in_(in), out_(out) {}
const framework::Tensor *in_x, framework::Tensor *out);
template <typename OutT>
void Pool3x3Maxs1_int8(const framework::Tensor *input, void apply() const {
framework::Tensor *output, int32_t pad_h, int32_t pad_w); const InT* input = in_->data<InT>();
void Pool3x3Maxs2_int8(const framework::Tensor *input, OutT* output = out_->mutable_data<OutT>();
framework::Tensor *output, int32_t pad_h, int32_t pad_w); size_t numel = in_->numel();
void Pool3x3Max_int8(const std::vector<int> &strides, for (int i = 0; i < numel; ++i) {
const std::vector<int> &paddings, output[i] = static_cast<OutT>(input[i]);
const framework::Tensor *input, framework::Tensor *output); }
} // namespace math }
};
struct CastOpFunctor {
const framework::Tensor* in_;
framework::Tensor* out_;
int output_type_;
CastOpFunctor(const framework::Tensor* in, framework::Tensor* out,
const int output_type)
: in_(in), out_(out), output_type_(output_type) {}
template <typename InT>
void apply() const {
framework::VisitDataType(framework::ToDataType(output_type_),
CastOutOpFunctor<InT>(in_, out_));
}
};
template <>
bool CastKernel<CPU, float>::Init(CastParam<CPU>* param) {
return true;
}
template <>
void CastKernel<CPU, float>::Compute(const CastParam<CPU>& param) {
const Tensor* input = param.input_;
Tensor* output = param.output_;
framework::VisitDataType(framework::ToDataType(param.input_type_),
CastOpFunctor(input, output, param.output_type_));
}
} // namespace operators } // namespace operators
} // namespace paddle_mobile } // namespace paddle_mobile
#endif #endif // CAST_OP
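The cast kernel above uses framework::VisitDataType twice: once on the input data type to fix InT (CastOpFunctor), and once on the output data type to fix OutT (CastOutOpFunctor). Below is a self-contained sketch of that nested dispatch; Visit(), DataType, the raw pointers and main() are illustrative stand-ins, not the paddle-mobile API.

#include <cstddef>
#include <cstdint>
#include <iostream>

enum class DataType { kInt32, kFloat32 };

template <typename Visitor>
void Visit(DataType type, Visitor &&visitor) {
  switch (type) {
    case DataType::kInt32:   visitor.template apply<int32_t>(); break;
    case DataType::kFloat32: visitor.template apply<float>();   break;
  }
}

template <typename InT>
struct CastOut {  // inner visitor: InT is fixed, OutT is picked by Visit()
  const InT *in;
  void *out;
  size_t n;
  template <typename OutT>
  void apply() const {
    OutT *o = static_cast<OutT *>(out);
    for (size_t i = 0; i < n; ++i) o[i] = static_cast<OutT>(in[i]);
  }
};

struct Cast {  // outer visitor: picks InT, then dispatches on the output type
  const void *in;
  void *out;
  size_t n;
  DataType out_type;
  template <typename InT>
  void apply() const {
    Visit(out_type, CastOut<InT>{static_cast<const InT *>(in), out, n});
  }
};

int main() {
  int32_t src[3] = {1, 2, 3};
  float dst[3] = {};
  Visit(DataType::kInt32, Cast{src, dst, 3, DataType::kFloat32});
  std::cout << dst[0] << " " << dst[1] << " " << dst[2] << "\n";  // prints 1 2 3
}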
...@@ -15,6 +15,7 @@ limitations under the License. */ ...@@ -15,6 +15,7 @@ limitations under the License. */
#ifdef FUSION_CONVADDBNRELU_OP #ifdef FUSION_CONVADDBNRELU_OP
#include "operators/kernel/conv_add_bn_relu_kernel.h" #include "operators/kernel/conv_add_bn_relu_kernel.h"
#include <math.h>
#include "operators/kernel/central-arm-func/conv_add_bn_relu_arm_func.h" #include "operators/kernel/central-arm-func/conv_add_bn_relu_arm_func.h"
namespace paddle_mobile { namespace paddle_mobile {
......
...@@ -32,20 +32,6 @@ void ConvAddReluKernel<CPU, float>::Compute( ...@@ -32,20 +32,6 @@ void ConvAddReluKernel<CPU, float>::Compute(
} }
template class ConvAddReluKernel<CPU, float>; template class ConvAddReluKernel<CPU, float>;
#ifdef FUSION_CONVADDRELU_INT8_OP
template <>
bool ConvAddReluKernel<CPU, int8_t>::Init(FusionConvAddReluParam<CPU> *param) {
return true;
}
template <>
void ConvAddReluKernel<CPU, int8_t>::Compute(
const FusionConvAddReluParam<CPU> &param) {
ConvAddReluCompute<int8_t, int32_t>(param);
}
template class ConvAddReluKernel<CPU, int8_t>;
#endif
} // namespace operators } // namespace operators
} // namespace paddle_mobile } // namespace paddle_mobile
......
...@@ -55,10 +55,9 @@ bool ConvKernel<CPU, float>::Init(ConvParam<CPU> *param) { ...@@ -55,10 +55,9 @@ bool ConvKernel<CPU, float>::Init(ConvParam<CPU> *param) {
param->Input()->dims()[2] <= 140 /* refered from ncnn */) { param->Input()->dims()[2] <= 140 /* refered from ncnn */) {
param->ExecMode() = ConvParam<CPU>::EXEC_WINOGRAD3X3_FLOAT; param->ExecMode() = ConvParam<CPU>::EXEC_WINOGRAD3X3_FLOAT;
// transform weight // transform weight
framework::Tensor transformed_weight; param->transformed_filter_ = new framework::Tensor;
operators::math::winograd_transform_weight<8, 3>(*param->Filter(), operators::math::winograd_transform_weight<8, 3>(
&transformed_weight); *param->Filter(), param->transformed_filter_);
framework::TensorCopy(transformed_weight, param->Filter());
#endif #endif
} else { } else {
param->ExecMode() = ConvParam<CPU>::EXEC_GEMM_FLOAT; param->ExecMode() = ConvParam<CPU>::EXEC_GEMM_FLOAT;
......
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef FUSION_DEQUANT_ADD_BN_OP
#include "operators/kernel/dequant_add_bn_kernel.h"
#include <cmath>
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>
#endif
namespace paddle_mobile {
namespace operators {
template <>
bool FusionDequantAddBNKernel<CPU, float>::Init(
FusionDequantAddBNParam<CPU> *param) {
// elementwise add params
const Tensor *bias = param->bias_;
// batch norm params
const Tensor *bn_mean = param->bn_mean_;
const Tensor *bn_variance = param->bn_variance_;
Tensor *bn_scale = param->bn_scale_;
Tensor *bn_bias = param->bn_bias_;
const float epsilon = param->epsilon_;
const float *bias_ptr = bias->data<float>();
const float *mean_ptr = bn_mean->data<float>();
const float *var_ptr = bn_variance->data<float>();
float *bn_scale_ptr = bn_scale->mutable_data<float>();
float *bn_bias_ptr = bn_bias->mutable_data<float>();
for (int c = 0; c < bn_scale->numel(); ++c) {
float inv_scale = bn_scale_ptr[c] / (std::sqrt(var_ptr[c] + epsilon));
bn_scale_ptr[c] = inv_scale;
bn_bias_ptr[c] = inv_scale * (bias_ptr[c] - mean_ptr[c]) + bn_bias_ptr[c];
}
return true;
}
template <>
void FusionDequantAddBNKernel<CPU, float>::Compute(
const FusionDequantAddBNParam<CPU> &param) {
const int32_t *input = param.input_->data<int32_t>();
const float *bn_scale = param.bn_scale_->data<float>();
const float *bn_bias = param.bn_bias_->data<float>();
// dequantize params
const float activation_scale = param.activation_scale_->data<float>()[0];
const float weight_scale = param.weight_scale_;
const float dequant_scale = activation_scale / weight_scale;
float *output = param.output_->mutable_data<float>();
int batch_size = param.input_->dims()[0];
int channels = param.input_->dims()[1];
size_t spatial_size = param.input_->dims()[2] * param.input_->dims()[3];
#pragma omp parallel for collapse(2)
for (int batch = 0; batch < batch_size; ++batch) {
for (int c = 0; c < channels; ++c) {
float scale = bn_scale[c] * dequant_scale;
float bias = bn_bias[c];
size_t offset = (batch * channels + c) * spatial_size;
const int32_t *x = input + offset;
float *y = output + offset;
size_t remain = spatial_size;
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
int loop = spatial_size >> 4;
remain = spatial_size & 0xF;
float32x4_t __scale = vdupq_n_f32(scale);
float32x4_t __bias = vdupq_n_f32(bias);
for (int k = 0; k < loop; ++k, x += 16, y += 16) {
int32x4_t r0 = vld1q_s32(x);
int32x4_t r1 = vld1q_s32(x + 4);
int32x4_t r2 = vld1q_s32(x + 8);
int32x4_t r3 = vld1q_s32(x + 12);
float32x4_t f0 = vcvtq_f32_s32(r0);
float32x4_t f1 = vcvtq_f32_s32(r1);
float32x4_t f2 = vcvtq_f32_s32(r2);
float32x4_t f3 = vcvtq_f32_s32(r3);
f0 = vmlaq_f32(__bias, __scale, f0);
f1 = vmlaq_f32(__bias, __scale, f1);
f2 = vmlaq_f32(__bias, __scale, f2);
f3 = vmlaq_f32(__bias, __scale, f3);
vst1q_f32(y, f0);
vst1q_f32(y + 4, f1);
vst1q_f32(y + 8, f2);
vst1q_f32(y + 12, f3);
}
#endif // __ARM_NEON__
for (int k = 0; k < remain; ++k) {
y[k] = scale * x[k] + bias;
}
}
}
}
} // namespace operators
} // namespace paddle_mobile
#endif // FUSION_DEQUANT_ADD_BN_OP
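Spelled out, Init() above folds the elementwise-add bias and the batch-norm statistics into one per-channel affine transform. With BN scale $\gamma_c$, bias $\beta_c$, mean $\mu_c$, variance $\sigma^2_c$ and elementwise-add bias $b_c$, it stores

$$\hat\gamma_c = \frac{\gamma_c}{\sqrt{\sigma^2_c + \varepsilon}}, \qquad \hat\beta_c = \hat\gamma_c\,(b_c - \mu_c) + \beta_c,$$

so that Compute() only evaluates $y = \hat\gamma_c\,(s_{act}/s_{weight})\,x + \hat\beta_c$ per element, which is exactly the scale/bias pair broadcast across the NEON loop.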
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "operators/kernel/dequant_bn_relu_kernel.h"
#include <cmath>
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>
#endif
namespace paddle_mobile {
namespace operators {
#if defined(FUSION_DEQUANT_BN_RELU_OP) || defined(FUSION_DEQUANT_ADD_BN_RELU_OP)
void DequantBNReluCompute(const FusionDequantBNParam<CPU> *param) {
const int32_t *input = param->input_->data<int32_t>();
const float *bn_scale = param->bn_scale_->data<float>();
const float *bn_bias = param->bn_bias_->data<float>();
// dequantize params
const float activation_scale = param->activation_scale_->data<float>()[0];
const float weight_scale = param->weight_scale_;
const float dequant_scale = activation_scale / weight_scale;
float *output = param->output_->mutable_data<float>();
int batch_size = param->input_->dims()[0];
int channels = param->input_->dims()[1];
size_t spatial_size = param->input_->dims()[2] * param->input_->dims()[3];
#pragma omp parallel for collapse(2)
for (int batch = 0; batch < batch_size; ++batch) {
for (int c = 0; c < channels; ++c) {
float scale = bn_scale[c] * dequant_scale;
float bias = bn_bias[c];
size_t offset = (batch * channels + c) * spatial_size;
const int32_t *x = input + offset;
float *y = output + offset;
size_t remain = spatial_size;
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
int loop = spatial_size >> 4;
remain = spatial_size & 0xF;
float32x4_t __scale = vdupq_n_f32(scale);
float32x4_t __bias = vdupq_n_f32(bias);
float32x4_t __zero = vdupq_n_f32(0.f);
for (int k = 0; k < loop; ++k, x += 16, y += 16) {
int32x4_t r0 = vld1q_s32(x);
int32x4_t r1 = vld1q_s32(x + 4);
int32x4_t r2 = vld1q_s32(x + 8);
int32x4_t r3 = vld1q_s32(x + 12);
float32x4_t f0 = vcvtq_f32_s32(r0);
float32x4_t f1 = vcvtq_f32_s32(r1);
float32x4_t f2 = vcvtq_f32_s32(r2);
float32x4_t f3 = vcvtq_f32_s32(r3);
f0 = vmlaq_f32(__bias, __scale, f0);
f1 = vmlaq_f32(__bias, __scale, f1);
f2 = vmlaq_f32(__bias, __scale, f2);
f3 = vmlaq_f32(__bias, __scale, f3);
f0 = vmaxq_f32(__zero, f0);
f1 = vmaxq_f32(__zero, f1);
f2 = vmaxq_f32(__zero, f2);
f3 = vmaxq_f32(__zero, f3);
vst1q_f32(y, f0);
vst1q_f32(y + 4, f1);
vst1q_f32(y + 8, f2);
vst1q_f32(y + 12, f3);
}
#endif // __ARM_NEON__
for (int k = 0; k < remain; ++k) {
y[k] = std::max(scale * x[k] + bias, 0.f);
}
}
}
}
#endif
#ifdef FUSION_DEQUANT_BN_RELU_OP
template <>
bool FusionDequantBNReluKernel<CPU, float>::Init(
FusionDequantBNReluParam<CPU> *param) {
// batch norm params
const Tensor *bn_mean = param->bn_mean_;
const Tensor *bn_variance = param->bn_variance_;
Tensor *bn_scale = param->bn_scale_;
Tensor *bn_bias = param->bn_bias_;
const float epsilon = param->epsilon_;
const float *mean_ptr = bn_mean->data<float>();
const float *var_ptr = bn_variance->data<float>();
float *bn_scale_ptr = bn_scale->mutable_data<float>();
float *bn_bias_ptr = bn_bias->mutable_data<float>();
for (int c = 0; c < bn_scale->numel(); ++c) {
float inv_scale = bn_scale_ptr[c] / (std::sqrt(var_ptr[c] + epsilon));
bn_scale_ptr[c] = inv_scale;
bn_bias_ptr[c] = bn_bias_ptr[c] - inv_scale * mean_ptr[c];
}
return true;
}
template <>
void FusionDequantBNReluKernel<CPU, float>::Compute(
const FusionDequantBNReluParam<CPU> &param) {
DequantBNReluCompute(&param);
}
#endif // FUSION_DEQUANT_BN_RELU_OP
#ifdef FUSION_DEQUANT_ADD_BN_RELU_OP
template <>
bool FusionDequantAddBNReluKernel<CPU, float>::Init(
FusionDequantAddBNReluParam<CPU> *param) {
// elementwise add params
const Tensor *bias = param->bias_;
// batch norm params
const Tensor *bn_mean = param->bn_mean_;
const Tensor *bn_variance = param->bn_variance_;
Tensor *bn_scale = param->bn_scale_;
Tensor *bn_bias = param->bn_bias_;
const float epsilon = param->epsilon_;
const float *bias_ptr = bias->data<float>();
const float *mean_ptr = bn_mean->data<float>();
const float *var_ptr = bn_variance->data<float>();
float *bn_scale_ptr = bn_scale->mutable_data<float>();
float *bn_bias_ptr = bn_bias->mutable_data<float>();
for (int c = 0; c < bn_scale->numel(); ++c) {
float inv_scale = bn_scale_ptr[c] / (std::sqrt(var_ptr[c] + epsilon));
bn_scale_ptr[c] = inv_scale;
bn_bias_ptr[c] = inv_scale * (bias_ptr[c] - mean_ptr[c]) + bn_bias_ptr[c];
}
return true;
}
template <>
void FusionDequantAddBNReluKernel<CPU, float>::Compute(
const FusionDequantAddBNReluParam<CPU> &param) {
DequantBNReluCompute(&param);
}
#endif // FUSION_DEQUANT_ADD_BN_RELU_OP
} // namespace operators
} // namespace paddle_mobile
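A plain scalar reference of the fused dequantize + batch-norm + ReLU path above can be handy when checking the NEON code on small inputs. The helper below is hypothetical (single image, NCHW layout, no framework types) and is not part of this diff.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// x holds channels * spatial_size int32 values in NCHW order; bn_scale and
// bn_bias are the per-channel values produced by the Init() functions above.
std::vector<float> DequantBNReluReference(const std::vector<int32_t> &x,
                                          float dequant_scale,
                                          const std::vector<float> &bn_scale,
                                          const std::vector<float> &bn_bias,
                                          size_t channels,
                                          size_t spatial_size) {
  std::vector<float> y(x.size());
  for (size_t c = 0; c < channels; ++c) {
    for (size_t i = 0; i < spatial_size; ++i) {
      const size_t idx = c * spatial_size + i;
      const float v = bn_scale[c] * (dequant_scale * x[idx]) + bn_bias[c];
      y[idx] = std::max(v, 0.f);  // ReLU
    }
  }
  return y;
}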
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <cmath>
#include "operators/kernel/dequant_bn_kernel.h"
#include "operators/math/activation.h"
#include "operators/math/quantize.h"
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>
#endif
namespace paddle_mobile {
namespace operators {
#if defined(FUSION_DEQUANT_BN_OP) || defined(FUSION_DEQUANT_ADD_BN_OP) || \
defined(FUSION_DEQUANT_BN_RELU_OP) || \
defined(FUSION_DEQUANT_ADD_BN_RELU_OP) || \
defined(FUSION_DEQUANT_ADD_BN_QUANT_OP) || \
defined(FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP)
void PublicFusionDequantBNInitParam(FusionDequantBNParam<CPU> *param,
const framework::Tensor *bias) {
// batch norm params
const Tensor *bn_mean = param->bn_mean_;
const Tensor *bn_variance = param->bn_variance_;
Tensor *bn_scale = param->bn_scale_;
Tensor *bn_bias = param->bn_bias_;
const float epsilon = param->epsilon_;
const float *mean_ptr = bn_mean->data<float>();
const float *var_ptr = bn_variance->data<float>();
float *bn_scale_ptr = bn_scale->mutable_data<float>();
float *bn_bias_ptr = bn_bias->mutable_data<float>();
for (int c = 0; c < bn_scale->numel(); ++c) {
float inv_scale = 1.f / (std::sqrt(var_ptr[c] + epsilon));
float val = bias ? bias->data<float>()[c] : 0;
bn_bias_ptr[c] =
inv_scale * bn_scale_ptr[c] * (val - mean_ptr[c]) + bn_bias_ptr[c];
bn_scale_ptr[c] = inv_scale * bn_scale_ptr[c];
}
}
#endif
#if defined(FUSION_DEQUANT_BN_OP) || defined(FUSION_DEQUANT_ADD_BN_OP) || \
defined(FUSION_DEQUANT_BN_RELU_OP) || \
defined(FUSION_DEQUANT_ADD_BN_RELU_OP)
template <ActivationType Act>
void DequantBNCompute(const FusionDequantBNParam<CPU> *param) {
const int32_t *input = param->input_->data<int32_t>();
const float *bn_scale = param->bn_scale_->data<float>();
const float *bn_bias = param->bn_bias_->data<float>();
// dequantize params
const float activation_scale = param->activation_scale_->data<float>()[0];
const float weight_scale = param->weight_scale_;
const float dequant_scale = activation_scale / weight_scale;
float *output = param->output_->mutable_data<float>();
int batch_size = param->input_->dims()[0];
int channels = param->input_->dims()[1];
size_t spatial_size = param->input_->dims()[2] * param->input_->dims()[3];
#pragma omp parallel for collapse(2)
for (int batch = 0; batch < batch_size; ++batch) {
for (int c = 0; c < channels; ++c) {
// not fuse bn and dequant scale to minimize precision difference
// float scale = bn_scale[c] * dequant_scale;
float scale = bn_scale[c];
float bias = bn_bias[c];
size_t offset = (batch * channels + c) * spatial_size;
const int32_t *x = input + offset;
float *y = output + offset;
size_t remain = spatial_size;
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
int loop = spatial_size >> 4;
remain = spatial_size & 0xF;
float32x4_t __dequant_scale = vdupq_n_f32(dequant_scale);
float32x4_t __scale = vdupq_n_f32(scale);
float32x4_t __bias = vdupq_n_f32(bias);
for (int k = 0; k < loop; ++k, x += 16, y += 16) {
int32x4_t r0 = vld1q_s32(x);
int32x4_t r1 = vld1q_s32(x + 4);
int32x4_t r2 = vld1q_s32(x + 8);
int32x4_t r3 = vld1q_s32(x + 12);
float32x4_t f0 = vcvtq_f32_s32(r0);
float32x4_t f1 = vcvtq_f32_s32(r1);
float32x4_t f2 = vcvtq_f32_s32(r2);
float32x4_t f3 = vcvtq_f32_s32(r3);
f0 = vmulq_f32(__dequant_scale, f0);
f1 = vmulq_f32(__dequant_scale, f1);
f2 = vmulq_f32(__dequant_scale, f2);
f3 = vmulq_f32(__dequant_scale, f3);
f0 = vmlaq_f32(__bias, __scale, f0);
f1 = vmlaq_f32(__bias, __scale, f1);
f2 = vmlaq_f32(__bias, __scale, f2);
f3 = vmlaq_f32(__bias, __scale, f3);
f0 = math::vActiveq_f32<Act>(f0);
f1 = math::vActiveq_f32<Act>(f1);
f2 = math::vActiveq_f32<Act>(f2);
f3 = math::vActiveq_f32<Act>(f3);
vst1q_f32(y, f0);
vst1q_f32(y + 4, f1);
vst1q_f32(y + 8, f2);
vst1q_f32(y + 12, f3);
}
#endif // __ARM_NEON__
for (int k = 0; k < remain; ++k) {
y[k] = math::Active<Act>(scale * (dequant_scale * x[k]) + bias);
}
}
}
}
#endif
#ifdef FUSION_DEQUANT_BN_OP
template <>
bool FusionDequantBNKernel<CPU, float>::Init(FusionDequantBNParam<CPU> *param) {
PublicFusionDequantBNInitParam(param, nullptr);
return true;
}
template <>
void FusionDequantBNKernel<CPU, float>::Compute(
const FusionDequantBNParam<CPU> &param) {
DequantBNCompute<IDENTITY>(&param);
}
#endif // FUSION_DEQUANT_BN_OP
#ifdef FUSION_DEQUANT_BN_RELU_OP
template <>
bool FusionDequantBNReluKernel<CPU, float>::Init(
FusionDequantBNParam<CPU> *param) {
PublicFusionDequantBNInitParam(param, nullptr);
return true;
}
template <>
void FusionDequantBNReluKernel<CPU, float>::Compute(
const FusionDequantBNParam<CPU> &param) {
DequantBNCompute<RELU>(&param);
}
#endif // FUSION_DEQUANT_BN_RELU_OP
#ifdef FUSION_DEQUANT_ADD_BN_OP
template <>
bool FusionDequantAddBNKernel<CPU, float>::Init(
FusionDequantAddBNParam<CPU> *param) {
const framework::Tensor *bias = param->bias_;
PublicFusionDequantBNInitParam(param, bias);
return true;
}
template <>
void FusionDequantAddBNKernel<CPU, float>::Compute(
const FusionDequantAddBNParam<CPU> &param) {
DequantBNCompute<IDENTITY>(&param);
}
#endif // FUSION_DEQUANT_ADD_BN_OP
#ifdef FUSION_DEQUANT_ADD_BN_RELU_OP
template <>
bool FusionDequantAddBNReluKernel<CPU, float>::Init(
FusionDequantAddBNParam<CPU> *param) {
const framework::Tensor *bias = param->bias_;
PublicFusionDequantBNInitParam(param, bias);
return true;
}
template <>
void FusionDequantAddBNReluKernel<CPU, float>::Compute(
const FusionDequantAddBNParam<CPU> &param) {
DequantBNCompute<RELU>(&param);
}
#endif // FUSION_DEQUANT_ADD_BN_RELU_OP
#if defined(FUSION_DEQUANT_ADD_BN_QUANT_OP) || \
defined(FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP)
template <ActivationType Act, RoundType R>
void DequantBNQuantCompute(const FusionDequantAddBNQuantParam<CPU> *param) {
const int32_t *input = param->input_->data<int32_t>();
const float *bn_scale = param->bn_scale_->data<float>();
const float *bn_bias = param->bn_bias_->data<float>();
// dequantize params
const float activation_scale = param->activation_scale_->data<float>()[0];
const float weight_scale = param->weight_scale_;
const float dequant_scale = activation_scale / weight_scale;
// quantize params
Tensor *output_scale = param->online_scale_;
float max_abs = 0.f;
int8_t *output = param->output_->mutable_data<int8_t>();
int batch_size = param->input_->dims()[0];
int channels = param->input_->dims()[1];
size_t spatial_size = param->input_->dims()[2] * param->input_->dims()[3];
// if (param->is_static_) {
if (true) {
max_abs = param->static_scale_;
float quant_scale = 127.f / max_abs;
#pragma omp parallel for collapse(2)
for (int batch = 0; batch < batch_size; ++batch) {
for (int c = 0; c < channels; ++c) {
// not fuse bn and dequant scale to minimize precision difference
// float scale = bn_scale[c] * dequant_scale;
float scale = bn_scale[c];
float bias = bn_bias[c];
size_t offset = (batch * channels + c) * spatial_size;
const int32_t *x = input + offset;
int8_t *y = output + offset;
size_t remain = spatial_size;
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
int loop = spatial_size >> 4;
remain = spatial_size & 0xF;
float32x4_t __dequant_scale = vdupq_n_f32(dequant_scale);
float32x4_t __scale = vdupq_n_f32(scale);
float32x4_t __bias = vdupq_n_f32(bias);
float32x4_t __quant_scale = vdupq_n_f32(quant_scale);
for (int k = 0; k < loop; ++k, x += 16, y += 16) {
int32x4_t r0 = vld1q_s32(x);
int32x4_t r1 = vld1q_s32(x + 4);
int32x4_t r2 = vld1q_s32(x + 8);
int32x4_t r3 = vld1q_s32(x + 12);
float32x4_t f0 = vcvtq_f32_s32(r0);
float32x4_t f1 = vcvtq_f32_s32(r1);
float32x4_t f2 = vcvtq_f32_s32(r2);
float32x4_t f3 = vcvtq_f32_s32(r3);
f0 = vmulq_f32(__dequant_scale, f0);
f1 = vmulq_f32(__dequant_scale, f1);
f2 = vmulq_f32(__dequant_scale, f2);
f3 = vmulq_f32(__dequant_scale, f3);
f0 = vmlaq_f32(__bias, __scale, f0);
f1 = vmlaq_f32(__bias, __scale, f1);
f2 = vmlaq_f32(__bias, __scale, f2);
f3 = vmlaq_f32(__bias, __scale, f3);
f0 = math::vActiveq_f32<Act>(f0);
f1 = math::vActiveq_f32<Act>(f1);
f2 = math::vActiveq_f32<Act>(f2);
f3 = math::vActiveq_f32<Act>(f3);
f0 = vmulq_f32(__quant_scale, f0);
f1 = vmulq_f32(__quant_scale, f1);
f2 = vmulq_f32(__quant_scale, f2);
f3 = vmulq_f32(__quant_scale, f3);
int32x4_t q0 = math::vRoundq_f32<R>(f0);
int32x4_t q1 = math::vRoundq_f32<R>(f1);
int32x4_t q2 = math::vRoundq_f32<R>(f2);
int32x4_t q3 = math::vRoundq_f32<R>(f3);
int16x4_t d0 = vmovn_s32(q0);
int16x4_t d1 = vmovn_s32(q1);
int16x4_t d2 = vmovn_s32(q2);
int16x4_t d3 = vmovn_s32(q3);
int16x8_t q5 = vcombine_s16(d0, d1);
int16x8_t q6 = vcombine_s16(d2, d3);
int8x8_t d5 = vmovn_s16(q5);
int8x8_t d6 = vmovn_s16(q6);
vst1_s8(y, d5);
vst1_s8(y + 8, d6);
}
#endif // __ARM_NEON__
for (int k = 0; k < remain; ++k) {
float x_temp =
math::Active<Act>(scale * (dequant_scale * x[k]) + bias);
y[k] = math::Round<R>(x_temp * quant_scale);
}
}
}
} else {
// TODO(hjchen2)
max_abs = std::max(max_abs, 1e-6f);
}
param->online_scale_->mutable_data<float>()[0] = max_abs;
}
template <>
bool FusionDequantAddBNQuantKernel<CPU, float>::Init(
FusionDequantAddBNQuantParam<CPU> *param) {
const framework::Tensor *bias = param->bias_;
PublicFusionDequantBNInitParam(param, bias);
return true;
}
template <>
void FusionDequantAddBNQuantKernel<CPU, float>::Compute(
const FusionDequantAddBNQuantParam<CPU> &param) {
switch (param.round_type_) {
case ROUND_NEAREST_TO_EVEN:
DequantBNQuantCompute<IDENTITY, ROUND_NEAREST_TO_EVEN>(&param);
break;
case ROUND_NEAREST_TOWARDS_ZERO:
DequantBNQuantCompute<IDENTITY, ROUND_NEAREST_TOWARDS_ZERO>(&param);
break;
case ROUND_NEAREST_AWAY_ZERO:
DequantBNQuantCompute<IDENTITY, ROUND_NEAREST_AWAY_ZERO>(&param);
break;
default:
LOG(kLOG_ERROR) << "round type is not supported.";
break;
}
}
#endif // FUSION_DEQUANT_ADD_BN_QUANT_OP
#ifdef FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP
template <>
bool FusionDequantAddBNReluQuantKernel<CPU, float>::Init(
FusionDequantAddBNQuantParam<CPU> *param) {
const framework::Tensor *bias = param->bias_;
PublicFusionDequantBNInitParam(param, bias);
return true;
}
template <>
void FusionDequantAddBNReluQuantKernel<CPU, float>::Compute(
const FusionDequantAddBNQuantParam<CPU> &param) {
switch (param.round_type_) {
case ROUND_NEAREST_TO_EVEN:
DequantBNQuantCompute<RELU, ROUND_NEAREST_TO_EVEN>(&param);
break;
case ROUND_NEAREST_TOWARDS_ZERO:
DequantBNQuantCompute<RELU, ROUND_NEAREST_TOWARDS_ZERO>(&param);
break;
case ROUND_NEAREST_AWAY_ZERO:
DequantBNQuantCompute<RELU, ROUND_NEAREST_AWAY_ZERO>(&param);
break;
default:
LOG(kLOG_ERROR) << "round type is not supported.";
break;
}
}
#endif // FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP
} // namespace operators
} // namespace paddle_mobile
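In the static-scale branch of DequantBNQuantCompute above, the re-quantization after the folded batch-norm reduces to, per element,

$$q = \mathrm{Round}_R\!\left(\frac{127}{\texttt{max\_abs}}\cdot \mathrm{Act}\big(\hat\gamma_c\,(s_{act}/s_{weight})\,x + \hat\beta_c\big)\right),$$

where Round_R is one of the three rounding modes dispatched in Compute() and max_abs is taken from static_scale_. Note that the vmovn_s32/vmovn_s16 narrowing in the NEON path (and the plain int8_t assignment in the scalar tail) does not saturate, so the chosen scale is assumed to keep results inside the int8 range.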
...@@ -30,8 +30,8 @@ bool DequantizeKernel<CPU, float>::Init(DequantizeParam<CPU> *param) { ...@@ -30,8 +30,8 @@ bool DequantizeKernel<CPU, float>::Init(DequantizeParam<CPU> *param) {
template <> template <>
void DequantizeKernel<CPU, float>::Compute(const DequantizeParam<CPU> &param) { void DequantizeKernel<CPU, float>::Compute(const DequantizeParam<CPU> &param) {
const Tensor *input = param.input_; const LoDTensor *input = param.input_;
Tensor *output = param.output_; LoDTensor *output = param.output_;
float activation_scale = param.activation_scale_->data<float>()[0]; float activation_scale = param.activation_scale_->data<float>()[0];
float weight_scale = param.weight_scale_; float weight_scale = param.weight_scale_;
const int32_t *x = input->data<const int32_t>(); const int32_t *x = input->data<const int32_t>();
...@@ -72,6 +72,7 @@ void DequantizeKernel<CPU, float>::Compute(const DequantizeParam<CPU> &param) { ...@@ -72,6 +72,7 @@ void DequantizeKernel<CPU, float>::Compute(const DequantizeParam<CPU> &param) {
for (size_t i = 0; i < size; ++i) { for (size_t i = 0; i < size; ++i) {
y[i] = x[i] * scale; y[i] = x[i] * scale;
} }
output->set_lod(input->lod());
} }
} // namespace operators } // namespace operators
......
...@@ -29,12 +29,6 @@ template <> ...@@ -29,12 +29,6 @@ template <>
void GruKernel<CPU, float>::Compute(const GruParam<CPU> &param) { void GruKernel<CPU, float>::Compute(const GruParam<CPU> &param) {
GruCompute<float>(param); GruCompute<float>(param);
param.OutHidden()->set_lod(param.InputInput()->lod()); param.OutHidden()->set_lod(param.InputInput()->lod());
// DLOG << "________________" << param.OutHidden()->dims();
// DLOG << "________________" << param.OutHidden()->numel();
// auto *hiden_data = param.OutHidden()->data<float>();
// for (int64_t i = 0; i < 10; i++) {
// DLOG << "****************" << hiden_data[i];
// }
} }
template class GruKernel<CPU, float>; template class GruKernel<CPU, float>;
......
...@@ -15,11 +15,56 @@ limitations under the License. */ ...@@ -15,11 +15,56 @@ limitations under the License. */
#ifdef RELU_OP #ifdef RELU_OP
#include "operators/kernel/relu_kernel.h" #include "operators/kernel/relu_kernel.h"
#include "operators/kernel/central-arm-func/relu_arm_func.h" #include "common/types.h"
#include "operators/math/activation.h"
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>
#endif
namespace paddle_mobile { namespace paddle_mobile {
namespace operators { namespace operators {
template <typename Dtype, ActivationType Act>
struct ReluCompute {
void operator()(const Tensor *input, Tensor *output) {}
};
template <ActivationType Act>
struct ReluCompute<float, Act> {
void operator()(const Tensor *input, Tensor *output) {
const float *x = input->data<float>();
float *y = output->mutable_data<float>();
size_t remain = input->numel();
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
size_t loop = remain >> 4;
remain = remain & 0xF;
#pragma omp parallel for
for (size_t i = 0; i < loop; ++i) {
const float *local_x = x + (i << 4);
float *local_y = y + (i << 4);
float32x4_t r0 = vld1q_f32(local_x);
float32x4_t r1 = vld1q_f32(local_x + 4);
float32x4_t r2 = vld1q_f32(local_x + 8);
float32x4_t r3 = vld1q_f32(local_x + 12);
r0 = math::vActiveq_f32<Act>(r0);
r1 = math::vActiveq_f32<Act>(r1);
r2 = math::vActiveq_f32<Act>(r2);
r3 = math::vActiveq_f32<Act>(r3);
vst1q_f32(local_y, r0);
vst1q_f32(local_y + 4, r1);
vst1q_f32(local_y + 8, r2);
vst1q_f32(local_y + 12, r3);
}
x += (loop << 4);
y += (loop << 4);
#endif
for (size_t i = 0; i < remain; ++i) {
y[i] = math::Active<Act>(x[i]);
}
}
};
template <> template <>
bool ReluKernel<CPU, float>::Init(ReluParam<CPU> *param) { bool ReluKernel<CPU, float>::Init(ReluParam<CPU> *param) {
return true; return true;
...@@ -27,7 +72,21 @@ bool ReluKernel<CPU, float>::Init(ReluParam<CPU> *param) { ...@@ -27,7 +72,21 @@ bool ReluKernel<CPU, float>::Init(ReluParam<CPU> *param) {
template <> template <>
void ReluKernel<CPU, float>::Compute(const ReluParam<CPU> &param) { void ReluKernel<CPU, float>::Compute(const ReluParam<CPU> &param) {
ReluCompute<float>(param); const Tensor *input = param.InputX();
Tensor *output = param.Out();
ReluCompute<float, RELU>()(input, output);
}
template <>
bool Relu6Kernel<CPU, float>::Init(ReluParam<CPU> *param) {
return true;
}
template <>
void Relu6Kernel<CPU, float>::Compute(const ReluParam<CPU> &param) {
const Tensor *input = param.InputX();
Tensor *output = param.Out();
ReluCompute<float, RELU6>()(input, output);
} }
} // namespace operators } // namespace operators
......
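The new Relu6Kernel above reuses the same ReluCompute functor with Act = RELU6. Since operators/math/activation.h is not part of this diff, the scalar definitions assumed here are:

#include <algorithm>

// Assumed scalar forms of the two activations dispatched above.
inline float relu(float x) { return std::max(x, 0.f); }
inline float relu6(float x) { return std::min(std::max(x, 0.f), 6.f); }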
...@@ -13,8 +13,9 @@ See the License for the specific language governing permissions and ...@@ -13,8 +13,9 @@ See the License for the specific language governing permissions and
limitations under the License. */ limitations under the License. */
#ifdef FUSION_CONVADDADDPRELU_OP #ifdef FUSION_CONVADDADDPRELU_OP
#pragma once #pragma once
#include <string>
#include <vector> #include <vector>
#include "operators/math/conv_func.h" #include "operators/math/conv_func.h"
#include "operators/math/im2col.h" #include "operators/math/im2col.h"
...@@ -115,20 +116,7 @@ void ConvAddAddPReluCompute(const FusionConvAddAddPReluParam<CPU> &param) { ...@@ -115,20 +116,7 @@ void ConvAddAddPReluCompute(const FusionConvAddAddPReluParam<CPU> &param) {
Tensor filter_slice = filter.Slice(g * out_step, (g + 1) * out_step); Tensor filter_slice = filter.Slice(g * out_step, (g + 1) * out_step);
Tensor bias1_slice = bias1_batch.Slice(g * out_step, (g + 1) * out_step); Tensor bias1_slice = bias1_batch.Slice(g * out_step, (g + 1) * out_step);
float *biase_data1 = bias1_slice.data<float>(); float *biase_data1 = bias1_slice.data<float>();
// int n = bias1_slice.dims()[0]; math::MatMulWithPRelu(filter_slice, false, col_matrix, false, &out_slice,
// int m = bias1_slice.dims()[1];
// for(int i=0;i<n*m;i++){
// if(biase_data1[i]!=0)
// DLOG<<biase_data1[i]<<",yangfei";
// }
// math::matmul<float>(filter_slice, false, col_matrix,
// false,
// static_cast<float>(1),
// &out_slice,
// static_cast<float>(1), true,
// biase_data);
math::matmulWithPRelu(filter_slice, false, col_matrix, false, &out_slice,
p, mode, biase_data, biase_data1); p, mode, biase_data, biase_data1);
} }
} }
...@@ -137,4 +125,4 @@ void ConvAddAddPReluCompute(const FusionConvAddAddPReluParam<CPU> &param) { ...@@ -137,4 +125,4 @@ void ConvAddAddPReluCompute(const FusionConvAddAddPReluParam<CPU> &param) {
} // namespace operators } // namespace operators
} // namespace paddle_mobile } // namespace paddle_mobile
#endif #endif // FUSION_CONVADDADDPRELU_OP
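For reference, the PReLU that math::MatMulWithPRelu applies on top of the GEMM above is conventionally

$$\mathrm{PReLU}(x) = \max(0, x) + \alpha\,\min(0, x),$$

with alpha shared, taken per channel, or taken per element depending on the mode string passed in; the exact handling lives in the math library and is not shown in this diff.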
(The diffs of the remaining changed files in this commit are collapsed and not shown.)