Commit 832e763e authored by Ray Liu, committed by GitHub

Merge branch 'develop' into develop

......@@ -10,6 +10,7 @@ option(LOG_PROFILE "log profile" OFF)
option(CPU "armv7 with neon" ON)
option(GPU_MALI "mali gpu" OFF)
option(GPU_CL "opencl gpu" OFF)
option(FPGA "fpga" OFF)
if(FPGA)
option(FPGAV1 "fpga v1" ON)
......@@ -144,7 +145,7 @@ if(FPGA)
endforeach()
file(GLOB_RECURSE _tmp_list src/operators/kernel/fpga/V2/*.h src/fpga/V2/*.h)
foreach(f ${_tmp_list})
list(REMOVE_ITEM PADDLE_MOBILE_CC ${f})
list(REMOVE_ITEM PADDLE_MOBILE_H ${f})
endforeach()
endif()
if(FPGAV2)
......@@ -156,7 +157,7 @@ if(FPGA)
endforeach()
file(GLOB_RECURSE _tmp_list src/operators/kernel/fpga/V1/*.h src/fpga/V1/*.h)
foreach(f ${_tmp_list})
list(REMOVE_ITEM PADDLE_MOBILE_CC ${f})
list(REMOVE_ITEM PADDLE_MOBILE_H ${f})
endforeach()
endif()
......
......@@ -7,11 +7,21 @@
<!--[![Release](https://img.shields.io/github/release/PaddlePaddle/Paddle-Mobile.svg)](https://github.com/PaddlePaddle/Paddle-Mobile/releases)
[![License](https://img.shields.io/badge/license-Apache%202-blue.svg)](LICENSE)-->
Welcome to the Paddle-Mobile GitHub project. Paddle-Mobile is a project under PaddlePaddle and a deep learning framework for embedded platforms.
Welcome to the Paddle-Mobile GitHub project. Paddle-Mobile is a project under the PaddlePaddle organization, a deep learning framework dedicated to embedded platforms.
## Features
- High performance on ARM CPU
- Support for Mali GPU
- Support for Adreno GPU
- Support for Metal GPU on Apple devices
- Support for ZU5, ZU9 and other FPGA development boards
- Support for Raspberry Pi and other arm-linux development boards
## Features
- High performance on ARM CPU
- Support for Mali GPU
- Support for Adreno GPU
......@@ -19,6 +29,7 @@
- Support for ZU5, ZU9 and other FPGA development boards
- Support for Raspberry Pi and other arm-linux development boards
## Demo
- [ANDROID](https://github.com/xiebaiyuan/paddle-mobile-demo)
......@@ -26,6 +37,27 @@
[https://github.com/PaddlePaddle/paddle-mobile/tree/develop/demo](https://github.com/PaddlePaddle/paddle-mobile/tree/develop/demo)
## Documentation
### Design documentation
If you want to know more about the design of paddle-mobile, please refer to the link below. Many earlier designs and discussions can be found in the project [issues](https://github.com/PaddlePaddle/paddle-mobile/issues).
[design documentation link](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/doc/design_doc.md)
### Development documentation
The development documentation mainly covers building, running and related tasks. As a developer, you can work with it alongside the contributed documents below.
* [iOS](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/doc/development_ios.md)
* [Android_CPU](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/doc/development_android.md)
* [Android_GPU](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/doc/development_android_GPU.md)
* [FPGA](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/doc/development_fpga.md)
* [ARM_LINUX](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/doc/development_arm_linux.md)
### How to contribute your documents
- [tutorial for contributing documents](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/CONTRIBUTING.md)
- The main procedure for contributing code is covered in the document above. If you run into other problems along the way, please file an [issue](https://github.com/PaddlePaddle/paddle-mobile/issues). We will handle it as quickly as possible.
## Documentation
### Design documentation
......@@ -46,6 +78,24 @@
- [link to the contribution documentation](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/CONTRIBUTING.md)
- The document above covers the main workflow for contributing code. If you run into other problems in practice, you can file an [issue](https://github.com/PaddlePaddle/paddle-mobile/issues); we will handle it as soon as we see it.
## Acquisition of Models
At present Paddle-Mobile only supports models trained with Paddle Fluid. If your models are of other kinds, they need to be converted before they can run.
### 1. Train with Paddle Fluid directly
This is the most reliable method and the recommended one.
### 2. Convert a Caffe model to a Paddle Fluid model
[https://github.com/PaddlePaddle/models/tree/develop/fluid/image_classification/caffe2fluid](https://github.com/PaddlePaddle/models/tree/develop/fluid/image_classification/caffe2fluid)
### 3. ONNX
ONNX stands for Open Neural Network Exchange. The project aims at full interchange and reuse of models across different neural network development frameworks.
Besides directly using Fluid models trained with PaddlePaddle, you can also obtain certain Paddle Fluid models through ONNX conversion.
At present, ONNX support work is also under way at Baidu. The related conversion project can be found here:
[https://github.com/PaddlePaddle/paddle-onnx](https://github.com/PaddlePaddle/paddle-onnx)
### 4. Download some test models and test images
[http://mms-graph.bj.bcebos.com/paddle-mobile%2FmodelsAndImages.zip](http://mms-graph.bj.bcebos.com/paddle-mobile%2FmodelsAndImages.zip)
## Acquisition of Models
At present Paddle-Mobile only supports models trained with Paddle Fluid. If your models are of other kinds, they need to be converted before they can run.
......@@ -64,6 +114,22 @@ ONNX stands for "Open Neural Network Exchange"
### 4. Download some test models and test images
[http://mms-graph.bj.bcebos.com/paddle-mobile%2FmodelsAndImages.zip](http://mms-graph.bj.bcebos.com/paddle-mobile%2FmodelsAndImages.zip)
<!--## Online results of Simple Search
The gif below shows the online main-body detection results of the Simple Search app
![ezgif-1-050a733dfb](http://otkwwi4x8.bkt.clouddn.com/2018-07-05-ezgif-1-050a733dfb.gif)-->
## Ask Questions
Questions and fixes are welcome. You can post your questions in the GitHub issue module: [Github Issues](https://github.com/PaddlePaddle/paddle-mobile/issues).
## Copyright and License
Paddle-Mobile is provided under the relatively permissive [Apache-2.0 license](LICENSE).
## Old version Mobile-Deep-Learning
The original MDL (Mobile-Deep-Learning) project has been moved to [Mobile-Deep-Learning](https://github.com/allonli/mobile-deep-learning).
<!--## Online results of Simple Search
The gif below shows the online main-body detection results of the Simple Search app
......
......@@ -16,9 +16,9 @@ limitations under the License. */
#ifdef ENABLE_EXCEPTION
#include <stdio.h>
#include <stdlib.h>
#include <exception>
#include <string>
#endif
namespace paddle_mobile {
......
......@@ -22,9 +22,10 @@ const char *G_OP_TYPE_BATCHNORM = "batch_norm";
const char *G_OP_TYPE_BOX_CODER = "box_coder";
const char *G_OP_TYPE_CONCAT = "concat";
const char *G_OP_TYPE_ELEMENTWISE_ADD = "elementwise_add";
const char *G_OP_TYPE_ELEMENTWISE_SUB = "elementwise_sub";
const char *G_OP_TYPE_ELEMENTWISE_MUL = "elementwise_mul";
const char *G_OP_TYPE_FILL_CONSTANT = "fill_constant";
const char *G_OP_TYPE_FUSION_CONV_ADD_RELU = "fusion_conv_add_relu";
const char *G_OP_TYPE_FUSION_CONV_ADD_RELU_INT8 = "fusion_conv_add_relu_int8";
const char *G_OP_TYPE_FUSION_CONV_ADD_PRELU = "fusion_conv_add_prelu";
const char *G_OP_TYPE_FUSION_CONV_ADD_ADD_PRELU = "fusion_conv_add_add_prelu";
const char *G_OP_TYPE_FUSION_CONV_ADD_BN_RELU = "fusion_conv_add_bn_relu";
......@@ -32,7 +33,6 @@ const char *G_OP_TYPE_FUSION_CONV_BN_ADD_RELU = "fusion_conv_bn_add_relu";
const char *G_OP_TYPE_FUSION_DWCONV_BN_RELU = "fusion_dwconv_bn_relu";
const char *G_OP_TYPE_FUSION_CONV_BN_RELU = "fusion_conv_bn_relu";
const char *G_OP_TYPE_FC = "fusion_fc";
const char *G_OP_TYPE_FC_INT8 = "fusion_fc_int8";
const char *G_OP_TYPE_FUSION_CONV_ADD = "fusion_conv_add";
const char *G_OP_TYPE_LRN = "lrn";
const char *G_OP_TYPE_MUL = "mul";
......@@ -41,6 +41,7 @@ const char *G_OP_TYPE_POLYGON_BOX_TRANSFORM = "polygon_box_transform";
const char *G_OP_TYPE_POOL2D = "pool2d";
const char *G_OP_TYPE_PRIOR_BOX = "prior_box";
const char *G_OP_TYPE_RELU = "relu";
const char *G_OP_TYPE_RELU6 = "relu6";
const char *G_OP_TYPE_RESHAPE = "reshape";
const char *G_OP_TYPE_RESHAPE2 = "reshape2";
const char *G_OP_TYPE_SIGMOID = "sigmoid";
......@@ -68,14 +69,20 @@ const char *G_OP_TYPE_CRF = "crf_decoding";
const char *G_OP_TYPE_BILINEAR_INTERP = "bilinear_interp";
const char *G_OP_TYPE_FLATTEN = "flatten";
const char *G_OP_TYPE_SHAPE = "shape";
const char *G_OP_TYPE_ELEMENTWISE_MUL = "elementwise_mul";
const char *G_OP_TYPE_SUM = "sum";
const char *G_OP_TYPE_TOP_K = "top_k";
const char *G_OP_TYPE_CAST = "cast";
const char *G_OP_TYPE_QUANTIZE = "quantize";
const char *G_OP_TYPE_DEQUANTIZE = "dequantize";
const char *G_OP_TYPE_FUSION_DEQUANT_BN = "fusion_dequant_bn";
const char *G_OP_TYPE_FUSION_DEQUANT_ADD_BN = "fusion_dequant_add_bn";
const char *G_OP_TYPE_FUSION_DEQUANT_BN_RELU = "fusion_dequant_bn_relu";
const char *G_OP_TYPE_FUSION_DEQUANT_ADD_BN_RELU = "fusion_dequant_add_bn_relu";
const char *G_OP_TYPE_FUSION_DEQUANT_ADD_BN_QUANT =
"fusion_dequant_add_bn_quant";
const char *G_OP_TYPE_FUSION_DEQUANT_ADD_BN_RELU_QUANT =
"fusion_dequant_add_bn_relu_quant";
const char *G_OP_TYPE_TANH = "tanh";
const char *G_OP_TYPE_FUSION_DECONV_RELU = "fusion_deconv_relu";
......@@ -91,10 +98,13 @@ std::unordered_map<
{G_OP_TYPE_PRELU, {{"X", "Alpha"}, {"Out"}}},
{G_OP_TYPE_FUSION_CONV_ADD, {{"Input"}, {"Out"}}},
{G_OP_TYPE_RELU, {{"X"}, {"Out"}}},
{G_OP_TYPE_RELU6, {{"X"}, {"Out"}}},
{G_OP_TYPE_SOFTMAX, {{"X"}, {"Out"}}},
{G_OP_TYPE_SIGMOID, {{"X"}, {"Out"}}},
{G_OP_TYPE_MUL, {{"X"}, {"Out"}}},
{G_OP_TYPE_ELEMENTWISE_ADD, {{"X", "Y"}, {"Out"}}},
{G_OP_TYPE_ELEMENTWISE_SUB, {{"X", "Y"}, {"Out"}}},
{G_OP_TYPE_ELEMENTWISE_MUL, {{"X", "Y"}, {"Out"}}},
{G_OP_TYPE_POOL2D, {{"X"}, {"Out"}}},
{G_OP_TYPE_BATCHNORM, {{"X"}, {"Y"}}},
{G_OP_TYPE_LRN, {{"X"}, {"Out"}}},
......@@ -112,13 +122,11 @@ std::unordered_map<
{G_OP_TYPE_MULTICLASS_NMS, {{"BBoxes", "Scores"}, {"Out"}}},
{G_OP_TYPE_POLYGON_BOX_TRANSFORM, {{"Input"}, {"Output"}}},
{G_OP_TYPE_FC, {{"X", "Y", "Z"}, {"Out"}}},
{G_OP_TYPE_FC_INT8, {{"X", "Y", "Z", "Scale"}, {"Out"}}},
{G_OP_TYPE_RESHAPE, {{"X"}, {"Out"}}},
{G_OP_TYPE_RESHAPE2, {{"X"}, {"Out", "XShape"}}},
{G_OP_TYPE_DEPTHWISE_CONV, {{"Input"}, {"Output"}}},
{G_OP_TYPE_FILL_CONSTANT, {{}, {"Out"}}},
{G_OP_TYPE_FUSION_CONV_ADD_RELU, {{"Input"}, {"Out"}}},
{G_OP_TYPE_FUSION_CONV_ADD_RELU_INT8, {{"Input", "Scale"}, {"Out"}}},
{G_OP_TYPE_FUSION_CONV_ADD_PRELU, {{"Input"}, {"Out"}}},
{G_OP_TYPE_FUSION_CONV_ADD_ADD_PRELU, {{"Input"}, {"Out"}}},
{G_OP_TYPE_IM2SEQUENCE, {{"X"}, {"Out"}}},
......@@ -139,12 +147,18 @@ std::unordered_map<
{G_OP_TYPE_SHAPE, {{"Input"}, {"Out"}}},
{G_OP_TYPE_CONV_TRANSPOSE, {{"Input"}, {"Output"}}},
{G_OP_TYPE_SUM, {{"X"}, {"Out"}}},
{G_OP_TYPE_ELEMENTWISE_MUL, {{"X", "Y"}, {"Out"}}},
{G_OP_TYPE_TOP_K, {{"X"}, {"Out", "Indices"}}},
{G_OP_TYPE_CAST, {{"X"}, {"Out"}}},
{G_OP_TYPE_QUANTIZE, {{"X"}, {"Out", "OutScale"}}},
{G_OP_TYPE_DEQUANTIZE, {{"X", "Scale"}, {"Out"}}},
{G_OP_TYPE_FUSION_DEQUANT_ADD_BN, {{"X", "Scale"}, {"Y"}}},
{G_OP_TYPE_FUSION_DEQUANT_BN, {{"X", "Scale"}, {"Out"}}},
{G_OP_TYPE_FUSION_DEQUANT_ADD_BN, {{"X", "Scale"}, {"Out"}}},
{G_OP_TYPE_FUSION_DEQUANT_BN_RELU, {{"X", "Scale"}, {"Out"}}},
{G_OP_TYPE_FUSION_DEQUANT_ADD_BN_RELU, {{"X", "Scale"}, {"Out"}}},
{G_OP_TYPE_FUSION_DEQUANT_ADD_BN_RELU_QUANT,
{{"X", "Scale"}, {"Out", "OutScale"}}},
{G_OP_TYPE_FUSION_DEQUANT_ADD_BN_QUANT,
{{"X", "Scale"}, {"Out", "OutScale"}}},
{G_OP_TYPE_TANH, {{"X"}, {"Out"}}},
{G_OP_TYPE_FUSION_DECONV_RELU, {{"Input"}, {"Out"}}},
{G_OP_TYPE_FUSION_DECONV_ADD, {{"Input"}, {"Out"}}},
......
......@@ -87,10 +87,24 @@ enum PMStatus {
};
enum RoundType {
ROUND_UNK = 0,
ROUND_NEAREST_AWAY_ZERO = 1,
ROUND_NEAREST_TOWARDS_ZERO = 2,
ROUND_NEAREST_TO_EVEN = 3
ROUND_NEAREST_AWAY_ZERO = 0,
ROUND_NEAREST_TOWARDS_ZERO = 1,
ROUND_NEAREST_TO_EVEN = 2,
};
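// Illustrative tie-breaking behaviour of the three rounding modes above
// (a sketch, not part of this change):
//   value    AWAY_ZERO   TOWARDS_ZERO   TO_EVEN
//    2.5         3             2           2
//    3.5         4             3           4
//   -2.5        -3            -2          -2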
enum ActivationType {
IDENTITY = 0,
RELU = 1,
RELU6 = 2,
PRELU = 3,
LEAKY_RELU = 4,
TANH = 5,
SIGMOID = 6,
};
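// Sketch of the intended elementwise semantics (illustrative; the real
// kernels are implemented per backend):
//   IDENTITY: y = x                      RELU:    y = max(x, 0)
//   RELU6:    y = min(max(x, 0), 6)      TANH:    y = tanh(x)
//   PRELU / LEAKY_RELU: y = x > 0 ? x : slope * x
//   SIGMOID:  y = 1 / (1 + exp(-x))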
enum PoolingType {
MAX = 0,
AVG = 1,
};
extern const char *G_OP_TYPE_CONV;
......@@ -98,12 +112,12 @@ extern const char *G_OP_TYPE_BATCHNORM;
extern const char *G_OP_TYPE_BOX_CODER;
extern const char *G_OP_TYPE_CONCAT;
extern const char *G_OP_TYPE_ELEMENTWISE_ADD;
extern const char *G_OP_TYPE_ELEMENTWISE_SUB;
extern const char *G_OP_TYPE_ELEMENTWISE_MUL;
extern const char *G_OP_TYPE_FUSION_CONV_ADD_RELU;
extern const char *G_OP_TYPE_FUSION_CONV_ADD_RELU_INT8;
extern const char *G_OP_TYPE_FUSION_CONV_ADD_PRELU;
extern const char *G_OP_TYPE_FUSION_CONV_ADD_ADD_PRELU;
extern const char *G_OP_TYPE_FC;
extern const char *G_OP_TYPE_FC_INT8;
extern const char *G_OP_TYPE_FUSION_CONV_ADD;
extern const char *G_OP_TYPE_FUSION_CONV_ADD_BN_RELU;
extern const char *G_OP_TYPE_FUSION_CONV_BN_ADD_RELU;
......@@ -116,6 +130,7 @@ extern const char *G_OP_TYPE_MULTICLASS_NMS;
extern const char *G_OP_TYPE_POOL2D;
extern const char *G_OP_TYPE_PRIOR_BOX;
extern const char *G_OP_TYPE_RELU;
extern const char *G_OP_TYPE_RELU6;
extern const char *G_OP_TYPE_RESHAPE;
extern const char *G_OP_TYPE_SIGMOID;
extern const char *G_OP_TYPE_SOFTMAX;
......@@ -136,13 +151,17 @@ extern const char *G_OP_TYPE_FUSION_CONV_BN;
extern const char *G_OP_TYPE_CONV_TRANSPOSE;
extern const char *G_OP_TYPE_PRELU;
extern const char *G_OP_TYPE_SUM;
extern const char *G_OP_TYPE_ELEMENTWISE_MUL;
extern const char *G_OP_TYPE_TOP_K;
extern const char *G_OP_TYPE_CAST;
extern const char *G_OP_TYPE_QUANTIZE;
extern const char *G_OP_TYPE_DEQUANTIZE;
extern const char *G_OP_TYPE_FUSION_DEQUANT_BN;
extern const char *G_OP_TYPE_FUSION_DEQUANT_ADD_BN;
extern const char *G_OP_TYPE_FUSION_DEQUANT_BN_RELU;
extern const char *G_OP_TYPE_FUSION_DEQUANT_ADD_BN_RELU;
extern const char *G_OP_TYPE_FUSION_DEQUANT_ADD_BN_QUANT;
extern const char *G_OP_TYPE_FUSION_DEQUANT_ADD_BN_RELU_QUANT;
extern const char *G_OP_TYPE_TANH;
extern const char *G_OP_TYPE_FUSION_DECONV_RELU;
......
......@@ -24,8 +24,6 @@ namespace fpga {
#define USE_RELU 1
#define USE_BIAS 2
int get_align_image_cw(int cw) { return align_to_x(cw, IMAGE_ALIGNMENT); }
void format_image(framework::Tensor *image_tensor) {
auto dims = image_tensor->dims();
auto channel = dims[1], height = dims[2], width = dims[3];
......@@ -83,6 +81,13 @@ int get_plit_num(framework::Tensor *filter_tensor) {
int div_capacity = filter::calc_division_capacity(chw);
return filter::calc_split_num(num, div_capacity);
}
int get_deconv_plit_num(framework::Tensor *filter_tensor, int stride) {
auto dims = filter_tensor->dims();
auto chw = dims[1] * dims[2] / stride * dims[3] / stride;
auto num = dims[0] * stride;
int div_capacity = filter::calc_division_capacity(chw);
return filter::calc_split_num(num, div_capacity);
}
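// Illustrative numbers (not from this change): a deconv filter of dims
// [32, 3, 4, 4] with stride 2 is treated as num = 32 * 2 = 64 sub-filters
// of chw = 3 * (4 / 2) * (4 / 2) = 12 before the FPGA division capacity
// determines the split count.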
int get_filter_num_per_div(framework::Tensor *filter_tensor, int group_num) {
auto dims = filter_tensor->dims();
......@@ -92,12 +97,17 @@ int get_filter_num_per_div(framework::Tensor *filter_tensor, int group_num) {
return filter::calc_num_per_div(num, group_num, div_capacity);
}
int get_aligned_filter_element_num(int chw) {
return align_to_x(chw, FILTER_ELEMENT_ALIGNMENT);
int get_deconv_filter_num_per_div(framework::Tensor *filter_tensor,
int group_num, int stride) {
auto dims = filter_tensor->dims();
auto chw = dims[1] * dims[2] / stride * dims[3] / stride;
auto num = dims[0] * stride;
int div_capacity = filter::calc_division_capacity(chw);
return filter::calc_num_per_div(num, group_num, div_capacity);
}
int get_aligned_filter_num(int num) {
return align_to_x(num, FILTER_NUM_ALIGNMENT);
int get_aligned_filter_element_num(int chw) {
return align_to_x(chw, FILTER_ELEMENT_ALIGNMENT);
}
void format_filter(framework::Tensor *filter_tensor, float max_value,
......@@ -177,46 +187,37 @@ void format_concat_output(framework::Tensor *out, int height, int width,
void expand_conv_arg(ConvArgs *arg) {
ConvArgs args = *arg;
uint64_t filterlen = (uint64_t)args.kernel.width *
(uint64_t)args.kernel.height *
(uint64_t)args.image.channels;
filterlen = align_to_x(filterlen, FILTER_ELEMENT_ALIGNMENT);
filterlen *= align_to_x((uint64_t)args.filter_num, FILTER_NUM_ALIGNMENT);
uint64_t fpga_bias_scale_len =
auto fpga_bias_scale_len =
align_to_x(args.filter_num / args.group_num, 8) * args.group_num;
uint64_t output_height =
auto output_height =
(args.image.height + args.image.pad_height * 2 - args.kernel.height) /
args.kernel.stride_h +
1;
uint64_t output_width =
auto output_width =
(args.image.width + args.image.pad_width * 2 - args.kernel.width) /
args.kernel.stride_w +
1;
uint64_t output_size =
output_height * output_width * (uint64_t)args.filter_num;
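// Worked example (illustrative): a 224x224 image with pad_height =
// pad_width = 1, a 3x3 kernel and stride 2 gives output_height =
// (224 + 2 * 1 - 3) / 2 + 1 = 112, and with filter_num = 32 the output
// holds 112 * 112 * 32 values.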
auto filter_per_group = (uint64_t)(args.filter_num / args.group_num);
auto channel_per_group = (uint64_t)(args.image.channels / args.group_num);
uint64_t image_row_count = ((uint64_t)args.image.width) *
((uint64_t)args.image.channels); // without align
uint64_t image_amount_per_row = align_to_x(image_row_count, IMAGE_ALIGNMENT);
uint64_t image_one_pad_per_row =
align_to_x(image_row_count, IMAGE_ALIGNMENT) +
((uint64_t)args.image.pad_width) * ((uint64_t)args.image.channels);
uint64_t filter_amount_all =
align_to_x(((uint64_t)args.kernel.height) *
((uint64_t)args.kernel.width) * channel_per_group,
auto filter_per_group = args.filter_num / args.group_num;
auto channel_per_group = args.image.channels / args.group_num;
auto image_row_count = args.image.width * args.image.channels;
auto image_amount_per_row = align_to_x(image_row_count, IMAGE_ALIGNMENT);
auto image_one_pad_per_row = align_to_x(image_row_count, IMAGE_ALIGNMENT) +
args.image.pad_width * args.image.channels;
auto filter_amount_all =
align_to_x(args.kernel.height * args.kernel.width * channel_per_group,
FILTER_ELEMENT_ALIGNMENT);
uint64_t output_amount_per_row =
align_to_x(output_width * ((uint64_t)args.filter_num), IMAGE_ALIGNMENT);
auto output_amount_per_row =
align_to_x(output_width * args.filter_num, IMAGE_ALIGNMENT);
// find the opt partition strategy
uint64_t res_win;
uint64_t res_fit = 0;
for (res_win = 1; res_win <= output_width; res_win = res_win + 1) {
for (res_win = 1; res_win <= output_width; res_win++) {
if ((align_to_x(
(args.image.channels *
(args.kernel.width + (res_win - 1) * args.kernel.stride_w)),
......@@ -238,48 +239,48 @@ void expand_conv_arg(ConvArgs *arg) {
}
res_fit = res_win;
uint64_t block_num = (output_width + res_fit - 1) / res_fit;
uint64_t block_len = res_fit;
uint64_t block_last = output_width - res_fit * (block_num - 1);
auto block_num = (output_width + res_fit - 1) / res_fit;
auto block_len = res_fit;
auto block_last = output_width - res_fit * (block_num - 1);
uint64_t res_amount_per_row = output_width * args.filter_num;
uint64_t res_amount_per_row_pad = output_amount_per_row - res_amount_per_row;
auto res_amount_per_row = output_width * args.filter_num;
auto res_amount_per_row_pad = output_amount_per_row - res_amount_per_row;
uint64_t image_block_amount_per_row =
args.kernel.stride_w * (res_fit)*args.image.channels;
uint64_t filter_pad_width_mul_channel =
auto image_block_amount_per_row =
args.kernel.stride_w * res_fit * args.image.channels;
auto filter_pad_width_mul_channel =
args.image.pad_width * args.image.channels;
uint64_t image_amount_per_row_multi_win_first =
auto image_amount_per_row_multi_win_first =
image_amount_per_row * (4 * args.kernel.stride_h - args.image.pad_height);
uint64_t image_amount_per_row_multi_win =
auto image_amount_per_row_multi_win =
image_amount_per_row * (4 * args.kernel.stride_h);
uint64_t image_block_num = block_num;
uint64_t image_block_len =
auto image_block_num = block_num;
auto image_block_len =
align_to_x((args.image.channels *
(args.kernel.width + (block_len - 1) * args.kernel.stride_w)),
IMAGE_ALIGNMENT) /
16 +
1;
uint64_t image_block_len_last =
auto image_block_len_last =
align_to_x(
(args.image.channels *
(args.kernel.width + (block_last - 1) * args.kernel.stride_w)),
IMAGE_ALIGNMENT) /
16 +
1;
uint64_t image_win_cnt = block_len;
uint64_t image_win_cnt_last = block_last;
uint64_t res_row_data_align4_pad = res_amount_per_row_pad / 8;
uint64_t prog_full_cnt = 2048 / (filter_amount_all / 16 * 2) - 1;
auto image_win_cnt = block_len;
auto image_win_cnt_last = block_last;
auto res_row_data_align4_pad = res_amount_per_row_pad / 8;
auto prog_full_cnt = 2048 / (filter_amount_all / 16 * 2) - 1;
if (prog_full_cnt == 1023) {
prog_full_cnt--;
}
uint64_t post_prog_full_cnt =
auto post_prog_full_cnt =
(512 / (align_to_x(args.filter_num, 4) / 4 * 2) > 2)
? (512 / (align_to_x(args.filter_num, 4) / 4 * 2) - 2)
: 0;
uint64_t cmd = 0UL | (args.relu_enabled ? USE_RELU : 0) | USE_BIAS;
auto cmd = 0UL | (args.relu_enabled ? USE_RELU : 0) | USE_BIAS;
(*arg).driver.image_address_phy = vaddr_to_paddr(args.image.address);
(*arg).driver.sb_address_phy = vaddr_to_paddr(args.sb_address);
......@@ -449,7 +450,6 @@ void fill_deconv_arg(struct DeconvArgs *arg, framework::Tensor *input,
arg->sub_conv_num = (uint32_t)stride_h;
arg->filter_num = (uint32_t)filter->dims()[0];
int sub_conv_num = arg->sub_conv_num;
int sub_stride = 1;
int sub_pad = deconv_filter::deconv_calc_sub_pad((int)filter->dims()[3],
padding_w, stride_w);
int sub_filter_width = deconv_filter::deconv_get_sub_filter_axis(
......@@ -464,7 +464,9 @@ void fill_deconv_arg(struct DeconvArgs *arg, framework::Tensor *input,
arg->sub_output_height = (uint32_t)sub_output_height;
arg->omit_size = (uint32_t)deconv_filter::deconv_get_omit(
stride_w, (int)filter->dims()[3], padding_w);
arg->conv_args = (ConvArgs *)fpga_malloc(sub_conv_num * sizeof(ConvArgs));
arg->output.address = out_ptr;
arg->output.scale_address = out->scale;
int sub_channels = (int)input->dims()[1];
int omit_size = arg->omit_size;
......@@ -484,50 +486,160 @@ void fill_deconv_arg(struct DeconvArgs *arg, framework::Tensor *input,
int align_conv_sub_filter_count =
align_sub_filter_count * align_sub_filter_num;
int split_num =
group_num == 1 ? (uint32_t)get_deconv_plit_num(filter, sub_conv_num) : 1;
arg->split_conv_args =
(SplitConvArgs *)fpga_malloc(sub_conv_num * sizeof(SplitConvArgs));
for (int i = 0; i < sub_conv_num; ++i) {
arg->conv_args[i].filter_num = (arg->sub_conv_num) * (arg->filter_num);
arg->conv_args[i].group_num = (uint32_t)group_num;
arg->conv_args[i].filter_scale_address = filter->scale;
arg->conv_args[i].relu_enabled = relu_enabled;
arg->conv_args[i].kernel.width = (uint32_t)sub_filter_width;
arg->conv_args[i].kernel.height = (uint32_t)sub_filter_width;
arg->conv_args[i].kernel.stride_w = 1;
arg->conv_args[i].kernel.stride_h = 1;
// DeconvParam.conv_args[i].image.address = (void*)ptr_image;
arg->conv_args[i].image.scale_address = input->scale;
arg->conv_args[i].image.channels = (uint32_t)sub_channels;
arg->conv_args[i].image.width = (uint32_t)input->dims()[3];
arg->conv_args[i].image.height = (uint32_t)input->dims()[2];
arg->conv_args[i].image.pad_width = (uint32_t)sub_pad;
arg->conv_args[i].image.pad_height = (uint32_t)sub_pad;
arg->conv_args[i].image.address = input_ptr;
arg->conv_args[i].sb_address = (void *)bs_ptr;
auto filter_sub_space =
(char *)fpga_malloc(align_conv_sub_filter_count * sizeof(char));
fpga_copy(filter_sub_space,
(char *)filter_ptr + i * align_conv_sub_filter_count,
(size_t)align_conv_sub_filter_count);
arg->conv_args[i].filter_address = (void *)(filter_sub_space);
fpga_flush(filter_sub_space, (size_t)align_conv_sub_filter_count);
arg->split_conv_args[i].filter_num =
(arg->sub_conv_num) * (arg->filter_num);
arg->split_conv_args[i].group_num = (uint32_t)group_num;
arg->split_conv_args[i].split_num = split_num;
arg->split_conv_args[i].conv_arg =
(ConvArgs *)fpga_malloc(split_num * sizeof(ConvArgs));
arg->split_conv_args[i].concat_arg.height = sub_output_height;
arg->split_conv_args[i].concat_arg.width = sub_output_width;
arg->split_conv_args[i].concat_arg.image_num = split_num;
arg->split_conv_args[i].concat_arg.images_in =
(half **)fpga_malloc(split_num * sizeof(half *));
arg->split_conv_args[i].concat_arg.scales_in =
(float **)fpga_malloc(split_num * sizeof(float *));
arg->split_conv_args[i].concat_arg.channel_num =
(uint32_t *)fpga_malloc(split_num * sizeof(uint32_t));
// arg->split_conv_args[i].concat_arg.image_out =
// fpga_malloc(conv_output_size * sizeof(half));
// arg->split_conv_args[i].concat_arg.scale_out = fpga_malloc(2 *
// sizeof(float));
}
int filter_num_per_div =
get_deconv_filter_num_per_div(filter, group_num, stride_w);
int element_num = get_aligned_filter_element_num(
(int)(sub_channels * sub_filter_width * sub_filter_width));
int chw = sub_channels * sub_filter_width * sub_filter_width;
int division_capacity = filter::calc_division_capacity(chw);
int num_per_div_before_alignment =
filter::calc_num_per_div(sub_filter_num, group_num, division_capacity);
int num_per_div_after_alignment =
align_to_x(num_per_div_before_alignment, FILTER_NUM_ALIGNMENT);
int div_num = (sub_filter_num + num_per_div_before_alignment - 1) /
num_per_div_before_alignment;
int residual = sub_filter_num % num_per_div_before_alignment;
int num_after_alignment = num_per_div_after_alignment *
((residual == 0) ? div_num : (div_num - 1)) +
align_to_x(residual, FILTER_NUM_ALIGNMENT);
int filter_sub_conv_offset = element_num * num_after_alignment;
for (int i = 0; i < sub_conv_num; ++i) {
if (sub_conv_num == 1) {
arg->conv_args[i].output.address = out_ptr;
arg->conv_args[i].output.scale_address = out->scale;
arg->split_conv_args[i].output.address = arg->output.address;
arg->split_conv_args[i].output.scale_address = arg->output.scale_address;
} else {
auto ptr_output = (half *)fpga_malloc(conv_output_size * sizeof(half));
arg->conv_args[i].output.address = (void *)((half *)ptr_output);
arg->split_conv_args[i].output.address = (void *)((half *)ptr_output);
auto ptr_output_scale = (float *)fpga_malloc(2 * sizeof(float));
arg->conv_args[i].output.scale_address = ptr_output_scale;
arg->split_conv_args[i].output.scale_address = ptr_output_scale;
}
}
arg->output.address = out_ptr;
arg->output.scale_address = out->scale;
// fpga_free(filter_ptr);
for (int j = 0; j < split_num; ++j) {
arg->split_conv_args[i].conv_arg[j].relu_enabled = relu_enabled;
arg->split_conv_args[i].conv_arg[j].group_num = (uint32_t)group_num;
arg->split_conv_args[i].conv_arg[j].kernel.width =
(uint32_t)sub_filter_width;
arg->split_conv_args[i].conv_arg[j].kernel.height =
(uint32_t)sub_filter_width;
arg->split_conv_args[i].conv_arg[j].kernel.stride_w = 1;
arg->split_conv_args[i].conv_arg[j].kernel.stride_h = 1;
arg->split_conv_args[i].conv_arg[j].image.scale_address = input->scale;
arg->split_conv_args[i].conv_arg[j].image.channels =
(uint32_t)sub_channels;
arg->split_conv_args[i].conv_arg[j].image.width =
(uint32_t)input->dims()[3];
arg->split_conv_args[i].conv_arg[j].image.height =
(uint32_t)input->dims()[2];
arg->split_conv_args[i].conv_arg[j].image.pad_width = (uint32_t)sub_pad;
arg->split_conv_args[i].conv_arg[j].image.pad_height = (uint32_t)sub_pad;
arg->split_conv_args[i].conv_arg[j].image.address = input_ptr;
arg->split_conv_args[i].conv_arg[j].filter_scale_address = filter->scale;
arg->split_conv_args[i].conv_arg[j].filter_num = (uint32_t)(
j == split_num - 1
? sub_filter_num - (split_num - 1) * filter_num_per_div // NOLINT
: filter_num_per_div);
size_t filter_size =
element_num *
align_to_x(arg->split_conv_args[i].conv_arg[j].filter_num,
FILTER_NUM_ALIGNMENT) *
sizeof(int8_t);
auto filter_head =
&((int8_t *)filter_ptr)[j * element_num * filter_num_per_div +
i * filter_sub_conv_offset];
arg->split_conv_args[i].conv_arg[j].filter_address =
fpga_malloc(filter_size);
memcpy(arg->split_conv_args[i].conv_arg[j].filter_address, filter_head,
filter_size);
fpga_flush(arg->split_conv_args[i].conv_arg[j].filter_address,
filter_size);
{
static int test_cnt = 0;
signed char result = 0;
if (test_cnt <= 1) {
std::string filename = "deconv_split_flt" + std::to_string(test_cnt);
fpga::savefile<signed char>(
filename, arg->split_conv_args[i].conv_arg[j].filter_address,
filter_size, result);
test_cnt++;
}
}
size_t bs_align_num = align_to_x(
arg->split_conv_args[i].conv_arg[j].filter_num, BS_NUM_ALIGNMENT);
size_t bs_size = 2 * bs_align_num * sizeof(float);
auto bs_head = &bs_ptr[j * filter_num_per_div * 2];
arg->split_conv_args[i].conv_arg[j].sb_address = fpga_malloc(bs_size);
memcpy(arg->split_conv_args[i].conv_arg[j].sb_address, bs_head, bs_size);
fpga_flush(arg->split_conv_args[i].conv_arg[j].sb_address, bs_size);
if (split_num == 1) {
arg->split_conv_args[i].conv_arg[j].output.address =
arg->split_conv_args[i].output.address;
arg->split_conv_args[i].conv_arg[j].output.scale_address =
arg->split_conv_args[i].output.scale_address;
} else {
auto ptr_output = (half *)fpga_malloc(conv_output_size * sizeof(half));
arg->split_conv_args[i].conv_arg[j].output.address =
(void *)((half *)ptr_output);
auto ptr_output_scale = (float *)fpga_malloc(2 * sizeof(float));
arg->split_conv_args[i].conv_arg[j].output.scale_address =
ptr_output_scale;
}
arg->split_conv_args[i].concat_arg.images_in[j] =
(half *)arg->split_conv_args[i].conv_arg[j].output.address; // NOLINT
arg->split_conv_args[i].concat_arg.scales_in[j] =
arg->split_conv_args[i].conv_arg[j].output.scale_address;
arg->split_conv_args[i].concat_arg.channel_num[j] =
arg->split_conv_args[i].conv_arg[j].filter_num;
expand_conv_arg(&(arg->split_conv_args[i].conv_arg[j]));
}
arg->split_conv_args[i].concat_arg.image_out =
arg->split_conv_args[i].output.address;
arg->split_conv_args[i].concat_arg.scale_out =
arg->split_conv_args[i].output.scale_address;
}
filter->reset_data_ptr(nullptr);
fpga_free(bs_ptr);
} // fill_deconv_arg
} // namespace fpga
......
......@@ -21,16 +21,19 @@ limitations under the License. */
namespace paddle_mobile {
namespace fpga {
int get_align_image_cw(int cw);
void format_image(framework::Tensor* image_tensor);
void format_fp16_ofm(framework::Tensor* ofm_tensor); // only allocate memory
void format_fp32_ofm(framework::Tensor* ofm_tensor);
float filter_find_max(framework::Tensor* filter_tensor);
int get_filter_num_per_div(framework::Tensor* filter_tensor, int group_num);
int get_deconv_filter_num_per_div(framework::Tensor* filter_tensor,
int group_num, int stride);
int get_plit_num(framework::Tensor* filter_tensor);
int get_deconv_plit_num(framework::Tensor* filter_tensor, int stride);
int get_aligned_filter_element_num(int chw);
int get_aligned_filter_num(int num);
void format_filter(framework::Tensor* filter_tensor, float max_value,
int group_num);
void format_fc_filter(framework::Tensor* filter_tensor, float max_value);
......
......@@ -40,10 +40,9 @@ inverse kernel weights of each channel for every filter
void deconv_inverse_filter(float** data_in, int num, int channel, int width,
int height) {
float* tmp = *data_in;
// float fix_range = 127;// float scale = fix_range / max;
int data_size = num * channel * width * height;
int hw_len = height * width;
float* tmp_data = (float*)fpga_malloc(data_size * sizeof(float));
auto tmp_data = (float*)fpga_malloc(data_size * sizeof(float));
for (int i = 0; i < num; ++i) {
for (int j = 0; j < channel; ++j) {
for (int k = 0; k < hw_len; ++k) {
......@@ -52,7 +51,7 @@ void deconv_inverse_filter(float** data_in, int num, int channel, int width,
}
}
}
*data_in = (float*)tmp_data; //
*data_in = tmp_data;
fpga_free(tmp);
}
......@@ -61,8 +60,7 @@ void deconv_inverse_filter(float** data_in, int num, int channel, int width,
*/
int deconv_calc_sub_pad(int filter_axis, int pad, int stride) {
if (stride == 0 || ((filter_axis - pad - 1) < 0)) {
// error
return 0;
PADDLE_MOBILE_ENFORCE(false, "Wrong deconv parameters");
}
return (filter_axis - pad - 1) / stride;
}
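// Worked example (illustrative): a 4x4 deconv kernel (filter_axis = 4) with
// pad = 1 and stride = 2 pads each sub-convolution by (4 - 1 - 1) / 2 = 1.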
......@@ -79,11 +77,8 @@ int deconv_get_sub_out_axis(int image_axis, int sub_pad, int sub_filter_axis) {
position. so the omit rows or columns is (stride - )
*/
int deconv_get_omit(int stride, int filter_width, int pad) {
if (((filter_width - pad) <= 0)) { // ((filter_width-pad) > stride) ||
// error
return 0;
}
int idx = 1;
PADDLE_MOBILE_ENFORCE(filter_width > pad, "Wrong deconv parameters");
int idx;
bool flag = false;
for (idx = 1; idx <= stride; ++idx) {
int j = idx;
......@@ -102,10 +97,6 @@ int deconv_get_omit(int stride, int filter_width, int pad) {
return (stride - idx);
}
int deconv_get_sub_filter_num(int filter_num, int stride) {
return filter_num * stride;
}
void deconv_get_sub_filter(char** data_in, int height, int width,
int sub_conv_n, int kernel_num, int channel) {
char* ptr_tmp = *data_in;
......@@ -245,7 +236,6 @@ void deconv_format_filter(float** data_in, int num, int channel, int height,
char* ptr_space = (char*)fpga_malloc(sub_conv_n * align_offset *
sizeof(char)); // continuous space
for (int i = 0; i < sub_conv_n; ++i) {
int offset = i * origin_offset;
char* ptr_tmp = (ptr_ptr_data)[i];
filter::align_element(&ptr_tmp, sub_num, sub_chw);
......
......@@ -21,7 +21,6 @@ namespace deconv_filter {
void deconv_inverse_filter(float** data_in, int num, int channel, int width,
int height);
int deconv_calc_sub_pad(int filter_axis, int pad, int stride);
int deconv_get_sub_filter_num(int filter_num, int stride);
int deconv_get_sub_filter_axis(int filter_axis, int stride);
int deconv_get_sub_out_axis(int image_axis, int sub_pad, int sub_filter_axis);
int deconv_get_omit(int stride, int filter_width, int pad);
......
This diff is collapsed.
......@@ -153,10 +153,6 @@ int memory_request(struct fpga_memory *memory, size_t size, uint64_t *addr) {
uint64_t _nr = DIV_ROUND_UP(size, FPGA_PAGE_SIZE);
unsigned int nr = (unsigned int)_nr;
int ret = 0;
DLOG << size;
DLOG << _nr;
DLOG << nr;
uint64_t a_size = FPGA_PAGE_SIZE * nr;
DLOG << a_size;
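// e.g. a 40 KB request with FPGA_PAGE_SIZE = 16 KB gives nr = 3 and
// a_size = 48 KB: requests are rounded up to whole FPGA pages
// (illustrative numbers).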
......@@ -283,7 +279,7 @@ int fpga_memory_add() {
return 0;
}
uint64_t vaddr_to_paddr(void *address) {
uint64_t vaddr_to_paddr_driver(void *address) {
uint64_t paddr = 0;
auto iter = g_fpgainfo.fpga_vaddr2paddr_map.find(address);
if (iter != g_fpgainfo.fpga_vaddr2paddr_map.end()) {
......@@ -315,7 +311,7 @@ void *fpga_reg_free(void *ptr) {
g_fpgainfo.fpga_addr2size_map.erase(iter);
munmap(ptr, size);
} else {
DLOG << "Invalid pointer";
DLOG << "Invalid pointer" << ptr;
}
}
......@@ -347,7 +343,7 @@ void fpga_free_driver(void *ptr) {
g_fpgainfo.fpga_addr2size_map.erase(iter);
munmap(ptr, size);
p_addr = vaddr_to_paddr(ptr);
p_addr = vaddr_to_paddr_driver(ptr);
pos = (p_addr - g_fpgainfo.memory_info->mem_start) / FPGA_PAGE_SIZE;
/*clear bitmap*/
......@@ -361,7 +357,7 @@ void fpga_free_driver(void *ptr) {
g_fpgainfo.fpga_vaddr2paddr_map.erase(iter);
}
} else {
DLOG << "Invalid pointer";
DLOG << "Invalid pointer" << ptr;
}
}
......@@ -373,7 +369,7 @@ int fpga_flush_driver(void *address, size_t size) {
struct MemoryCacheArgs args;
uint64_t p_addr;
p_addr = vaddr_to_paddr(address);
p_addr = vaddr_to_paddr_driver(address);
args.offset = (void *)(p_addr - FPGA_MEM_PHY_ADDR); // NOLINT
args.size = size;
......@@ -385,7 +381,7 @@ int fpga_invalidate_driver(void *address, size_t size) {
struct MemoryCacheArgs args;
uint64_t p_addr;
p_addr = vaddr_to_paddr(address);
p_addr = vaddr_to_paddr_driver(address);
args.offset = (void *)(p_addr - FPGA_MEM_PHY_ADDR); // NOLINT
args.size = size;
......
......@@ -31,8 +31,8 @@ namespace driver {
#define FPGA_REG_PHY_ADDR 0xa0000000
#define FPGA_REG_SIZE 0x1000
#define FPGA_MEM_PHY_ADDR 0x20000000
#define FPGA_MEM_SIZE 0x20000000
#define FPGA_MEM_PHY_ADDR 0x40000000
#define FPGA_MEM_SIZE 0x80000000
#define FPGA_PAGE_SIZE (16UL * 1024UL)
......@@ -122,15 +122,11 @@ void *fpga_malloc_driver(size_t size);
void fpga_free_driver(void *ptr);
void fpga_copy_driver(void *dest, const void *src, size_t num);
int fpga_flush_driver(void *address, size_t size);
int fpga_invalidate_driver(void *address, size_t size);
/*pe*/
uint64_t vaddr_to_paddr(void *address);
uint64_t vaddr_to_paddr_driver(void *address);
int fpga_regpoll(uint64_t reg, uint64_t val, int time);
......
......@@ -59,6 +59,9 @@ int close_device() {
void *fpga_malloc(size_t size) {
static uint64_t counter = 0;
if (size <= 0) {
size = 1;
}
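// A zero-sized request is bumped to one byte so the caller always receives
// a valid, individually freeable pointer.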
#ifdef PADDLE_MOBILE_ZU5
auto ptr = driver::fpga_malloc_driver(size);
#else
......@@ -115,7 +118,7 @@ int fpga_invalidate(void *address, size_t size) {
}
uint64_t vaddr_to_paddr(void *address) {
#ifdef PADDLE_MOBILE_ZU5
return driver::vaddr_to_paddr(address);
return driver::vaddr_to_paddr_driver(address);
#else
return 0;
#endif
......
......@@ -37,6 +37,18 @@ enum LayoutType {
LAYOUT_HWC = 0,
};
enum ActivationType {
NONE = 0,
LEAKYRELU = 1,
SIGMOID = 2,
TANH = 3,
};
struct ActivationArgs {
enum ActivationType activation_type;
int16_t leaky_relu_negative_slope;
};
struct KernelArgs {
uint32_t width;
uint32_t height;
......@@ -58,7 +70,10 @@ struct ImageOutputArgs {
void* address; // output result address;
float* scale_address; // output scale address;
uint64_t timer_cnt; // time counter for FPGA computation
struct ActivationArgs
activation; // To select activation and specify (Leaky)Relu parameter.
};
#ifdef PADDLE_MOBILE_FPGA_V1
struct ConvDriverParam {
uint64_t image_address_phy;
......@@ -195,10 +210,14 @@ struct DeconvArgs {
uint32_t sub_output_width;
uint32_t sub_output_height;
struct ImageOutputArgs output;
struct ConvArgs* conv_args;
struct SplitConvArgs* split_conv_args;
};
static inline int align_to_x(int num, int x) { return (num + x - 1) / x * x; }
// static inline int align_to_x(int num, int x) { return (num + x - 1) / x * x;
// }
static inline uint32_t align_to_x(int64_t num, int64_t x) {
return ((uint32_t)(num + x) - 1) / (uint32_t)x * (uint32_t)x;
}
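// Quick sanity check of the rewritten helper (illustrative):
//   align_to_x(20, 16) == (20 + 16 - 1) / 16 * 16 == 32
//   align_to_x(32, 16) == 32   (multiples are unchanged)
//   align_to_x(1, 8)   == 8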
int16_t fp32_2_fp16(float fp32_num);
float fp16_2_fp32(int16_t fp16_num);
......
......@@ -28,6 +28,10 @@ extern _PaddleMobile__Framework__Proto__VarType__Type ToDataType(
extern std::type_index ToTypeIndex(
_PaddleMobile__Framework__Proto__VarType__Type type);
inline _PaddleMobile__Framework__Proto__VarType__Type ToDataType(int type) {
return static_cast<_PaddleMobile__Framework__Proto__VarType__Type>(type);
}
template <typename Visitor>
inline void VisitDataType(_PaddleMobile__Framework__Proto__VarType__Type type,
Visitor visitor) {
......
This diff is collapsed.
......@@ -17,6 +17,7 @@ limitations under the License. */
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>
#include "common/types.h"
#include "common/util.h"
......@@ -28,41 +29,29 @@ limitations under the License. */
namespace paddle_mobile {
namespace framework {
template <typename Dtype = CPU, Precision P = Precision::FP32>
template <typename Device, typename T = float>
class Executor {
public:
typedef typename PrecisionTrait<P>::ptype Ptype;
// exector constructor
// @param program program converted from proto program in PaddlePaddle
// @param use_optimize bool whether use operator fusion to speed up or not
// @param loddable bool
Executor(const framework::Program<Dtype> program, int batch_size = 1,
const bool use_optimize = true, const bool loddable = false);
// predict with tensor input
// @param t input tensor to do prediction
// @return predicted tensor
std::shared_ptr<framework::Tensor> Predict(const framework::Tensor &t);
// predict with lod tensor input
// @param t input lod tensor to do prediction
// @return predicted lod tensor
std::shared_ptr<framework::LoDTensor> PredictLod(
const framework::LoDTensor &t);
// predict with vector input and dims
// @param input vector whose elements will be formed
// @param input lod tensor to do prediction
// @param dims vector whose elements will be formed
// @param input tensor shape
// @return vector which is flatted from predicted tensor
std::vector<Ptype> Predict(const std::vector<Ptype> &input,
const std::vector<int64_t> &dims);
Executor(const Program<Device> &program, int batch_size = 1,
const bool use_optimize = true, const bool lod_mode = false);
PMStatus Predict(const std::vector<std::pair<std::string, Tensor>> &inputs);
PMStatus Predict(
const std::vector<std::pair<std::string, LoDTensor>> &inputs);
std::vector<T> Predict(const std::vector<T> &input,
const std::vector<int64_t> &dims);
PMStatus Predict();
void SetInput(const Tensor &input, const std::string &var_name);
void SetInput(const LoDTensor &input, const std::string &var_name);
std::shared_ptr<LoDTensor> GetOutput(const std::string &var_name);
#ifdef PADDLE_MOBILE_FPGA
void InjectVariable(const framework::Tensor &t, std::string var_name);
void FeedData(const framework::Tensor &t);
std::shared_ptr<framework::Tensor> FetchResult(int id = -1);
void InjectVariable(const Tensor &t, std::string var_name);
void FeedData(const Tensor &t);
std::shared_ptr<Tensor> FetchResult(int id = -1);
void Predict_From_To(int start = 0, int end = -1);
void Predict_From(int start);
void Predict_To(int end);
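For orientation, a hedged usage sketch of the reworked interface above; the variable names "image" and "fetch", the input shape, and the already-loaded `program` are assumptions, not taken from this diff:

```cpp
using paddle_mobile::CPU;
using namespace paddle_mobile::framework;

// program: a Program<CPU> previously produced by Loader (assumption)
Executor<CPU, float> executor(program, /*batch_size=*/1,
                              /*use_optimize=*/true, /*lod_mode=*/false);

Tensor input;
input.Resize(make_ddim({1, 3, 224, 224}));
input.mutable_data<float>();                // fill with real image data

executor.SetInput(input, "image");          // bind the input by variable name
executor.Predict();                         // run the whole program once
auto output = executor.GetOutput("fetch");  // std::shared_ptr<LoDTensor>
```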
......@@ -70,26 +59,28 @@ class Executor {
protected:
Executor() = default;
std::shared_ptr<framework::Tensor> Predict(const framework::Tensor &t,
int block_id);
bool varInputMemory(const std::shared_ptr<framework::VarDesc> &var_desc,
framework::Variable *var,
framework::LoDTensor *tensor) const;
bool varInputMemory(const std::shared_ptr<VarDesc> &var_desc, Variable *var,
LoDTensor *tensor) const;
void InitMemory();
void InitCombineMemory();
void LoadMemory(void **data,
const std::shared_ptr<framework::VarDesc> var_desc,
framework::LoDTensor *tensor);
void LoadMemory(void **data, const std::shared_ptr<VarDesc> var_desc,
LoDTensor *tensor);
#ifdef PADDLE_MOBILE_CL
void LoadMemory(const framework::VarDesc var_desc, float *tensorInput,
char **data);
void LoadMemory(const VarDesc var_desc, float *tensorInput, char **data);
#endif
framework::Program<Dtype> program_;
int batch_size_ = 1;
std::shared_ptr<framework::ProgramDesc> to_predict_program_;
std::map<framework::BlockDesc,
std::vector<std::shared_ptr<framework::OperatorBase<Dtype>>>>
ops_of_block_;
int batch_size_;
bool use_optimize_;
bool lod_mode_;
Program<Device> program_;
std::shared_ptr<ProgramDesc> program_desc_;
typedef std::shared_ptr<OperatorBase<Device>> OperatorBasePtr;
std::vector<std::vector<OperatorBasePtr>> ops_of_block_;
// operators list
std::vector<OperatorBasePtr> ops_list_;
#ifdef PADDLE_MOBILE_PROFILE
struct ProfInfo {
int tid = 0;
......@@ -97,8 +88,6 @@ class Executor {
uint64_t runEnd = 0UL;
};
#endif
bool use_optimize_ = false;
bool loddable_ = false;
};
} // namespace framework
......
......@@ -191,6 +191,7 @@ LOAD_OP2(mul, CPU, MALI_GPU);
#endif
#ifdef RELU_OP
LOAD_OP2(relu, CPU, MALI_GPU);
LOAD_OP1(relu6, CPU);
#endif
#ifdef IM2SEQUENCE_OP
LOAD_OP1(im2sequence, CPU);
......@@ -227,12 +228,22 @@ LOAD_FUSION_MATCHER(fusion_conv_bn);
#ifdef ELEMENTWISESUB_OP
LOAD_OP1(elementwise_sub, CPU)
#endif
#ifdef TOP_K_OP
LOAD_OP1(top_k, CPU)
#endif
#ifdef CAST_OP
LOAD_OP1(cast, CPU)
#endif
#ifdef QUANT_OP
LOAD_OP1(quantize, CPU);
#endif
#ifdef DEQUANT_OP
LOAD_OP1(dequantize, CPU);
#endif
#ifdef FUSION_DEQUANT_BN_OP
LOAD_OP1(fusion_dequant_bn, CPU);
LOAD_FUSION_MATCHER(fusion_dequant_bn);
#endif
#ifdef FUSION_DEQUANT_ADD_BN_OP
LOAD_OP1(fusion_dequant_add_bn, CPU);
LOAD_FUSION_MATCHER(fusion_dequant_add_bn);
......@@ -245,3 +256,11 @@ LOAD_FUSION_MATCHER(fusion_dequant_bn_relu);
LOAD_OP1(fusion_dequant_add_bn_relu, CPU);
LOAD_FUSION_MATCHER(fusion_dequant_add_bn_relu);
#endif
#ifdef FUSION_DEQUANT_ADD_BN_QUANT_OP
LOAD_OP1(fusion_dequant_add_bn_quant, CPU);
LOAD_FUSION_MATCHER(fusion_dequant_add_bn_quant);
#endif
#ifdef FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP
LOAD_OP1(fusion_dequant_add_bn_relu_quant, CPU);
LOAD_FUSION_MATCHER(fusion_dequant_add_bn_relu_quant);
#endif
......@@ -23,14 +23,8 @@ limitations under the License. */
namespace paddle_mobile {
namespace framework {
/**
* muteandresize tensor as originProgramDesc and scope in loadParams
*
* @param originProgramDesc
* @param scope
*/
template <typename Dtype, Precision P>
void Loader<Dtype, P>::InitMemoryFromProgram(
template <typename Device, typename T>
void Loader<Device, T>::InitMemoryFromProgram(
const std::shared_ptr<ProgramDesc> &originProgramDesc,
const std::shared_ptr<Scope> &scope) {
for (const auto &block : originProgramDesc.get()->Blocks()) {
......@@ -43,8 +37,6 @@ void Loader<Dtype, P>::InitMemoryFromProgram(
tensor->Resize(make_ddim(dim));
} else {
auto dim = var_desc->Tensor_desc().Dims();
// PADDLE_MOBILE_ENFORCE(dim.size() > 0, "dim size is 0");
// dim[0] = 1;
if (dim.size() == 0) {
auto tensor = var->GetMutable<LoDTensor>();
framework::DDim dDim = {0};
......@@ -60,7 +52,7 @@ void Loader<Dtype, P>::InitMemoryFromProgram(
}
}
} else {
// TODO(codeWorm): some.
// TODO(codeWorm)
}
}
}
......@@ -68,7 +60,7 @@ void Loader<Dtype, P>::InitMemoryFromProgram(
#ifdef PADDLE_MOBILE_CL
template <>
void Loader<GPU_CL, Precision::FP32>::InitMemoryFromProgram(
void Loader<GPU_CL, float>::InitMemoryFromProgram(
const std::shared_ptr<ProgramDesc> &originProgramDesc,
const std::shared_ptr<Scope> &scope) {
for (const auto &block : originProgramDesc.get()->Blocks()) {
......@@ -77,7 +69,6 @@ void Loader<GPU_CL, Precision::FP32>::InitMemoryFromProgram(
if (var_desc->Type() == VARTYPE_TYPE_LOD_TENSOR) {
if (var_desc->Persistable()) {
auto dim = var_desc->Tensor_desc().Dims();
// auto tensor = var->GetMutable<LoDTensor>();
auto cl_image = var->GetMutable<framework::CLImage>();
cl_image->Resize(make_ddim(dim));
} else {
......@@ -88,14 +79,13 @@ void Loader<GPU_CL, Precision::FP32>::InitMemoryFromProgram(
cl_image->Resize(make_ddim(dim));
}
} else {
// TODO(codeWorm): some.
// TODO(codeWorm)
}
}
}
}
template <>
const Program<GPU_CL, Precision::FP32>
Loader<GPU_CL, Precision::FP32>::LoadCombinedMemory(
const Program<GPU_CL, float> Loader<GPU_CL, float>::LoadCombinedMemory(
size_t read_size, const uint8_t *buf, size_t combined_params_len,
uint8_t *combined_params_buf, bool optimize, bool quantification) {
bool can_add_split = false;
......@@ -113,7 +103,7 @@ Loader<GPU_CL, Precision::FP32>::LoadCombinedMemory(
auto originProgramDesc = std::make_shared<ProgramDesc>(c_program);
Program<GPU_CL, Precision::FP32> program;
Program<GPU_CL, float> program;
program.combined = true;
program.originProgram = originProgramDesc;
program.quantification = quantification;
......@@ -145,16 +135,16 @@ Loader<GPU_CL, Precision::FP32>::LoadCombinedMemory(
/**
* fusion and print some info
* @tparam Dtype
* @tparam Device
* @tparam P
* @param optimize
* @param can_add_split
* @param program
* @param originProgramDesc
*/
template <typename Dtype, Precision P>
template <typename Device, typename T>
void FusionAndPrintInfos(
bool optimize, bool can_add_split, Program<Dtype, P> *program,
bool optimize, bool can_add_split, Program<Device, T> *program,
const std::shared_ptr<ProgramDesc> &originProgramDesc) {
if (optimize) {
ProgramOptimize program_optimize;
......@@ -193,22 +183,22 @@ static size_t ReadBuffer(const char *file_name, uint8_t **out) {
return cur_len;
}
template <typename Dtype, Precision P>
const Program<Dtype, P> Loader<Dtype, P>::Load(const std::string &dirname,
bool optimize,
bool quantification,
bool can_add_split) {
template <typename Device, typename T>
const Program<Device, T> Loader<Device, T>::Load(const std::string &dirname,
bool optimize,
bool quantification,
bool can_add_split) {
auto program = this->LoadProgram(dirname + "/__model__", optimize,
quantification, can_add_split);
program.model_path = dirname;
return program;
}
template <typename Dtype, Precision P>
const Program<Dtype, P> Loader<Dtype, P>::Load(const std::string &model_path,
const std::string &para_path,
bool optimize,
bool quantification) {
template <typename Device, typename T>
const Program<Device, T> Loader<Device, T>::Load(const std::string &model_path,
const std::string &para_path,
bool optimize,
bool quantification) {
auto program = this->LoadProgram(model_path, optimize, quantification);
program.para_path = para_path;
......@@ -217,8 +207,8 @@ const Program<Dtype, P> Loader<Dtype, P>::Load(const std::string &model_path,
return program;
}
template <typename Dtype, Precision P>
const Program<Dtype, P> Loader<Dtype, P>::LoadProgram(
template <typename Device, typename T>
const Program<Device, T> Loader<Device, T>::LoadProgram(
const std::string &model_path, bool optimize, bool quantification,
bool can_add_split) {
std::string model_filename = model_path;
......@@ -237,7 +227,7 @@ const Program<Dtype, P> Loader<Dtype, P>::LoadProgram(
//
auto originProgramDesc = std::make_shared<ProgramDesc>(c_program);
Program<Dtype, P> program;
Program<Device, T> program;
program.originProgram = originProgramDesc;
program.quantification = quantification;
program.combined_params_len = 0;
......@@ -254,8 +244,8 @@ const Program<Dtype, P> Loader<Dtype, P>::LoadProgram(
return program;
}
template <typename Dtype, Precision P>
const Program<Dtype, P> Loader<Dtype, P>::LoadCombinedMemory(
template <typename Device, typename T>
const Program<Device, T> Loader<Device, T>::LoadCombinedMemory(
size_t read_size, const uint8_t *buf, size_t combined_params_len,
uint8_t *combined_params_buf, bool optimize, bool quantification) {
bool can_add_split = false;
......@@ -273,7 +263,7 @@ const Program<Dtype, P> Loader<Dtype, P>::LoadCombinedMemory(
auto originProgramDesc = std::make_shared<ProgramDesc>(c_program);
Program<Dtype, P> program;
Program<Device, T> program;
program.combined = true;
program.originProgram = originProgramDesc;
program.quantification = quantification;
......@@ -289,13 +279,13 @@ const Program<Dtype, P> Loader<Dtype, P>::LoadCombinedMemory(
return program;
}
template class Loader<CPU, Precision::FP32>;
template class Loader<CPU, float>;
template class Loader<FPGA, Precision::FP32>;
template class Loader<FPGA, float>;
template class Loader<GPU_MALI, Precision::FP32>;
template class Loader<GPU_MALI, float>;
template class Loader<GPU_CL, Precision::FP32>;
template class Loader<GPU_CL, float>;
} // namespace framework
} // namespace paddle_mobile
......@@ -22,39 +22,39 @@ limitations under the License. */
namespace paddle_mobile {
namespace framework {
template <typename Dtype = CPU, Precision P = Precision::FP32>
template <typename Device = CPU, typename T = float>
class Loader {
public:
/*
* @b load separate format fluid model
* @b load the fluid model stored in separate form
* @b load a fluid model stored as separate files
* */
const Program<Dtype, P> Load(const std::string &dirname,
bool optimize = false,
bool quantification = false,
bool can_add_split = false);
const Program<Device, T> Load(const std::string &dirname,
bool optimize = false,
bool quantification = false,
bool can_add_split = false);
/*
* @b load combined format fluid model
* @b load a model in combined format
* @b load a fluid model stored in a single file
* */
const Program<Dtype, P> Load(const std::string &model_path,
const std::string &para_path,
bool optimize = false,
bool quantification = false);
const Program<Device, T> Load(const std::string &model_path,
const std::string &para_path,
bool optimize = false,
bool quantification = false);
const Program<Dtype, P> LoadCombinedMemory(size_t model_len,
const uint8_t *model_buf,
size_t combined_params_len,
uint8_t *combined_params_buf,
bool optimize = false,
bool quantification = false);
const Program<Device, T> LoadCombinedMemory(size_t model_len,
const uint8_t *model_buf,
size_t combined_params_len,
uint8_t *combined_params_buf,
bool optimize = false,
bool quantification = false);
private:
const Program<Dtype, P> LoadProgram(const std::string &model_path,
bool optimize = false,
bool quantification = false,
bool can_add_split = false);
const Program<Device, T> LoadProgram(const std::string &model_path,
bool optimize = false,
bool quantification = false,
bool can_add_split = false);
void InitMemoryFromProgram(
const std::shared_ptr<ProgramDesc> &originProgramDesc,
......
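A hedged sketch of the two loading paths declared above; both model paths are hypothetical:

```cpp
paddle_mobile::framework::Loader<paddle_mobile::CPU, float> loader;

// Separate-files format: a directory holding __model__ plus one file per
// parameter.
auto separate = loader.Load("./mobilenet", /*optimize=*/true);

// Combined format: a single model file plus a single parameter file.
auto combined = loader.Load("./model", "./params", /*optimize=*/true);
```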
......@@ -16,12 +16,12 @@ limitations under the License. */
#include <memory>
#include <string>
#include <utility>
#include <vector>
#include "tensor.h"
#include "tensor_util.h"
#include "framework/tensor.h"
#include "framework/tensor_util.h"
namespace paddle_mobile {
namespace framework {
/*
......@@ -202,5 +202,29 @@ void SerializeToStream(std::ostream &os, const LoDTensor &tensor);
void DeserializeFromStream(std::istream &is, LoDTensor *tensor);
#ifdef PADDLE_MOBILE_DEBUG
inline Print &operator<<(Print &printer, const LoDTensor &tensor) {
printer << " dims: " << tensor.dims() << "\n";
int stride = tensor.numel() / 20;
stride = stride > 0 ? stride : 1;
#ifndef PADDLE_MOBILE_FPGA
for (int i = 0; i < tensor.numel(); i += stride) {
if (tensor.type() == typeid(float)) {
printer << tensor.data<float>()[i] << " ";
} else if (tensor.type() == typeid(int32_t)) {
printer << tensor.data<int32_t>()[i] << " ";
} else if (tensor.type() == typeid(int64_t)) {
printer << tensor.data<int64_t>()[i] << " ";
} else if (tensor.type() == typeid(int8_t)) {
printer << static_cast<int>(tensor.data<int8_t>()[i]) << " ";
}
}
#endif // PADDLE_MOBILE_FPGA
return printer;
}
#endif // PADDLE_MOBILE_DEBUG
} // namespace framework
} // namespace paddle_mobile
......@@ -98,24 +98,6 @@ class OpRegistry {
}
};
#define REGISTER_OPERATOR_INT8(op_type, op_class, device_name, device_type) \
template class op_class<device_type, int8_t>; \
template <typename Dtype, typename T> \
class _OpClass_##op_type##_##device_name : public op_class<Dtype, T> { \
public: \
DEFINE_OP_CONSTRUCTOR(_OpClass_##op_type##_##device_name, op_class); \
}; \
static paddle_mobile::framework::OperatorRegistrar< \
device_type, _OpClass_##op_type##_##device_name<device_type, int8_t>> \
__op_registrar_##op_type##_##device_name(#op_type); \
int TouchOpRegistrar_##op_type##_##device_name() { \
__op_registrar_##op_type##_##device_name.Touch(); \
return 0; \
}
#define REGISTER_OPERATOR_CPU_INT8(op_type, op_class) \
REGISTER_OPERATOR_INT8(op_type, op_class, cpu, paddle_mobile::CPU);
#define REGISTER_OPERATOR(op_type, op_class, device_name, device_type) \
template class op_class<device_type, float>; \
template <typename Dtype, typename T> \
......
......@@ -220,7 +220,16 @@ void Node::Folder(
}
} else {
for (auto &op_output : this->op_desc_->outputs_) {
op_desc->outputs_.emplace(op_output.first, op_output.second);
auto output_key = op_output.first;
if (change->find(this->type_) != change->end()) {
const auto change_pairs = (*change)[this->type_];
for (const auto &target : change_pairs) {
if (target.first == output_key) {
output_key = target.second;
}
}
}
op_desc->outputs_.emplace(output_key, op_output.second);
}
for (auto &output : this->outputs_) {
......
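The loop above reads the `change` table as a map from op type to (old output key, new output key) pairs; a hedged sketch of such a table, inferred from that usage (the batch_norm entry is purely illustrative):

```cpp
std::unordered_map<std::string,
                   std::vector<std::pair<std::string, std::string>>>
    change = {
        // when batch_norm is folded away, rename its "Y" output to "Out"
        {"batch_norm", {{"Y", "Out"}}},
    };
```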
......@@ -14,16 +14,15 @@ limitations under the License. */
#pragma once
#include <string>
#include "common/types.h"
#include "framework/program/program_desc.h"
#include "framework/scope.h"
#include <string>
namespace paddle_mobile {
namespace framework {
template <typename Dtype, Precision P = Precision::FP32>
template <typename Device, typename T = float>
class Program {
public:
std::shared_ptr<ProgramDesc> originProgram;
......
......@@ -26,6 +26,7 @@ limitations under the License. */
namespace paddle_mobile {
namespace framework {
class Scope {
public:
Scope() = default;
......
......@@ -148,8 +148,8 @@ class Tensor : public TensorBase {
PADDLE_MOBILE_ENFORCE(
(std::is_same<T, void>::value ||
holder_->type().hash_code() == typeid(T).hash_code()),
"Tensor holds the wrong type, it holds %s",
this->holder_->type().name());
"Tensor holds the wrong type, it holds %s, requested %s",
this->holder_->type().name(), typeid(T).name());
return reinterpret_cast<T *>(reinterpret_cast<uintptr_t>(holder_->ptr()) +
offset_);
......@@ -162,7 +162,7 @@ class Tensor : public TensorBase {
PADDLE_MOBILE_ENFORCE(
(std::is_same<T, void>::value ||
holder_->type().hash_code() == typeid(T).hash_code()),
"Tensor holds the wrong type, it holds %s ,requested:%s",
"Tensor holds the wrong type, it holds %s, requested %s",
this->holder_->type().name(), typeid(T).name());
return reinterpret_cast<const T *>(
......@@ -226,7 +226,6 @@ inline Print &operator<<(Print &printer, const Tensor &tensor) {
}
}
#endif
return printer;
}
......
......@@ -18,17 +18,17 @@
namespace paddle_mobile {
template <typename Dtype, Precision P>
PaddleMobilePredictor<Dtype, P>::PaddleMobilePredictor(
template <typename Device, typename T>
PaddleMobilePredictor<Device, T>::PaddleMobilePredictor(
const PaddleMobileConfig &config) {
PADDLE_MOBILE_ENFORCE(Init(config) == true,
"paddle mobile predictor init failed!");
config_ = config;
}
template <typename Dtype, Precision P>
bool PaddleMobilePredictor<Dtype, P>::Init(const PaddleMobileConfig &config) {
paddle_mobile_.reset(new PaddleMobile<Dtype, P>());
template <typename Device, typename T>
bool PaddleMobilePredictor<Device, T>::Init(const PaddleMobileConfig &config) {
paddle_mobile_.reset(new PaddleMobile<Device, T>());
#ifdef PADDLE_MOBILE_CL
paddle_mobile_->SetCLPath(config.cl_path);
#endif
......@@ -52,8 +52,8 @@ bool PaddleMobilePredictor<Dtype, P>::Init(const PaddleMobileConfig &config) {
paddle_mobile_->SetThreadNum(config.thread_num);
return true;
}
template <typename Dtype, Precision P>
bool PaddleMobilePredictor<Dtype, P>::Run(
template <typename Device, typename T>
bool PaddleMobilePredictor<Device, T>::Run(
const std::vector<PaddleTensor> &inputs,
std::vector<PaddleTensor> *output_data, int batch_size) {
if (inputs.empty()) {
......@@ -78,12 +78,12 @@ bool PaddleMobilePredictor<Dtype, P>::Run(
framework::Tensor input_tensor;
input_tensor.Resize(ddim);
int input_length = framework::product(ddim);
typedef typename PrecisionTrait<P>::ptype PType;
auto input_ptr = input_tensor.mutable_data<PType>();
auto input_ptr = input_tensor.mutable_data<T>();
memcpy(input_ptr, static_cast<PType *>(input.data.data()),
input_length * sizeof(PType));
auto output_tensor = paddle_mobile_->Predict(input_tensor);
memcpy(input_ptr, static_cast<T *>(input.data.data()),
input_length * sizeof(T));
paddle_mobile_->Predict(input_tensor);
auto output_tensor = paddle_mobile_->Fetch();
if (output_data->empty()) {
LOG(kLOG_ERROR) << "At least one output should be set with tensors' names.";
......@@ -99,18 +99,18 @@ bool PaddleMobilePredictor<Dtype, P>::Run(
output.shape.push_back(static_cast<int>(d));
}
if (output.data.length() < output_length * sizeof(PType)) {
output.data.Resize(output_length * sizeof(PType));
if (output.data.length() < output_length * sizeof(T)) {
output.data.Resize(output_length * sizeof(T));
}
memcpy(output.data.data(), output_tensor->template data<PType>(),
output_length * sizeof(PType));
memcpy(output.data.data(), output_tensor->template data<T>(),
output_length * sizeof(T));
return true;
}
template <typename Dtype, Precision P>
PaddleMobilePredictor<Dtype, P>::~PaddleMobilePredictor() {
template <typename Device, typename T>
PaddleMobilePredictor<Device, T>::~PaddleMobilePredictor() {
paddle_mobile_->Clear();
}
......@@ -122,13 +122,13 @@ CreatePaddlePredictor<PaddleMobileConfig, PaddleEngineKind::kPaddleMobile>(
std::unique_ptr<PaddlePredictor> x;
if (config.precision == PaddleMobileConfig::FP32) {
if (config.device == PaddleMobileConfig::kCPU) {
x.reset(new PaddleMobilePredictor<CPU, Precision::FP32>(config));
x.reset(new PaddleMobilePredictor<CPU, float>(config));
} else if (config.device == PaddleMobileConfig::kFPGA) {
x.reset(new PaddleMobilePredictor<FPGA, Precision::FP32>(config));
x.reset(new PaddleMobilePredictor<FPGA, float>(config));
} else if (config.device == PaddleMobileConfig::kGPU_MALI) {
x.reset(new PaddleMobilePredictor<GPU_MALI, Precision::FP32>(config));
x.reset(new PaddleMobilePredictor<GPU_MALI, float>(config));
} else if (config.device == PaddleMobileConfig::kGPU_CL) {
x.reset(new PaddleMobilePredictor<GPU_CL, Precision::FP32>(config));
x.reset(new PaddleMobilePredictor<GPU_CL, float>(config));
} else {
LOG(kLOG_ERROR) << "unsupport device type!";
return nullptr;
......
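A minimal usage sketch for the predictor API above. The header path, the model_dir field, and the input shape are assumptions for illustration; only Run()'s contract (at least one named output slot) comes from the code itself.
#include <vector>
#include "io/paddle_inference_api.h"  // assumed public API header
int main() {
  paddle_mobile::PaddleMobileConfig config;
  config.precision = paddle_mobile::PaddleMobileConfig::FP32;
  config.device = paddle_mobile::PaddleMobileConfig::kCPU;
  config.model_dir = "./mobilenet";  // hypothetical model directory
  config.thread_num = 4;
  auto predictor = paddle_mobile::CreatePaddlePredictor<
      paddle_mobile::PaddleMobileConfig,
      paddle_mobile::PaddleEngineKind::kPaddleMobile>(config);
  std::vector<float> buf(1 * 3 * 224 * 224, 0.f);  // fill with real data
  paddle_mobile::PaddleTensor input;
  input.shape = {1, 3, 224, 224};
  input.data = paddle_mobile::PaddleBuf(buf.data(), buf.size() * sizeof(float));
  input.dtype = paddle_mobile::PaddleDType::FLOAT32;
  paddle_mobile::PaddleTensor output;
  output.name = "fetch";  // Run() requires at least one named output slot
  std::vector<paddle_mobile::PaddleTensor> outputs = {output};
  predictor->Run({input}, &outputs);  // results land in outputs[0].data
  return 0;
}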
......@@ -29,7 +29,7 @@ limitations under the License. */
namespace paddle_mobile {
template <typename Dtype = CPU, Precision P = Precision::FP32>
template <typename Device = CPU, typename T = float>
class PaddleMobilePredictor : public PaddlePredictor {
public:
PaddleMobilePredictor() = delete;
......@@ -43,7 +43,7 @@ class PaddleMobilePredictor : public PaddlePredictor {
~PaddleMobilePredictor() override;
private:
std::unique_ptr<PaddleMobile<Dtype, P>> paddle_mobile_;
std::unique_ptr<PaddleMobile<Device, T>> paddle_mobile_;
bool Init(const PaddleMobileConfig& config);
PaddleMobileConfig config_;
......
......@@ -59,7 +59,7 @@
@interface PaddleMobileCPU()
{
paddle_mobile::PaddleMobile<paddle_mobile::CPU, paddle_mobile::Precision::FP32> *pam_;
paddle_mobile::PaddleMobile<paddle_mobile::CPU, float> *pam_;
BOOL loaded_;
}
......@@ -73,7 +73,7 @@ static std::mutex shared_mutex;
- (instancetype)initWithConfig:(PaddleMobileCPUConfig *)config {
if (self = [super init]) {
pam_ = new paddle_mobile::PaddleMobile<paddle_mobile::CPU, paddle_mobile::Precision::FP32>();
pam_ = new paddle_mobile::PaddleMobile<paddle_mobile::CPU, float>();
_config = config;
}
return self;
......@@ -82,6 +82,7 @@ static std::mutex shared_mutex;
-(instancetype)init {
if (self = [super init]) {
_config = [[PaddleMobileCPUConfig alloc] init];
pam_ = new paddle_mobile::PaddleMobile<paddle_mobile::CPU, float>();
}
return self;
}
......@@ -246,7 +247,8 @@ static std::mutex shared_mutex;
memcpy(input_ptr, input,
numel * sizeof(float));
std::shared_ptr<paddle_mobile::framework::Tensor> output = pam_->Predict(input_tensor);
pam_->Predict(input_tensor);
std::shared_ptr<paddle_mobile::framework::Tensor> output = pam_->Fetch();
float *output_pointer = new float[output->numel()];
......
......@@ -16,21 +16,23 @@ limitations under the License. */
#include "paddle_mobile_jni.h"
#include <cmath>
#include <string>
#include <vector>
#include "common/log.h"
#include "framework/tensor.h"
#include "io/paddle_mobile.h"
#ifdef ENABLE_EXCEPTION
#include "common/enforce.h"
#endif
#ifdef __cplusplus
extern "C" {
#endif
namespace paddle_mobile {
namespace jni {
using framework::DDim;
using framework::Program;
using framework::Tensor;
......@@ -200,7 +202,8 @@ JNIEXPORT jfloatArray JNICALL Java_com_baidu_paddle_PML_predictImage(
for (int i = 0; i < length; i++) {
input_ptr[i] = dataPointer[i];
}
auto output = getPaddleMobileInstance()->Predict(input);
getPaddleMobileInstance()->Predict(input);
auto output = getPaddleMobileInstance()->Fetch();
count = output->numel();
result = env->NewFloatArray(count);
env->SetFloatArrayRegion(result, 0, count, output->data<float>());
......@@ -233,7 +236,8 @@ JNIEXPORT jfloatArray JNICALL Java_com_baidu_paddle_PML_predictImage(
for (int i = 0; i < length; i++) {
input_ptr[i] = dataPointer[i];
}
auto output = getPaddleMobileInstance()->Predict(input);
getPaddleMobileInstance()->Predict(input);
auto output = getPaddleMobileInstance()->Fetch();
count = output->numel();
result = env->NewFloatArray(count);
env->SetFloatArrayRegion(result, 0, count, output->data<float>());
......@@ -328,7 +332,8 @@ JNIEXPORT jfloatArray JNICALL Java_com_baidu_paddle_PML_predictYuv(
for (int i = 0; i < length; i++) {
input_ptr[i] = matrix[i];
}
auto output = getPaddleMobileInstance()->Predict(input);
getPaddleMobileInstance()->Predict(input);
auto output = getPaddleMobileInstance()->Fetch();
count = output->numel();
result = env->NewFloatArray(count);
env->SetFloatArrayRegion(result, 0, count, output->data<float>());
......@@ -363,7 +368,8 @@ JNIEXPORT jfloatArray JNICALL Java_com_baidu_paddle_PML_predictYuv(
for (int i = 0; i < length; i++) {
input_ptr[i] = matrix[i];
}
auto output = getPaddleMobileInstance()->Predict(input);
getPaddleMobileInstance()->Predict(input);
auto output = getPaddleMobileInstance()->Fetch();
count = output->numel();
result = env->NewFloatArray(count);
env->SetFloatArrayRegion(result, 0, count, output->data<float>());
......@@ -399,7 +405,8 @@ Java_com_baidu_paddle_PML_predictLod(JNIEnv *env, jclass thiz, jlongArray buf) {
auto *pdata = words.mutable_data<int64_t>();
size_t n = words.numel() * sizeof(int64_t);
memcpy(pdata, ids.data(), n);
auto vec_result = paddle_mobile.PredictLod(words);
paddle_mobile.Predict(words);
auto vec_result = paddle_mobile.Fetch();
int count = vec_result->numel();
jlongArray result = NULL;
ANDROIDLOGE("predict nlp size %d", count);
......
......@@ -13,66 +13,68 @@ See the License for the specific language governing permissions and
limitations under the License. */
#include "io/paddle_mobile.h"
#include <utility>
#include "common/common.h"
#ifdef PADDLE_MOBILE_CL
#include <CL/cl.h>
#include "framework/cl/cl_tensor.h"
#endif
#include "common/common.h"
#include "operators/math/gemm.h"
namespace paddle_mobile {
template <typename Dtype, Precision P>
void PaddleMobile<Dtype, P>::SetThreadNum(int num) {
template <typename Device, typename T>
void PaddleMobile<Device, T>::SetThreadNum(int num) {
#ifdef _OPENMP
omp_set_num_threads(num);
#endif
}
template <typename Dtype, Precision P>
bool PaddleMobile<Dtype, P>::Load(const std::string &dirname, bool optimize,
bool quantification, int batch_size,
bool loddable) {
template <typename Device, typename T>
PMStatus PaddleMobile<Device, T>::Load(const std::string &dirname,
bool optimize, bool quantification,
int batch_size, bool loddable) {
if (loader_.get() == nullptr) {
loader_ = std::make_shared<framework::Loader<Dtype, P>>();
loader_ = std::make_shared<framework::Loader<Device, T>>();
} else {
LOG(kLOG_INFO) << "loader inited";
}
if (executor_.get() == nullptr) {
executor_ = std::make_shared<framework::Executor<Dtype, P>>(
executor_ = std::make_shared<framework::Executor<Device, T>>(
loader_->Load(dirname, optimize, quantification), batch_size, optimize,
loddable);
} else {
LOG(kLOG_INFO) << "executor inited";
}
return true;
return PMSuccess;
}
template <typename Dtype, Precision P>
bool PaddleMobile<Dtype, P>::Load(const std::string &model_path,
const std::string &para_path, bool optimize,
bool quantification, int batch_size,
bool loddable) {
template <typename Device, typename T>
PMStatus PaddleMobile<Device, T>::Load(const std::string &model_path,
const std::string &para_path,
bool optimize, bool quantification,
int batch_size, bool loddable) {
if (loader_.get() == nullptr) {
loader_ = std::make_shared<framework::Loader<Dtype, P>>();
loader_ = std::make_shared<framework::Loader<Device, T>>();
} else {
LOG(kLOG_INFO) << "loader inited";
}
if (executor_.get() == nullptr) {
executor_ = std::make_shared<framework::Executor<Dtype, P>>(
executor_ = std::make_shared<framework::Executor<Device, T>>(
loader_->Load(model_path, para_path, optimize, quantification),
batch_size, optimize, loddable);
} else {
LOG(kLOG_INFO) << "executor inited";
}
return true;
return PMSuccess;
}
template <typename Dtype, Precision P>
bool PaddleMobile<Dtype, P>::LoadCombinedMemory(size_t model_len,
template <typename Device, typename T>
bool PaddleMobile<Device, T>::LoadCombinedMemory(size_t model_len,
const uint8_t *model_buf,
size_t combined_params_len,
uint8_t *combined_params_buf,
......@@ -80,13 +82,12 @@ bool PaddleMobile<Dtype, P>::LoadCombinedMemory(size_t model_len,
bool quantification, int batch_size,
bool loddable) {
if (loader_.get() == nullptr) {
loader_ = std::make_shared<framework::Loader<Dtype, P>>();
loader_ = std::make_shared<framework::Loader<Device, T>>();
} else {
LOG(kLOG_INFO) << "loader inited";
}
if (executor_.get() == nullptr) {
executor_ = std::make_shared<framework::Executor<Dtype, P>>(
executor_ = std::make_shared<framework::Executor<Device, T>>(
loader_->LoadCombinedMemory(model_len, model_buf, combined_params_len,
combined_params_buf, optimize,
quantification),
......@@ -95,38 +96,76 @@ bool PaddleMobile<Dtype, P>::LoadCombinedMemory(size_t model_len,
LOG(kLOG_INFO) << "executor inited";
}
return true;
return PMSuccess;
}
template <typename Dtype, Precision P>
std::shared_ptr<framework::Tensor> PaddleMobile<Dtype, P>::Predict(
const framework::Tensor &t) {
return executor_->Predict(t);
template <typename Device, typename T>
PMStatus PaddleMobile<Device, T>::Predict(const framework::Tensor &input) {
std::vector<std::pair<std::string, framework::Tensor>> inputs;
inputs.push_back(std::make_pair("feed", input));
return this->Predict(inputs);
}
template <typename Device, typename T>
PMStatus PaddleMobile<Device, T>::Predict(const framework::LoDTensor &input) {
std::vector<std::pair<std::string, framework::LoDTensor>> inputs;
inputs.push_back(std::make_pair("feed", input));
return this->Predict(inputs);
}
template <typename Dtype, Precision P>
std::shared_ptr<framework::Tensor> PaddleMobile<Dtype, P>::PredictLod(
const framework::LoDTensor &t) {
return executor_->PredictLod(t);
template <typename Device, typename T>
PMStatus PaddleMobile<Device, T>::Predict(
const std::vector<std::pair<std::string, framework::Tensor>> &inputs) {
return executor_->Predict(inputs);
}
template <typename Dtype, Precision P>
std::vector<typename PaddleMobile<Dtype, P>::Ptype>
PaddleMobile<Dtype, P>::Predict(const std::vector<Ptype> &input,
const std::vector<int64_t> &dims) {
template <typename Device, typename T>
PMStatus PaddleMobile<Device, T>::Predict(
const std::vector<std::pair<std::string, framework::LoDTensor>> &inputs) {
return executor_->Predict(inputs);
}
template <typename Device, typename T>
std::vector<T> PaddleMobile<Device, T>::Predict(
const std::vector<T> &input, const std::vector<int64_t> &dims) {
return executor_->Predict(input, dims);
}
template <typename Dtype, Precision P>
void PaddleMobile<Dtype, P>::Clear() {
template <typename Device, typename T>
PMStatus PaddleMobile<Device, T>::Predict() {
return executor_->Predict();
}
template <typename Device, typename T>
void PaddleMobile<Device, T>::Feed(const framework::Tensor &input,
const std::string &var_name) {
executor_->SetInput(input, var_name);
}
template <typename Device, typename T>
void PaddleMobile<Device, T>::Feed(const framework::LoDTensor &input,
const std::string &var_name) {
executor_->SetInput(input, var_name);
}
typedef std::shared_ptr<framework::LoDTensor> LoDTensorPtr;
template <typename Device, typename T>
LoDTensorPtr PaddleMobile<Device, T>::Fetch(const std::string &var_name) {
return executor_->GetOutput(var_name);
}
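Taken together, a minimal sketch of the new Feed/Predict/Fetch flow (model directory and input shape are placeholders):
paddle_mobile::PaddleMobile<paddle_mobile::CPU, float> engine;
if (engine.Load("./model_dir", /*optimize=*/true) == paddle_mobile::PMSuccess) {
  paddle_mobile::framework::Tensor input;
  input.Resize(paddle_mobile::framework::make_ddim({1, 3, 224, 224}));
  input.mutable_data<float>();   // fill with real input data in practice
  engine.Feed(input, "feed");    // Predict(input) would feed and run in one call
  engine.Predict();              // run the whole program
  auto output = engine.Fetch();  // shared_ptr<framework::LoDTensor> of "fetch"
  const float *result = output->data<float>();
}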
template <typename Device, typename T>
void PaddleMobile<Device, T>::Clear() {
executor_ = nullptr;
loader_ = nullptr;
}
template <typename Dtype, Precision P>
double PaddleMobile<Dtype, P>::GetPredictTime() {}
template <typename Device, typename T>
double PaddleMobile<Device, T>::GetPredictTime() { return 0.0; }  // specialized for CPU and GPU_CL below
#ifdef PADDLE_MOBILE_CPU
template <>
double PaddleMobile<CPU, Precision::FP32>::GetPredictTime() {
double PaddleMobile<CPU, float>::GetPredictTime() {
int m = 32;
int n = 224 * 224;
int k = 27;
......@@ -142,14 +181,13 @@ double PaddleMobile<CPU, Precision::FP32>::GetPredictTime() {
int t1 = 1;
int t2 = 1;
for (int i = 0; i < m * k; ++i) {
unsigned int seed = 100;
a[i] = t1 + rand_r(&seed) % t2;
a[i] = t1 + rand() % t2; // NOLINT
}
for (int i = 0; i < k * n; ++i) {
unsigned int seed = 200;
b[i] = t1 + rand_r(&seed) % t2;
b[i] = t1 + rand() % t2; // NOLINT
}
paddle_mobile::operators::math::Gemm gemm;
operators::math::Gemm gemm;
auto time1 = paddle_mobile::time();
gemm.Sgemm(m, n, k, static_cast<float>(1), a, lda, b, ldb,
static_cast<float>(0), c, ldc, false,
......@@ -163,57 +201,51 @@ double PaddleMobile<CPU, Precision::FP32>::GetPredictTime() {
}
#endif
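For scale: the Sgemm timed above multiplies a 32x27 matrix by a 27x50176 one (m = 32, k = 27 = 3*3*3, n = 224*224), i.e. the GEMM-lowered first 3x3, 3-to-32-channel convolution over a 224x224 image, costing 2*m*n*k = 2*32*50176*27 = 86,704,128 (about 8.7e7) floating-point operations per call.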
template <typename Dtype, Precision P>
PaddleMobile<Dtype, P>::~PaddleMobile() {
executor_ = nullptr;
loader_ = nullptr;
}
#ifdef PADDLE_MOBILE_FPGA
template <typename Dtype, Precision P>
void PaddleMobile<Dtype, P>::InjectVariable(const framework::Tensor &t,
std::string var_name) {
template <typename Device, typename T>
void PaddleMobile<Device, T>::InjectVariable(const framework::Tensor &t,
std::string var_name) {
executor_->InjectVariable(t, var_name);
}
template <typename Dtype, Precision P>
void PaddleMobile<Dtype, P>::FeedData(const framework::Tensor &t) {
template <typename Device, typename T>
void PaddleMobile<Device, T>::FeedData(const framework::Tensor &t) {
executor_->FeedData(t);
}
template <typename Dtype, Precision P>
std::shared_ptr<framework::Tensor> PaddleMobile<Dtype, P>::FetchResult(int id) {
template <typename Device, typename T>
std::shared_ptr<framework::Tensor> PaddleMobile<Device, T>::FetchResult(
int id) {
return executor_->FetchResult(id);
}
template <typename Dtype, Precision P>
void PaddleMobile<Dtype, P>::Predict_From_To(int start, int end) {
template <typename Device, typename T>
void PaddleMobile<Device, T>::Predict_From_To(int start, int end) {
executor_->Predict_From_To(start, end);
}
template <typename Dtype, Precision P>
void PaddleMobile<Dtype, P>::Predict_From(int start) {
template <typename Device, typename T>
void PaddleMobile<Device, T>::Predict_From(int start) {
executor_->Predict_From(start);
}
template <typename Dtype, Precision P>
void PaddleMobile<Dtype, P>::Predict_To(int end) {
template <typename Device, typename T>
void PaddleMobile<Device, T>::Predict_To(int end) {
executor_->Predict_To(end);
}
#endif
#ifdef PADDLE_MOBILE_CL
static std::mutex lc;
template <typename Dtype, Precision P>
void PaddleMobile<Dtype, P>::SetCLPath(std::string path) {
template <typename Device, typename T>
void PaddleMobile<Device, T>::SetCLPath(std::string path) {
std::lock_guard<std::mutex> lock(lc);
if (framework::CLEngine::Instance()->GetCLPath() == "") {
framework::CLEngine::Instance()->setClPath(path);
}
}
template <>
double PaddleMobile<GPU_CL, Precision::FP32>::GetPredictTime() {
double PaddleMobile<GPU_CL, float>::GetPredictTime() {
cl_int status;
cl_uint nPlatform;
clGetPlatformIDs(0, NULL, &nPlatform);
......@@ -411,8 +443,8 @@ double PaddleMobile<GPU_CL, Precision::FP32>::GetPredictTime() {
return -1;
}
}
template <typename Dtype, Precision P>
int PaddleMobile<Dtype, P>::readText(
template <typename Device, typename T>
int PaddleMobile<Device, T>::readText(
const char *kernelPath,
char **pcode) {  // reads a text file into pcode; returns the string length
FILE *fp;
......@@ -441,13 +473,11 @@ int PaddleMobile<Dtype, P>::readText(
fclose(fp);
return size + 1;
}
#endif
template class PaddleMobile<CPU, Precision::FP32>;
template class PaddleMobile<FPGA, Precision::FP32>;
template class PaddleMobile<GPU_MALI, Precision::FP32>;
template class PaddleMobile<GPU_CL, Precision::FP32>;
template class PaddleMobile<CPU, float>;
template class PaddleMobile<FPGA, float>;
template class PaddleMobile<GPU_MALI, float>;
template class PaddleMobile<GPU_CL, float>;
} // namespace paddle_mobile
......@@ -16,6 +16,7 @@ limitations under the License. */
#include <memory>
#include <string>
#include <utility>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
......@@ -32,44 +33,53 @@ limitations under the License. */
namespace paddle_mobile {
template <typename Dtype = CPU, Precision P = Precision::FP32>
template <typename Device, typename T = float>
class PaddleMobile {
typedef typename PrecisionTrait<P>::ptype Ptype;
public:
PaddleMobile() {
#ifndef PADDLE_MOBILE_CL
bool is_gpu = std::is_same<DeviceType<kGPU_CL>, Dtype>::value;
PADDLE_MOBILE_ENFORCE(!is_gpu,
"Not Enable GPU in CmakeList but run gpu codes ");
bool is_gpu = std::is_same<DeviceType<kGPU_CL>, Device>::value;
PADDLE_MOBILE_ENFORCE(!is_gpu, "Please recompile with GPU_CL is on");
#endif
}
bool Load(const std::string &dirname, bool optimize = false,
bool quantification = false, int batch_size = 1,
bool loddable = false);
~PaddleMobile() {}
PMStatus Load(const std::string &dirname, const bool optimize = false,
const bool quantification = false, const int batch_size = 1,
const bool lod = false);
PMStatus Load(const std::string &model_path, const std::string &para_path,
const bool optimize = false, const bool quantification = false,
const int batch_size = 1, const bool lod = false);
PMStatus Predict(const framework::Tensor &input);
PMStatus Predict(const framework::LoDTensor &input);
bool Load(const std::string &model_path, const std::string &para_path,
bool optimize = false, bool quantification = false,
int batch_size = 1, bool loddable = false);
PMStatus Predict(
const std::vector<std::pair<std::string, framework::Tensor>> &inputs);
PMStatus Predict(
const std::vector<std::pair<std::string, framework::LoDTensor>> &inputs);
std::shared_ptr<framework::Tensor> Predict(const framework::Tensor &t);
std::vector<T> Predict(const std::vector<T> &input,
const std::vector<int64_t> &dims);
PMStatus Predict();
std::shared_ptr<framework::Tensor> PredictLod(const framework::LoDTensor &t);
void Feed(const framework::LoDTensor &input, const std::string &var_name);
void Feed(const framework::Tensor &input, const std::string &var_name);
std::vector<Ptype> Predict(const std::vector<Ptype> &input,
const std::vector<int64_t> &dims);
typedef std::shared_ptr<framework::LoDTensor> LoDTensorPtr;
LoDTensorPtr Fetch(const std::string &var_name);
LoDTensorPtr Fetch() { return Fetch("fetch"); }
bool LoadCombinedMemory(size_t model_len, const uint8_t *model_buf,
size_t combined_params_len,
uint8_t *combined_params_buf, bool optimize = false, bool quantification = false,
int batch_size = 1, bool loddable = false);
void SetThreadNum(int num);
void SetThreadNum(int count);
void Clear();
double GetPredictTime();
~PaddleMobile();
#ifdef PADDLE_MOBILE_FPGA
void InjectVariable(const framework::Tensor &t, std::string var_name);
void FeedData(const framework::Tensor &t);
......@@ -80,15 +90,15 @@ class PaddleMobile {
#endif
#ifdef PADDLE_MOBILE_CL
public:
public: // NOLINT
void SetCLPath(std::string cl_path);
int readText(const char *kernelPath,
char **pcode);  // reads a text file into pcode; returns the string length
#endif
private:
std::shared_ptr<framework::Loader<Dtype, P>> loader_;
std::shared_ptr<framework::Executor<Dtype, P>> executor_;
std::shared_ptr<framework::Loader<Device, T>> loader_;
std::shared_ptr<framework::Executor<Device, T>> executor_;
};
} // namespace paddle_mobile
......@@ -14,10 +14,12 @@ limitations under the License. */
#include "io/paddle_test_inference_api.h"
#include "io/paddle_mobile.h"
namespace paddle_mobile {
template <typename Dtype, Precision P>
double PaddleTester<Dtype, P>::CaculatePredictTime(std::string *cl_path) {
PaddleMobile<Dtype, P> paddle_mobile;
template <typename Device, typename T>
double PaddleTester<Device, T>::CaculatePredictTime(std::string *cl_path) {
PaddleMobile<Device, T> paddle_mobile;
#ifdef PADDLE_MOBILE_CL
if (cl_path) {
paddle_mobile.SetCLPath(*cl_path);
......@@ -26,10 +28,10 @@ double PaddleTester<Dtype, P>::CaculatePredictTime(std::string *cl_path) {
#endif
return paddle_mobile.GetPredictTime();
}
template class PaddleTester<CPU, Precision::FP32>;
template class PaddleTester<FPGA, Precision::FP32>;
template class PaddleTester<GPU_MALI, Precision::FP32>;
template class PaddleTester<CPU, float>;
template class PaddleTester<FPGA, float>;
template class PaddleTester<GPU_MALI, float>;
template class PaddleTester<GPU_CL, Precision::FP32>;
template class PaddleTester<GPU_CL, float>;
} // namespace paddle_mobile
......@@ -20,10 +20,13 @@ limitations under the License. */
*/
#pragma once
#include "common/types.h"
#include "string"
namespace paddle_mobile {
template <typename Dtype, Precision P = Precision::FP32>
template <typename Device, typename T = float>
class PaddleTester {
public:
double CaculatePredictTime(std::string *cl_path = nullptr);
......
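A usage sketch for the tester above (the OpenCL kernel path is a placeholder):
paddle_mobile::PaddleTester<paddle_mobile::CPU, float> tester;
double cpu_cost = tester.CaculatePredictTime();  // CPU: times the reference Sgemm
// OpenCL builds should pass the kernel path first:
// std::string cl_path = "/data/local/tmp";  // hypothetical
// double cl_cost = paddle_mobile::PaddleTester<paddle_mobile::GPU_CL, float>()
//                      .CaculatePredictTime(&cl_path);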
......@@ -12,26 +12,25 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
#ifdef CAST_OP
#ifdef FUSION_DEQUANT_ADD_BN_OP
#include "framework/operator.h"
#include "operators/op_param.h"
#include "operators/cast_op.h"
namespace paddle_mobile {
namespace operators {
template <typename DeviceType, typename T>
class FusionDequantAddBNKernel
: public framework::OpKernelBase<DeviceType,
FusionDequantAddBNParam<DeviceType>> {
public:
void Compute(const FusionDequantAddBNParam<DeviceType> &param);
bool Init(FusionDequantAddBNParam<DeviceType> *param);
};
void CastOp<DeviceType, T>::InferShape() const {
const auto &dims = this->param_.input_->dims();
this->param_.output_->Resize(dims);
}
} // namespace operators
} // namespace paddle_mobile
namespace ops = paddle_mobile::operators;
#ifdef PADDLE_MOBILE_CPU
REGISTER_OPERATOR_CPU(cast, ops::CastOp);
#endif
#endif // CAST_OP
......@@ -12,39 +12,34 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef FUSION_FC_INT8_OP
#ifdef CAST_OP
#pragma once
#include <string>
#include <vector>
#include "framework/operator.h"
#include "framework/program/program-optimize/fusion_op_register.h"
#include "operators/kernel/fusion_fc_kernel.h"
#include "operators/kernel/kernels.h"
#include "operators/op_param.h"
namespace paddle_mobile {
namespace operators {
template <typename DeviceType, typename T>
class FusionFcInt8Op
: public framework::OperatorWithKernel<DeviceType,
FusionFcParam<DeviceType>,
FusionFcKernel<DeviceType, T>> {
class CastOp : public framework::OperatorWithKernel<
DeviceType, CastParam<DeviceType>,
operators::CastKernel<DeviceType, T>> {
public:
FusionFcInt8Op(const std::string &type, const VariableNameMap &inputs,
const VariableNameMap &outputs,
const framework::AttributeMap &attrs,
std::shared_ptr<framework::Scope> scope)
: framework::OperatorWithKernel<DeviceType, FusionFcParam<DeviceType>,
FusionFcKernel<DeviceType, T>>(
CastOp(const std::string &type, const VariableNameMap &inputs,
const VariableNameMap &outputs, const framework::AttributeMap &attrs,
std::shared_ptr<framework::Scope> scope)
: framework::OperatorWithKernel<DeviceType, CastParam<DeviceType>,
operators::CastKernel<DeviceType, T>>(
type, inputs, outputs, attrs, scope) {}
// inference output shape
void InferShape() const override;
};
} // namespace operators
} // namespace paddle_mobile
#endif // FUSION_FC_INT8_OP
#endif // CAST_OP
......@@ -33,4 +33,4 @@ namespace ops = paddle_mobile::operators;
REGISTER_OPERATOR_CPU(dequantize, ops::DequantizeOp);
#endif
#endif
#endif // DEQUANT_OP
......@@ -44,4 +44,4 @@ class DequantizeOp
} // namespace operators
} // namespace paddle_mobile
#endif
#endif // DEQUANT_OP
......@@ -25,12 +25,11 @@ limitations under the License. */
namespace paddle_mobile {
namespace operators {
using std::string;
template <typename DeviceType, typename T>
class FillConstantOp : public framework::OperatorBase<DeviceType> {
public:
FillConstantOp(const string &type, const VariableNameMap &inputs,
FillConstantOp(const std::string &type, const VariableNameMap &inputs,
const VariableNameMap &outputs,
const framework::AttributeMap attrs,
std::shared_ptr<framework::Scope> scope)
......@@ -58,7 +57,7 @@ class FillConstantOp : public framework::OperatorBase<DeviceType> {
tensor->Resize(framework::make_ddim(param_.Shape()));
tensor->mutable_data(framework::ToTypeIndex(data_type));
math::set_constant(tensor, value);
math::SetConstant(tensor, value);
}
void Init() {}
......
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef FUSION_CONVADDRELU_INT8_OP
#include "operators/fusion_conv_add_relu_int8_op.h"
#include <vector>
#include "operators/math/conv_func.h"
namespace paddle_mobile {
namespace operators {
template <typename Dtype, typename T>
void FusionConvAddReluInt8Op<Dtype, T>::InferShape() const {
auto in_dims = this->param_.Input()->dims();
auto filter_dims = this->param_.Filter()->dims();
const std::vector<int> &strides = this->param_.Strides();
std::vector<int> paddings = this->param_.Paddings();
int groups = this->param_.Groups();
std::vector<int> dilations = this->param_.Dilations();
PADDLE_MOBILE_ENFORCE((in_dims.size() == filter_dims.size() &&
dilations.size() == paddings.size() &&
paddings.size() == strides.size()),
"ConvParam is not suitable");
std::vector<int64_t> output_shape({in_dims[0], filter_dims[0]});
for (size_t i = 0; i < strides.size(); ++i) {
output_shape.push_back(
math::ConvOutputSize(in_dims[i + 2], filter_dims[i + 2], dilations[i],
paddings[i], strides[i]));
}
framework::DDim ddim = framework::make_ddim(output_shape);
this->param_.Output()->Resize(ddim);
}
} // namespace operators
} // namespace paddle_mobile
namespace ops = paddle_mobile::operators;
#ifdef PADDLE_MOBILE_CPU
REGISTER_OPERATOR_CPU_INT8(fusion_conv_add_relu_int8,
ops::FusionConvAddReluInt8Op);
#endif
#endif // FUSION_CONVADDRELU_INT8_OP
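The per-dimension output size above follows the standard dilated-convolution formula; a self-contained sketch of what math::ConvOutputSize is assumed to compute:
// Effective kernel extent grows with dilation; padding is symmetric.
inline int ConvOutputSize(int input_size, int filter_size, int dilation,
                          int padding, int stride) {
  const int dkernel = dilation * (filter_size - 1) + 1;
  return (input_size + 2 * padding - dkernel) / stride + 1;
}
// e.g. ConvOutputSize(224, 3, /*dilation=*/1, /*padding=*/1, /*stride=*/2) == 112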
......@@ -20,7 +20,7 @@ limitations under the License. */
#include <vector>
#include "framework/operator.h"
#include "framework/program/program-optimize/fusion_op_register.h"
#include "operators/kernel/dequant_add_bn_kernel.h"
#include "operators/kernel/dequant_bn_kernel.h"
#include "operators/op_param.h"
namespace paddle_mobile {
......@@ -43,7 +43,8 @@ class FusionDequantAddBNMatcher : public framework::FusionOpMatcher {
{{"Scale", "BNScale"},
{"Mean", "BNMean"},
{"Bias", "BNBias"},
{"Variance", "BNVariance"}}}},
{"Variance", "BNVariance"},
{"Y", "Out"}}}},
removed_nodes);
}
......
......@@ -20,7 +20,7 @@ limitations under the License. */
#include <vector>
#include "framework/operator.h"
#include "framework/program/program-optimize/fusion_op_register.h"
#include "operators/kernel/dequant_bn_relu_kernel.h"
#include "operators/kernel/dequant_bn_kernel.h"
#include "operators/op_param.h"
namespace paddle_mobile {
......@@ -44,7 +44,8 @@ class FusionDequantAddBNReluMatcher : public framework::FusionOpMatcher {
{{"Scale", "BNScale"},
{"Mean", "BNMean"},
{"Bias", "BNBias"},
{"Variance", "BNVariance"}}}},
{"Variance", "BNVariance"},
{"Y", "Out"}}}},
removed_nodes);
}
......@@ -54,7 +55,7 @@ class FusionDequantAddBNReluMatcher : public framework::FusionOpMatcher {
template <typename DeviceType, typename T>
class FusionDequantAddBNReluOp
: public framework::OperatorWithKernel<
DeviceType, FusionDequantAddBNReluParam<DeviceType>,
DeviceType, FusionDequantAddBNParam<DeviceType>,
operators::FusionDequantAddBNReluKernel<DeviceType, T>> {
public:
FusionDequantAddBNReluOp(const std::string &type,
......@@ -63,7 +64,7 @@ class FusionDequantAddBNReluOp
const framework::AttributeMap &attrs,
std::shared_ptr<framework::Scope> scope)
: framework::OperatorWithKernel<
DeviceType, FusionDequantAddBNReluParam<DeviceType>,
DeviceType, FusionDequantAddBNParam<DeviceType>,
operators::FusionDequantAddBNReluKernel<DeviceType, T>>(
type, inputs, outputs, attrs, scope) {}
// inference output shape
......
......@@ -12,50 +12,51 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef FUSION_FC_INT8_OP
#include "operators/fusion_fc_int8_op.h"
#include "operators/fusion_dequant_add_bn_relu_quant_op.h"
#ifdef FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP
namespace paddle_mobile {
namespace operators {
template <typename Dtype, typename T>
void FusionFcInt8Op<Dtype, T>::InferShape() const {
auto x_dims = this->param_.InputX()->dims();
auto y_dims = this->param_.InputY()->dims();
int x_num_col_dims = this->param_.XNumColDims();
int y_num_col_dims = this->param_.YNumColDims();
assert(x_dims.size() > x_num_col_dims);
assert(y_dims.size() > y_num_col_dims);
/// (1,2,3,4) , x_num_col_dims = 2 -> (2,12)
auto x_mat_dims = framework::flatten_to_2d(x_dims, x_num_col_dims);
auto y_mat_dims = framework::flatten_to_2d(y_dims, y_num_col_dims);
void FusionDequantAddBNReluQuantOp<Dtype, T>::InferShape() const {
const auto& input_dims = this->param_.input_->dims();
this->param_.output_->Resize(input_dims);
}
assert(x_mat_dims[1] == y_mat_dims[0]);
} // namespace operators
} // namespace paddle_mobile
std::vector<int64_t> output_dims;
output_dims.reserve(
static_cast<size_t>(x_num_col_dims + y_dims.size() - y_num_col_dims));
namespace ops = paddle_mobile::operators;
REGISTER_FUSION_MATCHER(fusion_dequant_add_bn_relu_quant,
ops::FusionDequantAddBNReluQuantMatcher);
for (int i = 0; i < x_num_col_dims; ++i) {
output_dims.push_back(x_dims[i]);
}
#ifdef PADDLE_MOBILE_CPU
REGISTER_OPERATOR_CPU(fusion_dequant_add_bn_relu_quant,
ops::FusionDequantAddBNReluQuantOp);
#endif
#endif // FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP
for (int i = y_num_col_dims; i < y_dims.size(); ++i) {
output_dims.push_back(y_dims[i]);
}
#ifdef FUSION_DEQUANT_ADD_BN_QUANT_OP
namespace paddle_mobile {
namespace operators {
framework::DDim ddim = framework::make_ddim(output_dims);
this->param_.Out()->Resize(ddim);
template <typename Dtype, typename T>
void FusionDequantAddBNQuantOp<Dtype, T>::InferShape() const {
const auto& input_dims = this->param_.input_->dims();
this->param_.output_->Resize(input_dims);
}
} // namespace operators
} // namespace paddle_mobile
namespace ops = paddle_mobile::operators;
REGISTER_FUSION_MATCHER(fusion_dequant_add_bn_quant,
ops::FusionDequantAddBNQuantMatcher);
#ifdef PADDLE_MOBILE_CPU
REGISTER_OPERATOR_CPU_INT8(fusion_fc_int8, ops::FusionFcInt8Op);
REGISTER_OPERATOR_CPU(fusion_dequant_add_bn_quant,
ops::FusionDequantAddBNQuantOp);
#endif
#endif // FUSION_FC_INT8_OP
#endif // FUSION_DEQUANT_ADD_BN_QUANT_OP
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
#include <string>
#include <vector>
#include "framework/operator.h"
#include "framework/program/program-optimize/fusion_op_register.h"
#include "operators/kernel/dequant_bn_kernel.h"
#include "operators/op_param.h"
namespace paddle_mobile {
namespace operators {
#ifdef FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP
class FusionDequantAddBNReluQuantMatcher : public framework::FusionOpMatcher {
public:
FusionDequantAddBNReluQuantMatcher() {
node_ = framework::Node(G_OP_TYPE_DEQUANTIZE);
node_ > std::make_shared<framework::Node>(G_OP_TYPE_ELEMENTWISE_ADD) >
std::make_shared<framework::Node>(G_OP_TYPE_BATCHNORM) >
std::make_shared<framework::Node>(G_OP_TYPE_RELU) >
std::make_shared<framework::Node>(G_OP_TYPE_QUANTIZE);
}
void FolderNodes(
framework::Node *node,
std::vector<std::shared_ptr<framework::Node>> *removed_nodes) {
node->Folder(node_.Depth(), Type(),
{{G_OP_TYPE_ELEMENTWISE_ADD, {{"Y", "Y"}}},
{G_OP_TYPE_BATCHNORM,
{{"Scale", "BNScale"},
{"Mean", "BNMean"},
{"Bias", "BNBias"},
{"Variance", "BNVariance"},
{"Y", "Out"}}}},
removed_nodes);
}
std::string Type() { return G_OP_TYPE_FUSION_DEQUANT_ADD_BN_RELU_QUANT; }
};
template <typename DeviceType, typename T>
class FusionDequantAddBNReluQuantOp
: public framework::OperatorWithKernel<
DeviceType, FusionDequantAddBNReluQuantParam<DeviceType>,
operators::FusionDequantAddBNReluQuantKernel<DeviceType, T>> {
public:
FusionDequantAddBNReluQuantOp(const std::string &type,
const VariableNameMap &inputs,
const VariableNameMap &outputs,
const framework::AttributeMap &attrs,
std::shared_ptr<framework::Scope> scope)
: framework::OperatorWithKernel<
DeviceType, FusionDequantAddBNReluQuantParam<DeviceType>,
operators::FusionDequantAddBNReluQuantKernel<DeviceType, T>>(
type, inputs, outputs, attrs, scope) {}
// inference output shape
void InferShape() const override;
};
#endif // FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP
#ifdef FUSION_DEQUANT_ADD_BN_QUANT_OP
class FusionDequantAddBNQuantMatcher : public framework::FusionOpMatcher {
public:
FusionDequantAddBNQuantMatcher() {
node_ = framework::Node(G_OP_TYPE_DEQUANTIZE);
node_ > std::make_shared<framework::Node>(G_OP_TYPE_ELEMENTWISE_ADD) >
std::make_shared<framework::Node>(G_OP_TYPE_BATCHNORM) >
std::make_shared<framework::Node>(G_OP_TYPE_QUANTIZE);
}
void FolderNodes(
framework::Node *node,
std::vector<std::shared_ptr<framework::Node>> *removed_nodes) {
node->Folder(node_.Depth(), Type(),
{{G_OP_TYPE_ELEMENTWISE_ADD, {{"Y", "Y"}}},
{G_OP_TYPE_BATCHNORM,
{{"Scale", "BNScale"},
{"Mean", "BNMean"},
{"Bias", "BNBias"},
{"Variance", "BNVariance"},
{"Y", "Out"}}}},
removed_nodes);
}
std::string Type() { return G_OP_TYPE_FUSION_DEQUANT_ADD_BN_QUANT; }
};
template <typename DeviceType, typename T>
class FusionDequantAddBNQuantOp
: public framework::OperatorWithKernel<
DeviceType, FusionDequantAddBNQuantParam<DeviceType>,
operators::FusionDequantAddBNQuantKernel<DeviceType, T>> {
public:
FusionDequantAddBNQuantOp(const std::string &type,
const VariableNameMap &inputs,
const VariableNameMap &outputs,
const framework::AttributeMap &attrs,
std::shared_ptr<framework::Scope> scope)
: framework::OperatorWithKernel<
DeviceType, FusionDequantAddBNQuantParam<DeviceType>,
operators::FusionDequantAddBNQuantKernel<DeviceType, T>>(
type, inputs, outputs, attrs, scope) {}
// inference output shape
void InferShape() const override;
};
#endif // FUSION_DEQUANT_ADD_BN_QUANT_OP
} // namespace operators
} // namespace paddle_mobile
......@@ -12,28 +12,43 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef FUSION_DEQUANT_BN_RELU_OP
#include "operators/fusion_dequant_bn_relu_op.h"
#include "operators/fusion_dequant_bn_op.h"
namespace paddle_mobile {
namespace operators {
#ifdef FUSION_DEQUANT_BN_OP
template <typename Dtype, typename T>
void FusionDequantBNOp<Dtype, T>::InferShape() const {
const auto& input_dims = this->param_.input_->dims();
this->param_.output_->Resize(input_dims);
}
#endif // FUSION_DEQUANT_BN_OP
#ifdef FUSION_DEQUANT_BN_RELU_OP
template <typename Dtype, typename T>
void FusionDequantBNReluOp<Dtype, T>::InferShape() const {
const auto& input_dims = this->param_.input_->dims();
this->param_.output_->Resize(input_dims);
}
#endif // FUSION_DEQUANT_BN_RELU_OP
} // namespace operators
} // namespace paddle_mobile
namespace ops = paddle_mobile::operators;
#ifdef FUSION_DEQUANT_BN_OP
REGISTER_FUSION_MATCHER(fusion_dequant_bn, ops::FusionDequantBNMatcher);
#ifdef PADDLE_MOBILE_CPU
REGISTER_OPERATOR_CPU(fusion_dequant_bn, ops::FusionDequantBNOp);
#endif // PADDLE_MOBILE_CPU
#endif // FUSION_DEQUANT_BN_OP
#ifdef FUSION_DEQUANT_BN_RELU_OP
REGISTER_FUSION_MATCHER(fusion_dequant_bn_relu,
ops::FusionDequantBNReluMatcher);
#ifdef PADDLE_MOBILE_CPU
REGISTER_OPERATOR_CPU(fusion_dequant_bn_relu, ops::FusionDequantBNReluOp);
#endif
#endif
#endif // PADDLE_MOBILE_CPU
#endif // FUSION_DEQUANT_BN_RELU_OP
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
#include <string>
#include <vector>
#include "framework/operator.h"
#include "framework/program/program-optimize/fusion_op_register.h"
#include "operators/kernel/dequant_bn_kernel.h"
#include "operators/op_param.h"
namespace paddle_mobile {
namespace operators {
#if defined(FUSION_DEQUANT_BN_OP) || defined(FUSION_DEQUANT_BN_RELU_OP)
class FusionDequantBNMatcher : public framework::FusionOpMatcher {
public:
FusionDequantBNMatcher() {
node_ = framework::Node(G_OP_TYPE_DEQUANTIZE);
node_ > std::make_shared<framework::Node>(G_OP_TYPE_BATCHNORM);
}
virtual void FolderNodes(
framework::Node *node,
std::vector<std::shared_ptr<framework::Node>> *removed_nodes) {
node->Folder(node_.Depth(), Type(),
{{G_OP_TYPE_BATCHNORM,
{{"Scale", "BNScale"},
{"Mean", "BNMean"},
{"Bias", "BNBias"},
{"Variance", "BNVariance"},
{"Y", "Out"}}}},
removed_nodes);
}
std::string Type() override { return G_OP_TYPE_FUSION_DEQUANT_BN; }
};
#endif // FUSION_DEQUANT_BN_OP || FUSION_DEQUANT_BN_RELU_OP
#ifdef FUSION_DEQUANT_BN_OP
template <typename DeviceType, typename T>
class FusionDequantBNOp : public framework::OperatorWithKernel<
DeviceType, FusionDequantBNParam<DeviceType>,
operators::FusionDequantBNKernel<DeviceType, T>> {
public:
FusionDequantBNOp(const std::string &type, const VariableNameMap &inputs,
const VariableNameMap &outputs,
const framework::AttributeMap &attrs,
std::shared_ptr<framework::Scope> scope)
: framework::OperatorWithKernel<
DeviceType, FusionDequantBNParam<DeviceType>,
operators::FusionDequantBNKernel<DeviceType, T>>(
type, inputs, outputs, attrs, scope) {}
// inference output shape
void InferShape() const override;
};
#endif // FUSION_DEQUANT_BN_OP
#ifdef FUSION_DEQUANT_BN_RELU_OP
class FusionDequantBNReluMatcher : public FusionDequantBNMatcher {
public:
FusionDequantBNReluMatcher() : FusionDequantBNMatcher() {
node_ > std::make_shared<framework::Node>(G_OP_TYPE_RELU);
}
virtual std::string Type() { return G_OP_TYPE_FUSION_DEQUANT_BN_RELU; }
};
template <typename DeviceType, typename T>
class FusionDequantBNReluOp
: public framework::OperatorWithKernel<
DeviceType, FusionDequantBNParam<DeviceType>,
operators::FusionDequantBNReluKernel<DeviceType, T>> {
public:
FusionDequantBNReluOp(const std::string &type, const VariableNameMap &inputs,
const VariableNameMap &outputs,
const framework::AttributeMap &attrs,
std::shared_ptr<framework::Scope> scope)
: framework::OperatorWithKernel<
DeviceType, FusionDequantBNParam<DeviceType>,
operators::FusionDequantBNReluKernel<DeviceType, T>>(
type, inputs, outputs, attrs, scope) {}
void InferShape() const override;
};
#endif // FUSION_DEQUANT_BN_RELU_OP
} // namespace operators
} // namespace paddle_mobile
......@@ -42,7 +42,8 @@ class FusionDequantBNReluMatcher : public framework::FusionOpMatcher {
{{"Scale", "BNScale"},
{"Mean", "BNMean"},
{"Bias", "BNBias"},
{"Variance", "BNVariance"}}}},
{"Variance", "BNVariance"},
{"Y", "Out"}}}},
removed_nodes);
}
......
......@@ -14,19 +14,15 @@ limitations under the License. */
#ifdef GRU_OP
#include "operators/gru_op.h"
#include <vector>
#include "common/enforce.h"
#include "operators/gru_op.h"
namespace paddle_mobile {
namespace operators {
template <typename Dtype, typename T>
void GruOp<Dtype, T>::InferShape() const {
auto lod_size = this->param_.InputInput()->lod().size();
PADDLE_MOBILE_ENFORCE((lod_size == 1),
"Current LoD only supports one dimension.");
auto input_dims = this->param_.InputInput()->dims();
auto weight_dims = this->param_.InputWeight()->dims();
int input_size = input_dims[1];
......
......@@ -15,6 +15,7 @@ limitations under the License. */
#ifdef IM2SEQUENCE_OP
#include "operators/im2sequence_op.h"
#include <vector>
namespace paddle_mobile {
namespace operators {
......@@ -29,20 +30,16 @@ int Im2SequenceOutputSize(int input_size, int kernel, int padding_1,
template <typename Dtype, typename T>
void Im2SequenceOp<Dtype, T>::InferShape() const {
auto in_x_dims = this->param_.Input()->dims();
const std::vector<int> &kernels = this->param_.Kernels();
const std::vector<int> &strides = this->param_.Strides();
std::vector<int> paddings = this->param_.Paddings();
std::vector<int64_t> output_shape({in_x_dims[0], in_x_dims[1]});
for (size_t i = 0; i < strides.size(); ++i) {
output_shape.push_back(Im2SequenceOutputSize(in_x_dims[i + 2], kernels[i],
paddings[i], paddings[i + 2],
strides[i]));
}
framework::DDim ddim = framework::make_ddim(output_shape);
this->param_.Output()->Resize(ddim);
}
......@@ -54,9 +51,5 @@ namespace ops = paddle_mobile::operators;
#ifdef PADDLE_MOBILE_CPU
REGISTER_OPERATOR_CPU(im2sequence, ops::Im2SequenceOp);
#endif
#ifdef PADDLE_MOBILE_MALI_GPU
#endif
#ifdef PADDLE_MOBILE_FPGA
#endif
#endif
#endif // IM2SEQUENCE_OP
......@@ -12,39 +12,63 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef POOL_OP
#ifdef CAST_OP
#pragma once
#ifdef _OPENMP
#include <omp.h>
#endif
#include <algorithm>
#include <vector>
#include "framework/tensor.h"
#if __ARM_NEON
#include <arm_neon.h>
#endif // __ARM_NEON
#include "framework/data_type.h"
#include "operators/kernel/kernels.h"
namespace paddle_mobile {
namespace operators {
namespace math {
void Pool3x3Avgs1p1(const framework::Tensor *input, framework::Tensor *output);
void Pool3x3Maxs1p1(const framework::Tensor *input, framework::Tensor *output);
void Pool3x3Max(std::vector<int> strides, std::vector<int> paddings,
const framework::Tensor *input, framework::Tensor *output);
void Pool3x3Avg(std::vector<int> strides, std::vector<int> paddings,
const framework::Tensor *in_x, framework::Tensor *out);
void Pool3x3Maxs1_int8(const framework::Tensor *input,
framework::Tensor *output, int32_t pad_h, int32_t pad_w);
void Pool3x3Maxs2_int8(const framework::Tensor *input,
framework::Tensor *output, int32_t pad_h, int32_t pad_w);
void Pool3x3Max_int8(const std::vector<int> &strides,
const std::vector<int> &paddings,
const framework::Tensor *input, framework::Tensor *output);
} // namespace math
template <typename InT>
struct CastOutOpFunctor {
const framework::Tensor* in_;
framework::Tensor* out_;
CastOutOpFunctor(const framework::Tensor* in, framework::Tensor* out)
: in_(in), out_(out) {}
template <typename OutT>
void apply() const {
const InT* input = in_->data<InT>();
OutT* output = out_->mutable_data<OutT>();
size_t numel = in_->numel();
for (size_t i = 0; i < numel; ++i) {
output[i] = static_cast<OutT>(input[i]);
}
}
};
struct CastOpFunctor {
const framework::Tensor* in_;
framework::Tensor* out_;
int output_type_;
CastOpFunctor(const framework::Tensor* in, framework::Tensor* out,
const int output_type)
: in_(in), out_(out), output_type_(output_type) {}
template <typename InT>
void apply() const {
framework::VisitDataType(framework::ToDataType(output_type_),
CastOutOpFunctor<InT>(in_, out_));
}
};
template <>
bool CastKernel<CPU, float>::Init(CastParam<CPU>* param) {
return true;
}
template <>
void CastKernel<CPU, float>::Compute(const CastParam<CPU>& param) {
const Tensor* input = param.input_;
Tensor* output = param.output_;
framework::VisitDataType(framework::ToDataType(param.input_type_),
CastOpFunctor(input, output, param.output_type_));
}
} // namespace operators
} // namespace paddle_mobile
#endif
#endif // CAST_OP
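The kernel above resolves the (input type, output type) pair with two nested VisitDataType calls: the outer visit fixes InT through CastOpFunctor, the inner one fixes OutT through CastOutOpFunctor<InT>. A self-contained model of that double dispatch (toy type enum, not the framework's proto types):
#include <cstdio>
#include <vector>
enum class DType { kInt32, kFloat32 };
template <typename Functor>
void VisitDataType(DType type, Functor functor) {
  switch (type) {
    case DType::kInt32:   functor.template apply<int>();   break;
    case DType::kFloat32: functor.template apply<float>(); break;
  }
}
template <typename InT>
struct CastTo {  // plays the role of CastOutOpFunctor<InT>
  const InT *in;
  void *out;
  int n;
  template <typename OutT>
  void apply() const {
    OutT *dst = static_cast<OutT *>(out);
    for (int i = 0; i < n; ++i) dst[i] = static_cast<OutT>(in[i]);
  }
};
struct CastFrom {  // plays the role of CastOpFunctor
  const void *in;
  void *out;
  int n;
  DType out_type;
  template <typename InT>
  void apply() const {
    VisitDataType(out_type, CastTo<InT>{static_cast<const InT *>(in), out, n});
  }
};
int main() {
  std::vector<int> src = {1, 2, 3};
  std::vector<float> dst(3);
  VisitDataType(DType::kInt32, CastFrom{src.data(), dst.data(), 3, DType::kFloat32});
  std::printf("%.1f %.1f %.1f\n", dst[0], dst[1], dst[2]);  // 1.0 2.0 3.0
  return 0;
}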
......@@ -15,6 +15,7 @@ limitations under the License. */
#ifdef FUSION_CONVADDBNRELU_OP
#include "operators/kernel/conv_add_bn_relu_kernel.h"
#include <math.h>
#include "operators/kernel/central-arm-func/conv_add_bn_relu_arm_func.h"
namespace paddle_mobile {
......
......@@ -32,20 +32,6 @@ void ConvAddReluKernel<CPU, float>::Compute(
}
template class ConvAddReluKernel<CPU, float>;
#ifdef FUSION_CONVADDRELU_INT8_OP
template <>
bool ConvAddReluKernel<CPU, int8_t>::Init(FusionConvAddReluParam<CPU> *param) {
return true;
}
template <>
void ConvAddReluKernel<CPU, int8_t>::Compute(
const FusionConvAddReluParam<CPU> &param) {
ConvAddReluCompute<int8_t, int32_t>(param);
}
template class ConvAddReluKernel<CPU, int8_t>;
#endif
} // namespace operators
} // namespace paddle_mobile
......
......@@ -55,10 +55,9 @@ bool ConvKernel<CPU, float>::Init(ConvParam<CPU> *param) {
param->Input()->dims()[2] <= 140 /* referred from ncnn */) {
param->ExecMode() = ConvParam<CPU>::EXEC_WINOGRAD3X3_FLOAT;
// transform weight
framework::Tensor transformed_weight;
operators::math::winograd_transform_weight<8, 3>(*param->Filter(),
&transformed_weight);
framework::TensorCopy(transformed_weight, param->Filter());
param->transformed_filter_ = new framework::Tensor;
operators::math::winograd_transform_weight<8, 3>(
*param->Filter(), param->transformed_filter_);
#endif
} else {
param->ExecMode() = ConvParam<CPU>::EXEC_GEMM_FLOAT;
......
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef FUSION_DEQUANT_ADD_BN_OP
#include "operators/kernel/dequant_add_bn_kernel.h"
#include <cmath>
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>
#endif
namespace paddle_mobile {
namespace operators {
template <>
bool FusionDequantAddBNKernel<CPU, float>::Init(
FusionDequantAddBNParam<CPU> *param) {
// elementwise add params
const Tensor *bias = param->bias_;
// batch norm params
const Tensor *bn_mean = param->bn_mean_;
const Tensor *bn_variance = param->bn_variance_;
Tensor *bn_scale = param->bn_scale_;
Tensor *bn_bias = param->bn_bias_;
const float epsilon = param->epsilon_;
const float *bias_ptr = bias->data<float>();
const float *mean_ptr = bn_mean->data<float>();
const float *var_ptr = bn_variance->data<float>();
float *bn_scale_ptr = bn_scale->mutable_data<float>();
float *bn_bias_ptr = bn_bias->mutable_data<float>();
for (int c = 0; c < bn_scale->numel(); ++c) {
float inv_scale = bn_scale_ptr[c] / (std::sqrt(var_ptr[c] + epsilon));
bn_scale_ptr[c] = inv_scale;
bn_bias_ptr[c] = inv_scale * (bias_ptr[c] - mean_ptr[c]) + bn_bias_ptr[c];
}
return true;
}
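The channel-wise folding in Init() comes from pushing the elementwise bias b through batch norm; with per-channel statistics mu and sigma^2, learned gamma and beta, and stabilizer epsilon:
$$\mathrm{bn}(x+b)=\gamma\,\frac{(x+b)-\mu}{\sqrt{\sigma^{2}+\epsilon}}+\beta=\frac{\gamma}{\sqrt{\sigma^{2}+\epsilon}}\,x+\Big(\frac{\gamma\,(b-\mu)}{\sqrt{\sigma^{2}+\epsilon}}+\beta\Big)$$
so bn_scale_ptr[c] ends up holding the folded scale and bn_bias_ptr[c] the folded bias, leaving Compute() a single multiply-add per element.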
template <>
void FusionDequantAddBNKernel<CPU, float>::Compute(
const FusionDequantAddBNParam<CPU> &param) {
const int32_t *input = param.input_->data<int32_t>();
const float *bn_scale = param.bn_scale_->data<float>();
const float *bn_bias = param.bn_bias_->data<float>();
// dequantize params
const float activation_scale = param.activation_scale_->data<float>()[0];
const float weight_scale = param.weight_scale_;
const float dequant_scale = activation_scale / weight_scale;
float *output = param.output_->mutable_data<float>();
int batch_size = param.input_->dims()[0];
int channels = param.input_->dims()[1];
size_t spatial_size = param.input_->dims()[2] * param.input_->dims()[3];
#pragma omp parallel for collapse(2)
for (int batch = 0; batch < batch_size; ++batch) {
for (int c = 0; c < channels; ++c) {
float scale = bn_scale[c] * dequant_scale;
float bias = bn_bias[c];
size_t offset = (batch * channels + c) * spatial_size;
const int32_t *x = input + offset;
float *y = output + offset;
size_t remain = spatial_size;
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
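      // NEON fast path: handle 16 int32 values per iteration (four float32x4
      // registers), convert to float, and apply y = scale * x + bias with
      // vmlaq; the scalar loop after the #endif covers the remaining elements.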
int loop = spatial_size >> 4;
remain = spatial_size & 0xF;
float32x4_t __scale = vdupq_n_f32(scale);
float32x4_t __bias = vdupq_n_f32(bias);
for (int k = 0; k < loop; ++k, x += 16, y += 16) {
int32x4_t r0 = vld1q_s32(x);
int32x4_t r1 = vld1q_s32(x + 4);
int32x4_t r2 = vld1q_s32(x + 8);
int32x4_t r3 = vld1q_s32(x + 12);
float32x4_t f0 = vcvtq_f32_s32(r0);
float32x4_t f1 = vcvtq_f32_s32(r1);
float32x4_t f2 = vcvtq_f32_s32(r2);
float32x4_t f3 = vcvtq_f32_s32(r3);
f0 = vmlaq_f32(__bias, __scale, f0);
f1 = vmlaq_f32(__bias, __scale, f1);
f2 = vmlaq_f32(__bias, __scale, f2);
f3 = vmlaq_f32(__bias, __scale, f3);
vst1q_f32(y, f0);
vst1q_f32(y + 4, f1);
vst1q_f32(y + 8, f2);
vst1q_f32(y + 12, f3);
}
#endif // __ARM_NEON__
for (int k = 0; k < remain; ++k) {
y[k] = scale * x[k] + bias;
}
}
}
}
} // namespace operators
} // namespace paddle_mobile
#endif // FUSION_DEQUANT_ADD_BN_OP
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "operators/kernel/dequant_bn_relu_kernel.h"
#include <cmath>
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>
#endif
namespace paddle_mobile {
namespace operators {
#if defined(FUSION_DEQUANT_BN_RELU_OP) || defined(FUSION_DEQUANT_ADD_BN_RELU_OP)
void DequantBNReluCompute(const FusionDequantBNParam<CPU> *param) {
const int32_t *input = param->input_->data<int32_t>();
const float *bn_scale = param->bn_scale_->data<float>();
const float *bn_bias = param->bn_bias_->data<float>();
// dequantize params
const float activation_scale = param->activation_scale_->data<float>()[0];
const float weight_scale = param->weight_scale_;
const float dequant_scale = activation_scale / weight_scale;
float *output = param->output_->mutable_data<float>();
int batch_size = param->input_->dims()[0];
int channels = param->input_->dims()[1];
size_t spatial_size = param->input_->dims()[2] * param->input_->dims()[3];
#pragma omp parallel for collapse(2)
for (int batch = 0; batch < batch_size; ++batch) {
for (int c = 0; c < channels; ++c) {
float scale = bn_scale[c] * dequant_scale;
float bias = bn_bias[c];
size_t offset = (batch * channels + c) * spatial_size;
const int32_t *x = input + offset;
float *y = output + offset;
size_t remain = spatial_size;
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
int loop = spatial_size >> 4;
remain = spatial_size & 0xF;
float32x4_t __scale = vdupq_n_f32(scale);
float32x4_t __bias = vdupq_n_f32(bias);
float32x4_t __zero = vdupq_n_f32(0.f);
for (int k = 0; k < loop; ++k, x += 16, y += 16) {
int32x4_t r0 = vld1q_s32(x);
int32x4_t r1 = vld1q_s32(x + 4);
int32x4_t r2 = vld1q_s32(x + 8);
int32x4_t r3 = vld1q_s32(x + 12);
float32x4_t f0 = vcvtq_f32_s32(r0);
float32x4_t f1 = vcvtq_f32_s32(r1);
float32x4_t f2 = vcvtq_f32_s32(r2);
float32x4_t f3 = vcvtq_f32_s32(r3);
f0 = vmlaq_f32(__bias, __scale, f0);
f1 = vmlaq_f32(__bias, __scale, f1);
f2 = vmlaq_f32(__bias, __scale, f2);
f3 = vmlaq_f32(__bias, __scale, f3);
f0 = vmaxq_f32(__zero, f0);
f1 = vmaxq_f32(__zero, f1);
f2 = vmaxq_f32(__zero, f2);
f3 = vmaxq_f32(__zero, f3);
vst1q_f32(y, f0);
vst1q_f32(y + 4, f1);
vst1q_f32(y + 8, f2);
vst1q_f32(y + 12, f3);
}
#endif // __ARM_NEON__
for (int k = 0; k < remain; ++k) {
y[k] = std::max(scale * x[k] + bias, 0.f);
}
}
}
}
#endif
#ifdef FUSION_DEQUANT_BN_RELU_OP
template <>
bool FusionDequantBNReluKernel<CPU, float>::Init(
FusionDequantBNReluParam<CPU> *param) {
// batch norm params
const Tensor *bn_mean = param->bn_mean_;
const Tensor *bn_variance = param->bn_variance_;
Tensor *bn_scale = param->bn_scale_;
Tensor *bn_bias = param->bn_bias_;
const float epsilon = param->epsilon_;
const float *mean_ptr = bn_mean->data<float>();
const float *var_ptr = bn_variance->data<float>();
float *bn_scale_ptr = bn_scale->mutable_data<float>();
float *bn_bias_ptr = bn_bias->mutable_data<float>();
for (int c = 0; c < bn_scale->numel(); ++c) {
float inv_scale = bn_scale_ptr[c] / (std::sqrt(var_ptr[c] + epsilon));
bn_scale_ptr[c] = inv_scale;
bn_bias_ptr[c] = bn_bias_ptr[c] - inv_scale * mean_ptr[c];
}
return true;
}
template <>
void FusionDequantBNReluKernel<CPU, float>::Compute(
const FusionDequantBNReluParam<CPU> &param) {
DequantBNReluCompute(&param);
}
#endif // FUSION_DEQUANT_BN_RELU_OP
#ifdef FUSION_DEQUANT_ADD_BN_RELU_OP
template <>
bool FusionDequantAddBNReluKernel<CPU, float>::Init(
FusionDequantAddBNReluParam<CPU> *param) {
// elementwise add params
const Tensor *bias = param->bias_;
// batch norm params
const Tensor *bn_mean = param->bn_mean_;
const Tensor *bn_variance = param->bn_variance_;
Tensor *bn_scale = param->bn_scale_;
Tensor *bn_bias = param->bn_bias_;
const float epsilon = param->epsilon_;
const float *bias_ptr = bias->data<float>();
const float *mean_ptr = bn_mean->data<float>();
const float *var_ptr = bn_variance->data<float>();
float *bn_scale_ptr = bn_scale->mutable_data<float>();
float *bn_bias_ptr = bn_bias->mutable_data<float>();
for (int c = 0; c < bn_scale->numel(); ++c) {
float inv_scale = bn_scale_ptr[c] / (std::sqrt(var_ptr[c] + epsilon));
bn_scale_ptr[c] = inv_scale;
bn_bias_ptr[c] = inv_scale * (bias_ptr[c] - mean_ptr[c]) + bn_bias_ptr[c];
}
return true;
}
template <>
void FusionDequantAddBNReluKernel<CPU, float>::Compute(
const FusionDequantAddBNReluParam<CPU> &param) {
DequantBNReluCompute(&param);
}
#endif // FUSION_DEQUANT_ADD_BN_RELU_OP
} // namespace operators
} // namespace paddle_mobile
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <cmath>
#include "operators/kernel/dequant_bn_kernel.h"
#include "operators/math/activation.h"
#include "operators/math/quantize.h"
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>
#endif
namespace paddle_mobile {
namespace operators {
#if defined(FUSION_DEQUANT_BN_OP) || defined(FUSION_DEQUANT_ADD_BN_OP) || \
defined(FUSION_DEQUANT_BN_RELU_OP) || \
defined(FUSION_DEQUANT_ADD_BN_RELU_OP) || \
defined(FUSION_DEQUANT_ADD_BN_QUANT_OP) || \
defined(FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP)
void PublicFusionDequantBNInitParam(FusionDequantBNParam<CPU> *param,
const framework::Tensor *bias) {
// batch norm params
const Tensor *bn_mean = param->bn_mean_;
const Tensor *bn_variance = param->bn_variance_;
Tensor *bn_scale = param->bn_scale_;
Tensor *bn_bias = param->bn_bias_;
const float epsilon = param->epsilon_;
const float *mean_ptr = bn_mean->data<float>();
const float *var_ptr = bn_variance->data<float>();
float *bn_scale_ptr = bn_scale->mutable_data<float>();
float *bn_bias_ptr = bn_bias->mutable_data<float>();
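// Shared folding helper: with an optional elementwise-add bias b
// (b == 0 when `bias` is null),
//   scale' = scale / sqrt(var + eps),  bias' = scale' * (b - mean) + bn_bias,
// which covers both the plain BN fusions and the add+BN fusions.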
for (int c = 0; c < bn_scale->numel(); ++c) {
float inv_scale = 1.f / (std::sqrt(var_ptr[c] + epsilon));
float val = bias ? bias->data<float>()[c] : 0;
bn_bias_ptr[c] =
inv_scale * bn_scale_ptr[c] * (val - mean_ptr[c]) + bn_bias_ptr[c];
bn_scale_ptr[c] = inv_scale * bn_scale_ptr[c];
}
}
#endif
#if defined(FUSION_DEQUANT_BN_OP) || defined(FUSION_DEQUANT_ADD_BN_OP) || \
defined(FUSION_DEQUANT_BN_RELU_OP) || \
defined(FUSION_DEQUANT_ADD_BN_RELU_OP)
template <ActivationType Act>
void DequantBNCompute(const FusionDequantBNParam<CPU> *param) {
const int32_t *input = param->input_->data<int32_t>();
const float *bn_scale = param->bn_scale_->data<float>();
const float *bn_bias = param->bn_bias_->data<float>();
// dequantize params
const float activation_scale = param->activation_scale_->data<float>()[0];
const float weight_scale = param->weight_scale_;
const float dequant_scale = activation_scale / weight_scale;
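// int32 conv outputs are restored to float by multiplying with
// activation_scale / weight_scale, the same formula DequantizeKernel uses.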
float *output = param->output_->mutable_data<float>();
int batch_size = param->input_->dims()[0];
int channels = param->input_->dims()[1];
size_t spatial_size = param->input_->dims()[2] * param->input_->dims()[3];
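// NCHW layout: each (batch, channel) pair owns a contiguous slab of
// spatial_size elements, which lets the two outer loops be collapsed
// for OpenMP.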
#pragma omp parallel for collapse(2)
for (int batch = 0; batch < batch_size; ++batch) {
for (int c = 0; c < channels; ++c) {
// not fuse bn and dequant scale to minimize precision difference
// float scale = bn_scale[c] * dequant_scale;
float scale = bn_scale[c];
float bias = bn_bias[c];
size_t offset = (batch * channels + c) * spatial_size;
const int32_t *x = input + offset;
float *y = output + offset;
size_t remain = spatial_size;
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
int loop = spatial_size >> 4;
remain = spatial_size & 0xF;
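// Vectorized path handles 16 elements (4 x float32x4) per iteration;
// the remaining spatial_size % 16 elements fall through to the scalar
// loop below.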
float32x4_t __dequant_scale = vdupq_n_f32(dequant_scale);
float32x4_t __scale = vdupq_n_f32(scale);
float32x4_t __bias = vdupq_n_f32(bias);
for (int k = 0; k < loop; ++k, x += 16, y += 16) {
int32x4_t r0 = vld1q_s32(x);
int32x4_t r1 = vld1q_s32(x + 4);
int32x4_t r2 = vld1q_s32(x + 8);
int32x4_t r3 = vld1q_s32(x + 12);
float32x4_t f0 = vcvtq_f32_s32(r0);
float32x4_t f1 = vcvtq_f32_s32(r1);
float32x4_t f2 = vcvtq_f32_s32(r2);
float32x4_t f3 = vcvtq_f32_s32(r3);
f0 = vmulq_f32(__dequant_scale, f0);
f1 = vmulq_f32(__dequant_scale, f1);
f2 = vmulq_f32(__dequant_scale, f2);
f3 = vmulq_f32(__dequant_scale, f3);
f0 = vmlaq_f32(__bias, __scale, f0);
f1 = vmlaq_f32(__bias, __scale, f1);
f2 = vmlaq_f32(__bias, __scale, f2);
f3 = vmlaq_f32(__bias, __scale, f3);
f0 = math::vActiveq_f32<Act>(f0);
f1 = math::vActiveq_f32<Act>(f1);
f2 = math::vActiveq_f32<Act>(f2);
f3 = math::vActiveq_f32<Act>(f3);
vst1q_f32(y, f0);
vst1q_f32(y + 4, f1);
vst1q_f32(y + 8, f2);
vst1q_f32(y + 12, f3);
}
#endif // __ARM_NEON__
for (int k = 0; k < remain; ++k) {
y[k] = math::Active<Act>(scale * (dequant_scale * x[k]) + bias);
}
}
}
}
#endif
#ifdef FUSION_DEQUANT_BN_OP
template <>
bool FusionDequantBNKernel<CPU, float>::Init(FusionDequantBNParam<CPU> *param) {
PublicFusionDequantBNInitParam(param, nullptr);
return true;
}
template <>
void FusionDequantBNKernel<CPU, float>::Compute(
const FusionDequantBNParam<CPU> &param) {
DequantBNCompute<IDENTITY>(&param);
}
#endif // FUSION_DEQUANT_BN_OP
#ifdef FUSION_DEQUANT_BN_RELU_OP
template <>
bool FusionDequantBNReluKernel<CPU, float>::Init(
FusionDequantBNParam<CPU> *param) {
PublicFusionDequantBNInitParam(param, nullptr);
return true;
}
template <>
void FusionDequantBNReluKernel<CPU, float>::Compute(
const FusionDequantBNParam<CPU> &param) {
DequantBNCompute<RELU>(&param);
}
#endif // FUSION_DEQUANT_BN_RELU_OP
#ifdef FUSION_DEQUANT_ADD_BN_OP
template <>
bool FusionDequantAddBNKernel<CPU, float>::Init(
FusionDequantAddBNParam<CPU> *param) {
const framework::Tensor *bias = param->bias_;
PublicFusionDequantBNInitParam(param, bias);
return true;
}
template <>
void FusionDequantAddBNKernel<CPU, float>::Compute(
const FusionDequantAddBNParam<CPU> &param) {
DequantBNCompute<IDENTITY>(&param);
}
#endif // FUSION_DEQUANT_ADD_BN_OP
#ifdef FUSION_DEQUANT_ADD_BN_RELU_OP
template <>
bool FusionDequantAddBNReluKernel<CPU, float>::Init(
FusionDequantAddBNParam<CPU> *param) {
const framework::Tensor *bias = param->bias_;
PublicFusionDequantBNInitParam(param, bias);
return true;
}
template <>
void FusionDequantAddBNReluKernel<CPU, float>::Compute(
const FusionDequantAddBNParam<CPU> &param) {
DequantBNCompute<RELU>(&param);
}
#endif // FUSION_DEQUANT_ADD_BN_RELU_OP
#if defined(FUSION_DEQUANT_ADD_BN_QUANT_OP) || \
defined(FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP)
template <ActivationType Act, RoundType R>
void DequantBNQuantCompute(const FusionDequantAddBNQuantParam<CPU> *param) {
const int32_t *input = param->input_->data<int32_t>();
const float *bn_scale = param->bn_scale_->data<float>();
const float *bn_bias = param->bn_bias_->data<float>();
// dequantize params
const float activation_scale = param->activation_scale_->data<float>()[0];
const float weight_scale = param->weight_scale_;
const float dequant_scale = activation_scale / weight_scale;
// quantize params
Tensor *output_scale = param->online_scale_;
float max_abs = 0.f;
int8_t *output = param->output_->mutable_data<int8_t>();
int batch_size = param->input_->dims()[0];
int channels = param->input_->dims()[1];
size_t spatial_size = param->input_->dims()[2] * param->input_->dims()[3];
// Dynamic (online) scale is not implemented yet (see the TODO below),
// so the static-scale branch is always taken.
// if (param->is_static_) {
if (true) {
max_abs = param->static_scale_;
float quant_scale = 127.f / max_abs;
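// Map the float range [-max_abs, max_abs] onto int8 [-127, 127].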
#pragma omp parallel for collapse(2)
for (int batch = 0; batch < batch_size; ++batch) {
for (int c = 0; c < channels; ++c) {
// not fuse bn and dequant scale to minimize precision difference
// float scale = bn_scale[c] * dequant_scale;
float scale = bn_scale[c];
float bias = bn_bias[c];
size_t offset = (batch * channels + c) * spatial_size;
const int32_t *x = input + offset;
int8_t *y = output + offset;
size_t remain = spatial_size;
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
int loop = spatial_size >> 4;
remain = spatial_size & 0xF;
float32x4_t __dequant_scale = vdupq_n_f32(dequant_scale);
float32x4_t __scale = vdupq_n_f32(scale);
float32x4_t __bias = vdupq_n_f32(bias);
float32x4_t __quant_scale = vdupq_n_f32(quant_scale);
for (int k = 0; k < loop; ++k, x += 16, y += 16) {
int32x4_t r0 = vld1q_s32(x);
int32x4_t r1 = vld1q_s32(x + 4);
int32x4_t r2 = vld1q_s32(x + 8);
int32x4_t r3 = vld1q_s32(x + 12);
float32x4_t f0 = vcvtq_f32_s32(r0);
float32x4_t f1 = vcvtq_f32_s32(r1);
float32x4_t f2 = vcvtq_f32_s32(r2);
float32x4_t f3 = vcvtq_f32_s32(r3);
f0 = vmulq_f32(__dequant_scale, f0);
f1 = vmulq_f32(__dequant_scale, f1);
f2 = vmulq_f32(__dequant_scale, f2);
f3 = vmulq_f32(__dequant_scale, f3);
f0 = vmlaq_f32(__bias, __scale, f0);
f1 = vmlaq_f32(__bias, __scale, f1);
f2 = vmlaq_f32(__bias, __scale, f2);
f3 = vmlaq_f32(__bias, __scale, f3);
f0 = math::vActiveq_f32<Act>(f0);
f1 = math::vActiveq_f32<Act>(f1);
f2 = math::vActiveq_f32<Act>(f2);
f3 = math::vActiveq_f32<Act>(f3);
f0 = vmulq_f32(__quant_scale, f0);
f1 = vmulq_f32(__quant_scale, f1);
f2 = vmulq_f32(__quant_scale, f2);
f3 = vmulq_f32(__quant_scale, f3);
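// Round to int32, then narrow s32 -> s16 -> s8. vmovn truncates rather
// than saturates, so this assumes the static scale really bounds |x|
// and the rounded values fit in int8.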
int32x4_t q0 = math::vRoundq_f32<R>(f0);
int32x4_t q1 = math::vRoundq_f32<R>(f1);
int32x4_t q2 = math::vRoundq_f32<R>(f2);
int32x4_t q3 = math::vRoundq_f32<R>(f3);
int16x4_t d0 = vmovn_s32(q0);
int16x4_t d1 = vmovn_s32(q1);
int16x4_t d2 = vmovn_s32(q2);
int16x4_t d3 = vmovn_s32(q3);
int16x8_t q5 = vcombine_s16(d0, d1);
int16x8_t q6 = vcombine_s16(d2, d3);
int8x8_t d5 = vmovn_s16(q5);
int8x8_t d6 = vmovn_s16(q6);
vst1_s8(y, d5);
vst1_s8(y + 8, d6);
}
#endif // __ARM_NEON__
for (int k = 0; k < remain; ++k) {
float x_temp =
math::Active<Act>(scale * (dequant_scale * x[k]) + bias);
y[k] = math::Round<R>(x_temp * quant_scale);
}
}
}
} else {
// TODO(hjchen2)
max_abs = std::max(max_abs, 1e-6f);
}
param->online_scale_->mutable_data<float>()[0] = max_abs;
}
template <>
bool FusionDequantAddBNQuantKernel<CPU, float>::Init(
FusionDequantAddBNQuantParam<CPU> *param) {
const framework::Tensor *bias = param->bias_;
PublicFusionDequantBNInitParam(param, bias);
return true;
}
template <>
void FusionDequantAddBNQuantKernel<CPU, float>::Compute(
const FusionDequantAddBNQuantParam<CPU> &param) {
switch (param.round_type_) {
case ROUND_NEAREST_TO_EVEN:
DequantBNQuantCompute<IDENTITY, ROUND_NEAREST_TO_EVEN>(&param);
break;
case ROUND_NEAREST_TOWARDS_ZERO:
DequantBNQuantCompute<IDENTITY, ROUND_NEAREST_TOWARDS_ZERO>(&param);
break;
case ROUND_NEAREST_AWAY_ZERO:
DequantBNQuantCompute<IDENTITY, ROUND_NEAREST_AWAY_ZERO>(&param);
break;
default:
LOG(kLOG_ERROR) << "round type is not supported.";
break;
}
}
#endif // FUSION_DEQUANT_ADD_BN_QUANT_OP
#ifdef FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP
template <>
bool FusionDequantAddBNReluQuantKernel<CPU, float>::Init(
FusionDequantAddBNQuantParam<CPU> *param) {
const framework::Tensor *bias = param->bias_;
PublicFusionDequantBNInitParam(param, bias);
return true;
}
template <>
void FusionDequantAddBNReluQuantKernel<CPU, float>::Compute(
const FusionDequantAddBNQuantParam<CPU> &param) {
switch (param.round_type_) {
case ROUND_NEAREST_TO_EVEN:
DequantBNQuantCompute<RELU, ROUND_NEAREST_TO_EVEN>(&param);
break;
case ROUND_NEAREST_TOWARDS_ZERO:
DequantBNQuantCompute<RELU, ROUND_NEAREST_TOWARDS_ZERO>(&param);
break;
case ROUND_NEAREST_AWAY_ZERO:
DequantBNQuantCompute<RELU, ROUND_NEAREST_AWAY_ZERO>(&param);
break;
default:
LOG(kLOG_ERROR) << "round type is not supported.";
break;
}
}
#endif // FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP
} // namespace operators
} // namespace paddle_mobile
......@@ -30,8 +30,8 @@ bool DequantizeKernel<CPU, float>::Init(DequantizeParam<CPU> *param) {
template <>
void DequantizeKernel<CPU, float>::Compute(const DequantizeParam<CPU> &param) {
const Tensor *input = param.input_;
Tensor *output = param.output_;
const LoDTensor *input = param.input_;
LoDTensor *output = param.output_;
float activation_scale = param.activation_scale_->data<float>()[0];
float weight_scale = param.weight_scale_;
const int32_t *x = input->data<const int32_t>();
......@@ -72,6 +72,7 @@ void DequantizeKernel<CPU, float>::Compute(const DequantizeParam<CPU> &param) {
for (size_t i = 0; i < size; ++i) {
y[i] = x[i] * scale;
}
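// Propagate sequence (LoD) information so downstream sequence ops keep
// working on the dequantized output.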
output->set_lod(input->lod());
}
} // namespace operators
......
......@@ -29,12 +29,6 @@ template <>
void GruKernel<CPU, float>::Compute(const GruParam<CPU> &param) {
GruCompute<float>(param);
param.OutHidden()->set_lod(param.InputInput()->lod());
// DLOG << "________________" << param.OutHidden()->dims();
// DLOG << "________________" << param.OutHidden()->numel();
// auto *hiden_data = param.OutHidden()->data<float>();
// for (int64_t i = 0; i < 10; i++) {
// DLOG << "****************" << hiden_data[i];
// }
}
template class GruKernel<CPU, float>;
......
......@@ -15,11 +15,56 @@ limitations under the License. */
#ifdef RELU_OP
#include "operators/kernel/relu_kernel.h"
#include "operators/kernel/central-arm-func/relu_arm_func.h"
#include "common/types.h"
#include "operators/math/activation.h"
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>
#endif
namespace paddle_mobile {
namespace operators {
template <typename Dtype, ActivationType Act>
struct ReluCompute {
void operator()(const Tensor *input, Tensor *output) {}
};
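// The primary template is a no-op for unsupported element types; the
// float specialization below provides the NEON-accelerated path.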
template <ActivationType Act>
struct ReluCompute<float, Act> {
void operator()(const Tensor *input, Tensor *output) {
const float *x = input->data<float>();
float *y = output->mutable_data<float>();
size_t remain = input->numel();
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
size_t loop = remain >> 4;
remain = remain & 0xF;
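// 16 floats per iteration; the tail (numel % 16) is handled by the
// scalar loop after the NEON block.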
#pragma omp parallel for
for (size_t i = 0; i < loop; ++i) {
const float *local_x = x + (i << 4);
float *local_y = y + (i << 4);
float32x4_t r0 = vld1q_f32(local_x);
float32x4_t r1 = vld1q_f32(local_x + 4);
float32x4_t r2 = vld1q_f32(local_x + 8);
float32x4_t r3 = vld1q_f32(local_x + 12);
r0 = math::vActiveq_f32<Act>(r0);
r1 = math::vActiveq_f32<Act>(r1);
r2 = math::vActiveq_f32<Act>(r2);
r3 = math::vActiveq_f32<Act>(r3);
vst1q_f32(local_y, r0);
vst1q_f32(local_y + 4, r1);
vst1q_f32(local_y + 8, r2);
vst1q_f32(local_y + 12, r3);
}
x += (loop << 4);
y += (loop << 4);
#endif
for (size_t i = 0; i < remain; ++i) {
y[i] = math::Active<Act>(x[i]);
}
}
};
template <>
bool ReluKernel<CPU, float>::Init(ReluParam<CPU> *param) {
return true;
......@@ -27,7 +72,21 @@ bool ReluKernel<CPU, float>::Init(ReluParam<CPU> *param) {
template <>
void ReluKernel<CPU, float>::Compute(const ReluParam<CPU> &param) {
ReluCompute<float>(param);
const Tensor *input = param.InputX();
Tensor *output = param.Out();
ReluCompute<float, RELU>()(input, output);
}
template <>
bool Relu6Kernel<CPU, float>::Init(ReluParam<CPU> *param) {
return true;
}
template <>
void Relu6Kernel<CPU, float>::Compute(const ReluParam<CPU> &param) {
const Tensor *input = param.InputX();
Tensor *output = param.Out();
ReluCompute<float, RELU6>()(input, output);
}
} // namespace operators
......
......@@ -13,8 +13,9 @@ See the License for the specific language governing permissions and
limitations under the License. */
#ifdef FUSION_CONVADDADDPRELU_OP
#pragma once
#include <string>
#include <vector>
#include "operators/math/conv_func.h"
#include "operators/math/im2col.h"
......@@ -115,20 +116,7 @@ void ConvAddAddPReluCompute(const FusionConvAddAddPReluParam<CPU> &param) {
Tensor filter_slice = filter.Slice(g * out_step, (g + 1) * out_step);
Tensor bias1_slice = bias1_batch.Slice(g * out_step, (g + 1) * out_step);
float *biase_data1 = bias1_slice.data<float>();
// int n = bias1_slice.dims()[0];
// int m = bias1_slice.dims()[1];
// for(int i=0;i<n*m;i++){
// if(biase_data1[i]!=0)
// DLOG<<biase_data1[i]<<",yangfei";
// }
// math::matmul<float>(filter_slice, false, col_matrix,
// false,
// static_cast<float>(1),
// &out_slice,
// static_cast<float>(1), true,
// biase_data);
math::matmulWithPRelu(filter_slice, false, col_matrix, false, &out_slice,
math::MatMulWithPRelu(filter_slice, false, col_matrix, false, &out_slice,
p, mode, biase_data, biase_data1);
}
}
......@@ -137,4 +125,4 @@ void ConvAddAddPReluCompute(const FusionConvAddAddPReluParam<CPU> &param) {
} // namespace operators
} // namespace paddle_mobile
#endif
#endif // FUSION_CONVADDADDPRELU_OP