Commit 832e763e authored by Ray Liu, committed by GitHub

Merge branch 'develop' into develop

......@@ -10,6 +10,7 @@ option(LOG_PROFILE "log profile" OFF)
option(CPU "armv7 with neon" ON)
option(GPU_MALI "mali gpu" OFF)
option(GPU_CL "opencl gpu" OFF)
option(FPGA "fpga" OFF)
if(FPGA)
option(FPGAV1 "fpga v1" ON)
......@@ -144,7 +145,7 @@ if(FPGA)
endforeach()
file(GLOB_RECURSE _tmp_list src/operators/kernel/fpga/V2/*.h src/fpga/V2/*.h)
foreach(f ${_tmp_list})
list(REMOVE_ITEM PADDLE_MOBILE_CC ${f})
list(REMOVE_ITEM PADDLE_MOBILE_H ${f})
endforeach()
endif()
if(FPGAV2)
......@@ -156,7 +157,7 @@ if(FPGA)
endforeach()
file(GLOB_RECURSE _tmp_list src/operators/kernel/fpga/V1/*.h src/fpga/V1/*.h)
foreach(f ${_tmp_list})
list(REMOVE_ITEM PADDLE_MOBILE_CC ${f})
list(REMOVE_ITEM PADDLE_MOBILE_H ${f})
endforeach()
endif()
......
......@@ -7,11 +7,21 @@
<!--[![Release](https://img.shields.io/github/release/PaddlePaddle/Paddle-Mobile.svg)](https://github.com/PaddlePaddle/Paddle-Mobile/releases)
[![License](https://img.shields.io/badge/license-Apache%202-blue.svg)](LICENSE)-->
Welcome to the Paddle-Mobile GitHub project. Paddle-Mobile is a project under PaddlePaddle and a deep learning framework for embedded platforms.
Welcome to the Paddle-Mobile GitHub project. Paddle-Mobile is a project under the PaddlePaddle organization, a deep learning framework dedicated to embedded platforms.
## Features
- High performance on ARM CPU
- Support for Mali GPU
- Support for Adreno GPU
- Support for Metal GPU on Apple devices
- Support for ZU5, ZU9 and other FPGA development boards
- Support for Raspberry Pi and other arm-linux development boards
## Features
- High performance on ARM CPU
- Support for Mali GPU
- Support for Adreno GPU
......@@ -19,6 +29,7 @@
- Support for ZU5, ZU9 and other FPGA development boards
- Support for Raspberry Pi and other arm-linux development boards
## Demo
- [ANDROID](https://github.com/xiebaiyuan/paddle-mobile-demo)
......@@ -26,6 +37,27 @@
[https://github.com/PaddlePaddle/paddle-mobile/tree/develop/demo](https://github.com/PaddlePaddle/paddle-mobile/tree/develop/demo)
## Documentation
### Design documentation
If you want to know more about the design of paddle-mobile, please refer to the link below. Many earlier designs and discussions can be found in the project [issues](https://github.com/PaddlePaddle/paddle-mobile/issues).
[design documentation link](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/doc/design_doc.md)
### Development documentation
The development documentation mainly covers building, running and related tasks. As a developer, you can work with it alongside the contributed documents below.
* [iOS](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/doc/development_ios.md)
* [Android_CPU](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/doc/development_android.md)
* [Android_GPU](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/doc/development_android_GPU.md)
* [FPGA](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/doc/development_fpga.md)
* [ARM_LINUX](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/doc/development_arm_linux.md)
### How to contribute your documents
- [tutorial for contributing documents](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/CONTRIBUTING.md)
- The main procedure for contributing code is covered in the document above. If you run into other problems along the way, please file an [issue](https://github.com/PaddlePaddle/paddle-mobile/issues). We will handle it as quickly as possible.
## Documentation
### Design documentation
......@@ -46,6 +78,24 @@
- [link to the contribution documentation](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/CONTRIBUTING.md)
- The document above covers the main workflow for contributing code. If you run into other problems in practice, you can file an [issue](https://github.com/PaddlePaddle/paddle-mobile/issues); we will handle it as soon as we see it.
## Acquisition of Models
At present Paddle-Mobile only supports models trained with Paddle Fluid. If your models are of other kinds, they need to be converted before they can run.
### 1. Train with Paddle Fluid directly
This is the most reliable method and the recommended one.
### 2. Convert a Caffe model to a Paddle Fluid model
[https://github.com/PaddlePaddle/models/tree/develop/fluid/image_classification/caffe2fluid](https://github.com/PaddlePaddle/models/tree/develop/fluid/image_classification/caffe2fluid)
### 3. ONNX
ONNX stands for Open Neural Network Exchange. The project aims at full interchange and reuse of models across different neural network development frameworks.
Besides directly using Fluid models trained with PaddlePaddle, you can also obtain certain Paddle Fluid models through ONNX conversion.
At present, ONNX support work is also under way at Baidu. The related conversion project can be found here:
[https://github.com/PaddlePaddle/paddle-onnx](https://github.com/PaddlePaddle/paddle-onnx)
### 4. Download some test models and test images
[http://mms-graph.bj.bcebos.com/paddle-mobile%2FmodelsAndImages.zip](http://mms-graph.bj.bcebos.com/paddle-mobile%2FmodelsAndImages.zip)
## Acquisition of Models
At present Paddle-Mobile only supports models trained with Paddle Fluid. If your models are of other kinds, they need to be converted before they can run.
......@@ -64,6 +114,22 @@ ONNX stands for "Open Neural Network Exchange"
### 4. Download some test models and test images
[http://mms-graph.bj.bcebos.com/paddle-mobile%2FmodelsAndImages.zip](http://mms-graph.bj.bcebos.com/paddle-mobile%2FmodelsAndImages.zip)
<!--## Online results of Simple Search
The gif below shows the online main-body detection results of the Simple Search app
![ezgif-1-050a733dfb](http://otkwwi4x8.bkt.clouddn.com/2018-07-05-ezgif-1-050a733dfb.gif)-->
## Ask Questions
Questions and fixes are welcome. You can post your questions in the GitHub issue module: [Github Issues](https://github.com/PaddlePaddle/paddle-mobile/issues).
## Copyright and License
Paddle-Mobile is provided under the relatively permissive [Apache-2.0 license](LICENSE).
## Old version Mobile-Deep-Learning
The original MDL (Mobile-Deep-Learning) project has been moved to [Mobile-Deep-Learning](https://github.com/allonli/mobile-deep-learning).
<!--## Online results of Simple Search
The gif below shows the online main-body detection results of the Simple Search app
......
......@@ -16,9 +16,9 @@ limitations under the License. */
#ifdef ENABLE_EXCEPTION
#include <stdio.h>
#include <stdlib.h>
#include <exception>
#include <string>
#endif
namespace paddle_mobile {
......
......@@ -22,9 +22,10 @@ const char *G_OP_TYPE_BATCHNORM = "batch_norm";
const char *G_OP_TYPE_BOX_CODER = "box_coder";
const char *G_OP_TYPE_CONCAT = "concat";
const char *G_OP_TYPE_ELEMENTWISE_ADD = "elementwise_add";
const char *G_OP_TYPE_ELEMENTWISE_SUB = "elementwise_sub";
const char *G_OP_TYPE_ELEMENTWISE_MUL = "elementwise_mul";
const char *G_OP_TYPE_FILL_CONSTANT = "fill_constant";
const char *G_OP_TYPE_FUSION_CONV_ADD_RELU = "fusion_conv_add_relu";
const char *G_OP_TYPE_FUSION_CONV_ADD_RELU_INT8 = "fusion_conv_add_relu_int8";
const char *G_OP_TYPE_FUSION_CONV_ADD_PRELU = "fusion_conv_add_prelu";
const char *G_OP_TYPE_FUSION_CONV_ADD_ADD_PRELU = "fusion_conv_add_add_prelu";
const char *G_OP_TYPE_FUSION_CONV_ADD_BN_RELU = "fusion_conv_add_bn_relu";
......@@ -32,7 +33,6 @@ const char *G_OP_TYPE_FUSION_CONV_BN_ADD_RELU = "fusion_conv_bn_add_relu";
const char *G_OP_TYPE_FUSION_DWCONV_BN_RELU = "fusion_dwconv_bn_relu";
const char *G_OP_TYPE_FUSION_CONV_BN_RELU = "fusion_conv_bn_relu";
const char *G_OP_TYPE_FC = "fusion_fc";
const char *G_OP_TYPE_FC_INT8 = "fusion_fc_int8";
const char *G_OP_TYPE_FUSION_CONV_ADD = "fusion_conv_add";
const char *G_OP_TYPE_LRN = "lrn";
const char *G_OP_TYPE_MUL = "mul";
......@@ -41,6 +41,7 @@ const char *G_OP_TYPE_POLYGON_BOX_TRANSFORM = "polygon_box_transform";
const char *G_OP_TYPE_POOL2D = "pool2d";
const char *G_OP_TYPE_PRIOR_BOX = "prior_box";
const char *G_OP_TYPE_RELU = "relu";
const char *G_OP_TYPE_RELU6 = "relu6";
const char *G_OP_TYPE_RESHAPE = "reshape";
const char *G_OP_TYPE_RESHAPE2 = "reshape2";
const char *G_OP_TYPE_SIGMOID = "sigmoid";
......@@ -68,14 +69,20 @@ const char *G_OP_TYPE_CRF = "crf_decoding";
const char *G_OP_TYPE_BILINEAR_INTERP = "bilinear_interp";
const char *G_OP_TYPE_FLATTEN = "flatten";
const char *G_OP_TYPE_SHAPE = "shape";
const char *G_OP_TYPE_ELEMENTWISE_MUL = "elementwise_mul";
const char *G_OP_TYPE_SUM = "sum";
const char *G_OP_TYPE_TOP_K = "top_k";
const char *G_OP_TYPE_CAST = "cast";
const char *G_OP_TYPE_QUANTIZE = "quantize";
const char *G_OP_TYPE_DEQUANTIZE = "dequantize";
const char *G_OP_TYPE_FUSION_DEQUANT_BN = "fusion_dequant_bn";
const char *G_OP_TYPE_FUSION_DEQUANT_ADD_BN = "fusion_dequant_add_bn";
const char *G_OP_TYPE_FUSION_DEQUANT_BN_RELU = "fusion_dequant_bn_relu";
const char *G_OP_TYPE_FUSION_DEQUANT_ADD_BN_RELU = "fusion_dequant_add_bn_relu";
const char *G_OP_TYPE_FUSION_DEQUANT_ADD_BN_QUANT =
"fusion_dequant_add_bn_quant";
const char *G_OP_TYPE_FUSION_DEQUANT_ADD_BN_RELU_QUANT =
"fusion_dequant_add_bn_relu_quant";
const char *G_OP_TYPE_TANH = "tanh";
const char *G_OP_TYPE_FUSION_DECONV_RELU = "fusion_deconv_relu";
......@@ -91,10 +98,13 @@ std::unordered_map<
{G_OP_TYPE_PRELU, {{"X", "Alpha"}, {"Out"}}},
{G_OP_TYPE_FUSION_CONV_ADD, {{"Input"}, {"Out"}}},
{G_OP_TYPE_RELU, {{"X"}, {"Out"}}},
{G_OP_TYPE_RELU6, {{"X"}, {"Out"}}},
{G_OP_TYPE_SOFTMAX, {{"X"}, {"Out"}}},
{G_OP_TYPE_SIGMOID, {{"X"}, {"Out"}}},
{G_OP_TYPE_MUL, {{"X"}, {"Out"}}},
{G_OP_TYPE_ELEMENTWISE_ADD, {{"X", "Y"}, {"Out"}}},
{G_OP_TYPE_ELEMENTWISE_SUB, {{"X", "Y"}, {"Out"}}},
{G_OP_TYPE_ELEMENTWISE_MUL, {{"X", "Y"}, {"Out"}}},
{G_OP_TYPE_POOL2D, {{"X"}, {"Out"}}},
{G_OP_TYPE_BATCHNORM, {{"X"}, {"Y"}}},
{G_OP_TYPE_LRN, {{"X"}, {"Out"}}},
......@@ -112,13 +122,11 @@ std::unordered_map<
{G_OP_TYPE_MULTICLASS_NMS, {{"BBoxes", "Scores"}, {"Out"}}},
{G_OP_TYPE_POLYGON_BOX_TRANSFORM, {{"Input"}, {"Output"}}},
{G_OP_TYPE_FC, {{"X", "Y", "Z"}, {"Out"}}},
{G_OP_TYPE_FC_INT8, {{"X", "Y", "Z", "Scale"}, {"Out"}}},
{G_OP_TYPE_RESHAPE, {{"X"}, {"Out"}}},
{G_OP_TYPE_RESHAPE2, {{"X"}, {"Out", "XShape"}}},
{G_OP_TYPE_DEPTHWISE_CONV, {{"Input"}, {"Output"}}},
{G_OP_TYPE_FILL_CONSTANT, {{}, {"Out"}}},
{G_OP_TYPE_FUSION_CONV_ADD_RELU, {{"Input"}, {"Out"}}},
{G_OP_TYPE_FUSION_CONV_ADD_RELU_INT8, {{"Input", "Scale"}, {"Out"}}},
{G_OP_TYPE_FUSION_CONV_ADD_PRELU, {{"Input"}, {"Out"}}},
{G_OP_TYPE_FUSION_CONV_ADD_ADD_PRELU, {{"Input"}, {"Out"}}},
{G_OP_TYPE_IM2SEQUENCE, {{"X"}, {"Out"}}},
......@@ -139,12 +147,18 @@ std::unordered_map<
{G_OP_TYPE_SHAPE, {{"Input"}, {"Out"}}},
{G_OP_TYPE_CONV_TRANSPOSE, {{"Input"}, {"Output"}}},
{G_OP_TYPE_SUM, {{"X"}, {"Out"}}},
{G_OP_TYPE_ELEMENTWISE_MUL, {{"X", "Y"}, {"Out"}}},
{G_OP_TYPE_TOP_K, {{"X"}, {"Out", "Indices"}}},
{G_OP_TYPE_CAST, {{"X"}, {"Out"}}},
{G_OP_TYPE_QUANTIZE, {{"X"}, {"Out", "OutScale"}}},
{G_OP_TYPE_DEQUANTIZE, {{"X", "Scale"}, {"Out"}}},
{G_OP_TYPE_FUSION_DEQUANT_ADD_BN, {{"X", "Scale"}, {"Y"}}},
{G_OP_TYPE_FUSION_DEQUANT_BN, {{"X", "Scale"}, {"Out"}}},
{G_OP_TYPE_FUSION_DEQUANT_ADD_BN, {{"X", "Scale"}, {"Out"}}},
{G_OP_TYPE_FUSION_DEQUANT_BN_RELU, {{"X", "Scale"}, {"Out"}}},
{G_OP_TYPE_FUSION_DEQUANT_ADD_BN_RELU, {{"X", "Scale"}, {"Out"}}},
{G_OP_TYPE_FUSION_DEQUANT_ADD_BN_RELU_QUANT,
{{"X", "Scale"}, {"Out", "OutScale"}}},
{G_OP_TYPE_FUSION_DEQUANT_ADD_BN_QUANT,
{{"X", "Scale"}, {"Out", "OutScale"}}},
{G_OP_TYPE_TANH, {{"X"}, {"Out"}}},
{G_OP_TYPE_FUSION_DECONV_RELU, {{"Input"}, {"Out"}}},
{G_OP_TYPE_FUSION_DECONV_ADD, {{"Input"}, {"Out"}}},
......
......@@ -87,10 +87,24 @@ enum PMStatus {
};
enum RoundType {
ROUND_UNK = 0,
ROUND_NEAREST_AWAY_ZERO = 1,
ROUND_NEAREST_TOWARDS_ZERO = 2,
ROUND_NEAREST_TO_EVEN = 3
ROUND_NEAREST_AWAY_ZERO = 0,
ROUND_NEAREST_TOWARDS_ZERO = 1,
ROUND_NEAREST_TO_EVEN = 2,
};
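// Illustrative tie-breaking behaviour of the three rounding modes above
// (a sketch, not part of this change):
//   value    AWAY_ZERO   TOWARDS_ZERO   TO_EVEN
//    2.5         3             2           2
//    3.5         4             3           4
//   -2.5        -3            -2          -2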
enum ActivationType {
IDENTITY = 0,
RELU = 1,
RELU6 = 2,
PRELU = 3,
LEAKY_RELU = 4,
TANH = 5,
SIGMOID = 6,
};
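// Sketch of the intended elementwise semantics (illustrative; the real
// kernels are implemented per backend):
//   IDENTITY: y = x                      RELU:    y = max(x, 0)
//   RELU6:    y = min(max(x, 0), 6)      TANH:    y = tanh(x)
//   PRELU / LEAKY_RELU: y = x > 0 ? x : slope * x
//   SIGMOID:  y = 1 / (1 + exp(-x))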
enum PoolingType {
MAX = 0,
AVG = 1,
};
extern const char *G_OP_TYPE_CONV;
......@@ -98,12 +112,12 @@ extern const char *G_OP_TYPE_BATCHNORM;
extern const char *G_OP_TYPE_BOX_CODER;
extern const char *G_OP_TYPE_CONCAT;
extern const char *G_OP_TYPE_ELEMENTWISE_ADD;
extern const char *G_OP_TYPE_ELEMENTWISE_SUB;
extern const char *G_OP_TYPE_ELEMENTWISE_MUL;
extern const char *G_OP_TYPE_FUSION_CONV_ADD_RELU;
extern const char *G_OP_TYPE_FUSION_CONV_ADD_RELU_INT8;
extern const char *G_OP_TYPE_FUSION_CONV_ADD_PRELU;
extern const char *G_OP_TYPE_FUSION_CONV_ADD_ADD_PRELU;
extern const char *G_OP_TYPE_FC;
extern const char *G_OP_TYPE_FC_INT8;
extern const char *G_OP_TYPE_FUSION_CONV_ADD;
extern const char *G_OP_TYPE_FUSION_CONV_ADD_BN_RELU;
extern const char *G_OP_TYPE_FUSION_CONV_BN_ADD_RELU;
......@@ -116,6 +130,7 @@ extern const char *G_OP_TYPE_MULTICLASS_NMS;
extern const char *G_OP_TYPE_POOL2D;
extern const char *G_OP_TYPE_PRIOR_BOX;
extern const char *G_OP_TYPE_RELU;
extern const char *G_OP_TYPE_RELU6;
extern const char *G_OP_TYPE_RESHAPE;
extern const char *G_OP_TYPE_SIGMOID;
extern const char *G_OP_TYPE_SOFTMAX;
......@@ -136,13 +151,17 @@ extern const char *G_OP_TYPE_FUSION_CONV_BN;
extern const char *G_OP_TYPE_CONV_TRANSPOSE;
extern const char *G_OP_TYPE_PRELU;
extern const char *G_OP_TYPE_SUM;
extern const char *G_OP_TYPE_ELEMENTWISE_MUL;
extern const char *G_OP_TYPE_TOP_K;
extern const char *G_OP_TYPE_CAST;
extern const char *G_OP_TYPE_QUANTIZE;
extern const char *G_OP_TYPE_DEQUANTIZE;
extern const char *G_OP_TYPE_FUSION_DEQUANT_BN;
extern const char *G_OP_TYPE_FUSION_DEQUANT_ADD_BN;
extern const char *G_OP_TYPE_FUSION_DEQUANT_BN_RELU;
extern const char *G_OP_TYPE_FUSION_DEQUANT_ADD_BN_RELU;
extern const char *G_OP_TYPE_FUSION_DEQUANT_ADD_BN_QUANT;
extern const char *G_OP_TYPE_FUSION_DEQUANT_ADD_BN_RELU_QUANT;
extern const char *G_OP_TYPE_TANH;
extern const char *G_OP_TYPE_FUSION_DECONV_RELU;
......
......@@ -24,8 +24,6 @@ namespace fpga {
#define USE_RELU 1
#define USE_BIAS 2
int get_align_image_cw(int cw) { return align_to_x(cw, IMAGE_ALIGNMENT); }
void format_image(framework::Tensor *image_tensor) {
auto dims = image_tensor->dims();
auto channel = dims[1], height = dims[2], width = dims[3];
......@@ -83,6 +81,13 @@ int get_plit_num(framework::Tensor *filter_tensor) {
int div_capacity = filter::calc_division_capacity(chw);
return filter::calc_split_num(num, div_capacity);
}
int get_deconv_plit_num(framework::Tensor *filter_tensor, int stride) {
auto dims = filter_tensor->dims();
auto chw = dims[1] * dims[2] / stride * dims[3] / stride;
auto num = dims[0] * stride;
int div_capacity = filter::calc_division_capacity(chw);
return filter::calc_split_num(num, div_capacity);
}
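// Illustrative numbers (not from this change): a deconv filter of dims
// [32, 3, 4, 4] with stride 2 is treated as num = 32 * 2 = 64 sub-filters
// of chw = 3 * (4 / 2) * (4 / 2) = 12 before the FPGA division capacity
// determines the split count.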
int get_filter_num_per_div(framework::Tensor *filter_tensor, int group_num) {
auto dims = filter_tensor->dims();
......@@ -92,12 +97,17 @@ int get_filter_num_per_div(framework::Tensor *filter_tensor, int group_num) {
return filter::calc_num_per_div(num, group_num, div_capacity);
}
int get_aligned_filter_element_num(int chw) {
return align_to_x(chw, FILTER_ELEMENT_ALIGNMENT);
int get_deconv_filter_num_per_div(framework::Tensor *filter_tensor,
int group_num, int stride) {
auto dims = filter_tensor->dims();
auto chw = dims[1] * dims[2] / stride * dims[3] / stride;
auto num = dims[0] * stride;
int div_capacity = filter::calc_division_capacity(chw);
return filter::calc_num_per_div(num, group_num, div_capacity);
}
int get_aligned_filter_num(int num) {
return align_to_x(num, FILTER_NUM_ALIGNMENT);
int get_aligned_filter_element_num(int chw) {
return align_to_x(chw, FILTER_ELEMENT_ALIGNMENT);
}
void format_filter(framework::Tensor *filter_tensor, float max_value,
......@@ -177,46 +187,37 @@ void format_concat_output(framework::Tensor *out, int height, int width,
void expand_conv_arg(ConvArgs *arg) {
ConvArgs args = *arg;
uint64_t filterlen = (uint64_t)args.kernel.width *
(uint64_t)args.kernel.height *
(uint64_t)args.image.channels;
filterlen = align_to_x(filterlen, FILTER_ELEMENT_ALIGNMENT);
filterlen *= align_to_x((uint64_t)args.filter_num, FILTER_NUM_ALIGNMENT);
uint64_t fpga_bias_scale_len =
auto fpga_bias_scale_len =
align_to_x(args.filter_num / args.group_num, 8) * args.group_num;
uint64_t output_height =
auto output_height =
(args.image.height + args.image.pad_height * 2 - args.kernel.height) /
args.kernel.stride_h +
1;
uint64_t output_width =
auto output_width =
(args.image.width + args.image.pad_width * 2 - args.kernel.width) /
args.kernel.stride_w +
1;
uint64_t output_size =
output_height * output_width * (uint64_t)args.filter_num;
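// Worked example (illustrative): a 224x224 image with pad_height =
// pad_width = 1, a 3x3 kernel and stride 2 gives output_height =
// (224 + 2 * 1 - 3) / 2 + 1 = 112, and with filter_num = 32 the output
// holds 112 * 112 * 32 values.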
auto filter_per_group = (uint64_t)(args.filter_num / args.group_num);
auto channel_per_group = (uint64_t)(args.image.channels / args.group_num);
uint64_t image_row_count = ((uint64_t)args.image.width) *
((uint64_t)args.image.channels); // without align
uint64_t image_amount_per_row = align_to_x(image_row_count, IMAGE_ALIGNMENT);
uint64_t image_one_pad_per_row =
align_to_x(image_row_count, IMAGE_ALIGNMENT) +
((uint64_t)args.image.pad_width) * ((uint64_t)args.image.channels);
uint64_t filter_amount_all =
align_to_x(((uint64_t)args.kernel.height) *
((uint64_t)args.kernel.width) * channel_per_group,
auto filter_per_group = args.filter_num / args.group_num;
auto channel_per_group = args.image.channels / args.group_num;
auto image_row_count = args.image.width * args.image.channels;
auto image_amount_per_row = align_to_x(image_row_count, IMAGE_ALIGNMENT);
auto image_one_pad_per_row = align_to_x(image_row_count, IMAGE_ALIGNMENT) +
args.image.pad_width * args.image.channels;
auto filter_amount_all =
align_to_x(args.kernel.height * args.kernel.width * channel_per_group,
FILTER_ELEMENT_ALIGNMENT);
uint64_t output_amount_per_row =
align_to_x(output_width * ((uint64_t)args.filter_num), IMAGE_ALIGNMENT);
auto output_amount_per_row =
align_to_x(output_width * args.filter_num, IMAGE_ALIGNMENT);
// find the opt partition strategy
uint64_t res_win;
uint64_t res_fit = 0;
for (res_win = 1; res_win <= output_width; res_win = res_win + 1) {
for (res_win = 1; res_win <= output_width; res_win++) {
if ((align_to_x(
(args.image.channels *
(args.kernel.width + (res_win - 1) * args.kernel.stride_w)),
......@@ -238,48 +239,48 @@ void expand_conv_arg(ConvArgs *arg) {
}
res_fit = res_win;
uint64_t block_num = (output_width + res_fit - 1) / res_fit;
uint64_t block_len = res_fit;
uint64_t block_last = output_width - res_fit * (block_num - 1);
auto block_num = (output_width + res_fit - 1) / res_fit;
auto block_len = res_fit;
auto block_last = output_width - res_fit * (block_num - 1);
uint64_t res_amount_per_row = output_width * args.filter_num;
uint64_t res_amount_per_row_pad = output_amount_per_row - res_amount_per_row;
auto res_amount_per_row = output_width * args.filter_num;
auto res_amount_per_row_pad = output_amount_per_row - res_amount_per_row;
uint64_t image_block_amount_per_row =
args.kernel.stride_w * (res_fit)*args.image.channels;
uint64_t filter_pad_width_mul_channel =
auto image_block_amount_per_row =
args.kernel.stride_w * res_fit * args.image.channels;
auto filter_pad_width_mul_channel =
args.image.pad_width * args.image.channels;
uint64_t image_amount_per_row_multi_win_first =
auto image_amount_per_row_multi_win_first =
image_amount_per_row * (4 * args.kernel.stride_h - args.image.pad_height);
uint64_t image_amount_per_row_multi_win =
auto image_amount_per_row_multi_win =
image_amount_per_row * (4 * args.kernel.stride_h);
uint64_t image_block_num = block_num;
uint64_t image_block_len =
auto image_block_num = block_num;
auto image_block_len =
align_to_x((args.image.channels *
(args.kernel.width + (block_len - 1) * args.kernel.stride_w)),
IMAGE_ALIGNMENT) /
16 +
1;
uint64_t image_block_len_last =
auto image_block_len_last =
align_to_x(
(args.image.channels *
(args.kernel.width + (block_last - 1) * args.kernel.stride_w)),
IMAGE_ALIGNMENT) /
16 +
1;
uint64_t image_win_cnt = block_len;
uint64_t image_win_cnt_last = block_last;
uint64_t res_row_data_align4_pad = res_amount_per_row_pad / 8;
uint64_t prog_full_cnt = 2048 / (filter_amount_all / 16 * 2) - 1;
auto image_win_cnt = block_len;
auto image_win_cnt_last = block_last;
auto res_row_data_align4_pad = res_amount_per_row_pad / 8;
auto prog_full_cnt = 2048 / (filter_amount_all / 16 * 2) - 1;
if (prog_full_cnt == 1023) {
prog_full_cnt--;
}
uint64_t post_prog_full_cnt =
auto post_prog_full_cnt =
(512 / (align_to_x(args.filter_num, 4) / 4 * 2) > 2)
? (512 / (align_to_x(args.filter_num, 4) / 4 * 2) - 2)
: 0;
uint64_t cmd = 0UL | (args.relu_enabled ? USE_RELU : 0) | USE_BIAS;
auto cmd = 0UL | (args.relu_enabled ? USE_RELU : 0) | USE_BIAS;
(*arg).driver.image_address_phy = vaddr_to_paddr(args.image.address);
(*arg).driver.sb_address_phy = vaddr_to_paddr(args.sb_address);
......@@ -449,7 +450,6 @@ void fill_deconv_arg(struct DeconvArgs *arg, framework::Tensor *input,
arg->sub_conv_num = (uint32_t)stride_h;
arg->filter_num = (uint32_t)filter->dims()[0];
int sub_conv_num = arg->sub_conv_num;
int sub_stride = 1;
int sub_pad = deconv_filter::deconv_calc_sub_pad((int)filter->dims()[3],
padding_w, stride_w);
int sub_filter_width = deconv_filter::deconv_get_sub_filter_axis(
......@@ -464,7 +464,9 @@ void fill_deconv_arg(struct DeconvArgs *arg, framework::Tensor *input,
arg->sub_output_height = (uint32_t)sub_output_height;
arg->omit_size = (uint32_t)deconv_filter::deconv_get_omit(
stride_w, (int)filter->dims()[3], padding_w);
arg->conv_args = (ConvArgs *)fpga_malloc(sub_conv_num * sizeof(ConvArgs));
arg->output.address = out_ptr;
arg->output.scale_address = out->scale;
int sub_channels = (int)input->dims()[1];
int omit_size = arg->omit_size;
......@@ -484,50 +486,160 @@ void fill_deconv_arg(struct DeconvArgs *arg, framework::Tensor *input,
int align_conv_sub_filter_count =
align_sub_filter_count * align_sub_filter_num;
int split_num =
group_num == 1 ? (uint32_t)get_deconv_plit_num(filter, sub_conv_num) : 1;
arg->split_conv_args =
(SplitConvArgs *)fpga_malloc(sub_conv_num * sizeof(SplitConvArgs));
for (int i = 0; i < sub_conv_num; ++i) {
arg->conv_args[i].filter_num = (arg->sub_conv_num) * (arg->filter_num);
arg->conv_args[i].group_num = (uint32_t)group_num;
arg->conv_args[i].filter_scale_address = filter->scale;
arg->conv_args[i].relu_enabled = relu_enabled;
arg->conv_args[i].kernel.width = (uint32_t)sub_filter_width;
arg->conv_args[i].kernel.height = (uint32_t)sub_filter_width;
arg->conv_args[i].kernel.stride_w = 1;
arg->conv_args[i].kernel.stride_h = 1;
// DeconvParam.conv_args[i].image.address = (void*)ptr_image;
arg->conv_args[i].image.scale_address = input->scale;
arg->conv_args[i].image.channels = (uint32_t)sub_channels;
arg->conv_args[i].image.width = (uint32_t)input->dims()[3];
arg->conv_args[i].image.height = (uint32_t)input->dims()[2];
arg->conv_args[i].image.pad_width = (uint32_t)sub_pad;
arg->conv_args[i].image.pad_height = (uint32_t)sub_pad;
arg->conv_args[i].image.address = input_ptr;
arg->conv_args[i].sb_address = (void *)bs_ptr;
auto filter_sub_space =
(char *)fpga_malloc(align_conv_sub_filter_count * sizeof(char));
fpga_copy(filter_sub_space,
(char *)filter_ptr + i * align_conv_sub_filter_count,
(size_t)align_conv_sub_filter_count);
arg->conv_args[i].filter_address = (void *)(filter_sub_space);
fpga_flush(filter_sub_space, (size_t)align_conv_sub_filter_count);
arg->split_conv_args[i].filter_num =
(arg->sub_conv_num) * (arg->filter_num);
arg->split_conv_args[i].group_num = (uint32_t)group_num;
arg->split_conv_args[i].split_num = split_num;
arg->split_conv_args[i].conv_arg =
(ConvArgs *)fpga_malloc(split_num * sizeof(ConvArgs));
arg->split_conv_args[i].concat_arg.height = sub_output_height;
arg->split_conv_args[i].concat_arg.width = sub_output_width;
arg->split_conv_args[i].concat_arg.image_num = split_num;
arg->split_conv_args[i].concat_arg.images_in =
(half **)fpga_malloc(split_num * sizeof(half *));
arg->split_conv_args[i].concat_arg.scales_in =
(float **)fpga_malloc(split_num * sizeof(float *));
arg->split_conv_args[i].concat_arg.channel_num =
(uint32_t *)fpga_malloc(split_num * sizeof(uint32_t));
// arg->split_conv_args[i].concat_arg.image_out =
// fpga_malloc(conv_output_size * sizeof(half));
// arg->split_conv_args[i].concat_arg.scale_out = fpga_malloc(2 *
// sizeof(float));
}
int filter_num_per_div =
get_deconv_filter_num_per_div(filter, group_num, stride_w);
int element_num = get_aligned_filter_element_num(
(int)(sub_channels * sub_filter_width * sub_filter_width));
int chw = sub_channels * sub_filter_width * sub_filter_width;
int division_capacity = filter::calc_division_capacity(chw);
int num_per_div_before_alignment =
filter::calc_num_per_div(sub_filter_num, group_num, division_capacity);
int num_per_div_after_alignment =
align_to_x(num_per_div_before_alignment, FILTER_NUM_ALIGNMENT);
int div_num = (sub_filter_num + num_per_div_before_alignment - 1) /
num_per_div_before_alignment;
int residual = sub_filter_num % num_per_div_before_alignment;
int num_after_alignment = num_per_div_after_alignment *
((residual == 0) ? div_num : (div_num - 1)) +
align_to_x(residual, FILTER_NUM_ALIGNMENT);
int filter_sub_conv_offset = element_num * num_after_alignment;
for (int i = 0; i < sub_conv_num; ++i) {
if (sub_conv_num == 1) {
arg->conv_args[i].output.address = out_ptr;
arg->conv_args[i].output.scale_address = out->scale;
arg->split_conv_args[i].output.address = arg->output.address;
arg->split_conv_args[i].output.scale_address = arg->output.scale_address;
} else {
auto ptr_output = (half *)fpga_malloc(conv_output_size * sizeof(half));
arg->conv_args[i].output.address = (void *)((half *)ptr_output);
arg->split_conv_args[i].output.address = (void *)((half *)ptr_output);
auto ptr_output_scale = (float *)fpga_malloc(2 * sizeof(float));
arg->conv_args[i].output.scale_address = ptr_output_scale;
arg->split_conv_args[i].output.scale_address = ptr_output_scale;
}
}
arg->output.address = out_ptr;
arg->output.scale_address = out->scale;
// fpga_free(filter_ptr);
for (int j = 0; j < split_num; ++j) {
arg->split_conv_args[i].conv_arg[j].relu_enabled = relu_enabled;
arg->split_conv_args[i].conv_arg[j].group_num = (uint32_t)group_num;
arg->split_conv_args[i].conv_arg[j].kernel.width =
(uint32_t)sub_filter_width;
arg->split_conv_args[i].conv_arg[j].kernel.height =
(uint32_t)sub_filter_width;
arg->split_conv_args[i].conv_arg[j].kernel.stride_w = 1;
arg->split_conv_args[i].conv_arg[j].kernel.stride_h = 1;
arg->split_conv_args[i].conv_arg[j].image.scale_address = input->scale;
arg->split_conv_args[i].conv_arg[j].image.channels =
(uint32_t)sub_channels;
arg->split_conv_args[i].conv_arg[j].image.width =
(uint32_t)input->dims()[3];
arg->split_conv_args[i].conv_arg[j].image.height =
(uint32_t)input->dims()[2];
arg->split_conv_args[i].conv_arg[j].image.pad_width = (uint32_t)sub_pad;
arg->split_conv_args[i].conv_arg[j].image.pad_height = (uint32_t)sub_pad;
arg->split_conv_args[i].conv_arg[j].image.address = input_ptr;
arg->split_conv_args[i].conv_arg[j].filter_scale_address = filter->scale;
arg->split_conv_args[i].conv_arg[j].filter_num = (uint32_t)(
j == split_num - 1
? sub_filter_num - (split_num - 1) * filter_num_per_div // NOLINT
: filter_num_per_div);
size_t filter_size =
element_num *
align_to_x(arg->split_conv_args[i].conv_arg[j].filter_num,
FILTER_NUM_ALIGNMENT) *
sizeof(int8_t);
auto filter_head =
&((int8_t *)filter_ptr)[j * element_num * filter_num_per_div +
i * filter_sub_conv_offset];
arg->split_conv_args[i].conv_arg[j].filter_address =
fpga_malloc(filter_size);
memcpy(arg->split_conv_args[i].conv_arg[j].filter_address, filter_head,
filter_size);
fpga_flush(arg->split_conv_args[i].conv_arg[j].filter_address,
filter_size);
{
static int test_cnt = 0;
signed char result = 0;
if (test_cnt <= 1) {
std::string filename = "deconv_split_flt" + std::to_string(test_cnt);
fpga::savefile<signed char>(
filename, arg->split_conv_args[i].conv_arg[j].filter_address,
filter_size, result);
test_cnt++;
}
}
size_t bs_align_num = align_to_x(
arg->split_conv_args[i].conv_arg[j].filter_num, BS_NUM_ALIGNMENT);
size_t bs_size = 2 * bs_align_num * sizeof(float);
auto bs_head = &bs_ptr[j * filter_num_per_div * 2];
arg->split_conv_args[i].conv_arg[j].sb_address = fpga_malloc(bs_size);
memcpy(arg->split_conv_args[i].conv_arg[j].sb_address, bs_head, bs_size);
fpga_flush(arg->split_conv_args[i].conv_arg[j].sb_address, bs_size);
if (split_num == 1) {
arg->split_conv_args[i].conv_arg[j].output.address =
arg->split_conv_args[i].output.address;
arg->split_conv_args[i].conv_arg[j].output.scale_address =
arg->split_conv_args[i].output.scale_address;
} else {
auto ptr_output = (half *)fpga_malloc(conv_output_size * sizeof(half));
arg->split_conv_args[i].conv_arg[j].output.address =
(void *)((half *)ptr_output);
auto ptr_output_scale = (float *)fpga_malloc(2 * sizeof(float));
arg->split_conv_args[i].conv_arg[j].output.scale_address =
ptr_output_scale;
}
arg->split_conv_args[i].concat_arg.images_in[j] =
(half *)arg->split_conv_args[i].conv_arg[j].output.address; // NOLINT
arg->split_conv_args[i].concat_arg.scales_in[j] =
arg->split_conv_args[i].conv_arg[j].output.scale_address;
arg->split_conv_args[i].concat_arg.channel_num[j] =
arg->split_conv_args[i].conv_arg[j].filter_num;
expand_conv_arg(&(arg->split_conv_args[i].conv_arg[j]));
}
arg->split_conv_args[i].concat_arg.image_out =
arg->split_conv_args[i].output.address;
arg->split_conv_args[i].concat_arg.scale_out =
arg->split_conv_args[i].output.scale_address;
}
filter->reset_data_ptr(nullptr);
fpga_free(bs_ptr);
} // fill_deconv_arg
} // namespace fpga
......
......@@ -21,16 +21,19 @@ limitations under the License. */
namespace paddle_mobile {
namespace fpga {
int get_align_image_cw(int cw);
void format_image(framework::Tensor* image_tensor);
void format_fp16_ofm(framework::Tensor* ofm_tensor); // only allocate memory
void format_fp32_ofm(framework::Tensor* ofm_tensor);
float filter_find_max(framework::Tensor* filter_tensor);
int get_filter_num_per_div(framework::Tensor* filter_tensor, int group_num);
int get_deconv_filter_num_per_div(framework::Tensor* filter_tensor,
int group_num, int stride);
int get_plit_num(framework::Tensor* filter_tensor);
int get_deconv_plit_num(framework::Tensor* filter_tensor, int stride);
int get_aligned_filter_element_num(int chw);
int get_aligned_filter_num(int num);
void format_filter(framework::Tensor* filter_tensor, float max_value,
int group_num);
void format_fc_filter(framework::Tensor* filter_tensor, float max_value);
......
......@@ -40,10 +40,9 @@ inverse kernel weights of each channel for every filter
void deconv_inverse_filter(float** data_in, int num, int channel, int width,
int height) {
float* tmp = *data_in;
// float fix_range = 127;// float scale = fix_range / max;
int data_size = num * channel * width * height;
int hw_len = height * width;
float* tmp_data = (float*)fpga_malloc(data_size * sizeof(float));
auto tmp_data = (float*)fpga_malloc(data_size * sizeof(float));
for (int i = 0; i < num; ++i) {
for (int j = 0; j < channel; ++j) {
for (int k = 0; k < hw_len; ++k) {
......@@ -52,7 +51,7 @@ void deconv_inverse_filter(float** data_in, int num, int channel, int width,
}
}
}
*data_in = (float*)tmp_data; //
*data_in = tmp_data;
fpga_free(tmp);
}
......@@ -61,8 +60,7 @@ void deconv_inverse_filter(float** data_in, int num, int channel, int width,
*/
int deconv_calc_sub_pad(int filter_axis, int pad, int stride) {
if (stride == 0 || ((filter_axis - pad - 1) < 0)) {
// error
return 0;
PADDLE_MOBILE_ENFORCE(false, "Wrong deconv parameters");
}
return (filter_axis - pad - 1) / stride;
}
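// Worked example (illustrative): a 4x4 deconv kernel (filter_axis = 4) with
// pad = 1 and stride = 2 pads each sub-convolution by (4 - 1 - 1) / 2 = 1.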
......@@ -79,11 +77,8 @@ int deconv_get_sub_out_axis(int image_axis, int sub_pad, int sub_filter_axis) {
position. so the omit rows or columns is (stride - )
*/
int deconv_get_omit(int stride, int filter_width, int pad) {
if (((filter_width - pad) <= 0)) { // ((filter_width-pad) > stride) ||
// error
return 0;
}
int idx = 1;
PADDLE_MOBILE_ENFORCE(filter_width > pad, "Wrong deconv parameters");
int idx;
bool flag = false;
for (idx = 1; idx <= stride; ++idx) {
int j = idx;
......@@ -102,10 +97,6 @@ int deconv_get_omit(int stride, int filter_width, int pad) {
return (stride - idx);
}
int deconv_get_sub_filter_num(int filter_num, int stride) {
return filter_num * stride;
}
void deconv_get_sub_filter(char** data_in, int height, int width,
int sub_conv_n, int kernel_num, int channel) {
char* ptr_tmp = *data_in;
......@@ -245,7 +236,6 @@ void deconv_format_filter(float** data_in, int num, int channel, int height,
char* ptr_space = (char*)fpga_malloc(sub_conv_n * align_offset *
sizeof(char)); // continuous space
for (int i = 0; i < sub_conv_n; ++i) {
int offset = i * origin_offset;
char* ptr_tmp = (ptr_ptr_data)[i];
filter::align_element(&ptr_tmp, sub_num, sub_chw);
......
......@@ -21,7 +21,6 @@ namespace deconv_filter {
void deconv_inverse_filter(float** data_in, int num, int channel, int width,
int height);
int deconv_calc_sub_pad(int filter_axis, int pad, int stride);
int deconv_get_sub_filter_num(int filter_num, int stride);
int deconv_get_sub_filter_axis(int filter_axis, int stride);
int deconv_get_sub_out_axis(int image_axis, int sub_pad, int sub_filter_axis);
int deconv_get_omit(int stride, int filter_width, int pad);
......
This diff is collapsed.
......@@ -153,10 +153,6 @@ int memory_request(struct fpga_memory *memory, size_t size, uint64_t *addr) {
uint64_t _nr = DIV_ROUND_UP(size, FPGA_PAGE_SIZE);
unsigned int nr = (unsigned int)_nr;
int ret = 0;
DLOG << size;
DLOG << _nr;
DLOG << nr;
uint64_t a_size = FPGA_PAGE_SIZE * nr;
DLOG << a_size;
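// e.g. a 40 KB request with FPGA_PAGE_SIZE = 16 KB gives nr = 3 and
// a_size = 48 KB: requests are rounded up to whole FPGA pages
// (illustrative numbers).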
......@@ -283,7 +279,7 @@ int fpga_memory_add() {
return 0;
}
uint64_t vaddr_to_paddr(void *address) {
uint64_t vaddr_to_paddr_driver(void *address) {
uint64_t paddr = 0;
auto iter = g_fpgainfo.fpga_vaddr2paddr_map.find(address);
if (iter != g_fpgainfo.fpga_vaddr2paddr_map.end()) {
......@@ -315,7 +311,7 @@ void *fpga_reg_free(void *ptr) {
g_fpgainfo.fpga_addr2size_map.erase(iter);
munmap(ptr, size);
} else {
DLOG << "Invalid pointer";
DLOG << "Invalid pointer" << ptr;
}
}
......@@ -347,7 +343,7 @@ void fpga_free_driver(void *ptr) {
g_fpgainfo.fpga_addr2size_map.erase(iter);
munmap(ptr, size);
p_addr = vaddr_to_paddr(ptr);
p_addr = vaddr_to_paddr_driver(ptr);
pos = (p_addr - g_fpgainfo.memory_info->mem_start) / FPGA_PAGE_SIZE;
/*clear bitmap*/
......@@ -361,7 +357,7 @@ void fpga_free_driver(void *ptr) {
g_fpgainfo.fpga_vaddr2paddr_map.erase(iter);
}
} else {
DLOG << "Invalid pointer";
DLOG << "Invalid pointer" << ptr;
}
}
......@@ -373,7 +369,7 @@ int fpga_flush_driver(void *address, size_t size) {
struct MemoryCacheArgs args;
uint64_t p_addr;
p_addr = vaddr_to_paddr(address);
p_addr = vaddr_to_paddr_driver(address);
args.offset = (void *)(p_addr - FPGA_MEM_PHY_ADDR); // NOLINT
args.size = size;
......@@ -385,7 +381,7 @@ int fpga_invalidate_driver(void *address, size_t size) {
struct MemoryCacheArgs args;
uint64_t p_addr;
p_addr = vaddr_to_paddr(address);
p_addr = vaddr_to_paddr_driver(address);
args.offset = (void *)(p_addr - FPGA_MEM_PHY_ADDR); // NOLINT
args.size = size;
......
......@@ -31,8 +31,8 @@ namespace driver {
#define FPGA_REG_PHY_ADDR 0xa0000000
#define FPGA_REG_SIZE 0x1000
#define FPGA_MEM_PHY_ADDR 0x20000000
#define FPGA_MEM_SIZE 0x20000000
#define FPGA_MEM_PHY_ADDR 0x40000000
#define FPGA_MEM_SIZE 0x80000000
#define FPGA_PAGE_SIZE (16UL * 1024UL)
......@@ -122,15 +122,11 @@ void *fpga_malloc_driver(size_t size);
void fpga_free_driver(void *ptr);
void fpga_copy_driver(void *dest, const void *src, size_t num);
int fpga_flush_driver(void *address, size_t size);
int fpga_invalidate_driver(void *address, size_t size);
/*pe*/
uint64_t vaddr_to_paddr(void *address);
uint64_t vaddr_to_paddr_driver(void *address);
int fpga_regpoll(uint64_t reg, uint64_t val, int time);
......
......@@ -59,6 +59,9 @@ int close_device() {
void *fpga_malloc(size_t size) {
static uint64_t counter = 0;
if (size <= 0) {
size = 1;
}
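// A zero-sized request is bumped to one byte so the caller always receives
// a valid, individually freeable pointer.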
#ifdef PADDLE_MOBILE_ZU5
auto ptr = driver::fpga_malloc_driver(size);
#else
......@@ -115,7 +118,7 @@ int fpga_invalidate(void *address, size_t size) {
}
uint64_t vaddr_to_paddr(void *address) {
#ifdef PADDLE_MOBILE_ZU5
return driver::vaddr_to_paddr(address);
return driver::vaddr_to_paddr_driver(address);
#else
return 0;
#endif
......
......@@ -37,6 +37,18 @@ enum LayoutType {
LAYOUT_HWC = 0,
};
enum ActivationType {
NONE = 0,
LEAKYRELU = 1,
SIGMOID = 2,
TANH = 3,
};
struct ActivationArgs {
enum ActivationType activation_type;
int16_t leaky_relu_negative_slope;
};
struct KernelArgs {
uint32_t width;
uint32_t height;
......@@ -58,7 +70,10 @@ struct ImageOutputArgs {
void* address; // output result address;
float* scale_address; // output scale address;
uint64_t timer_cnt; // time counter for FPGA computation
struct ActivationArgs
activation; // To select activation and specify (Leaky)Relu parameter.
};
#ifdef PADDLE_MOBILE_FPGA_V1
struct ConvDriverParam {
uint64_t image_address_phy;
......@@ -195,10 +210,14 @@ struct DeconvArgs {
uint32_t sub_output_width;
uint32_t sub_output_height;
struct ImageOutputArgs output;
struct ConvArgs* conv_args;
struct SplitConvArgs* split_conv_args;
};
static inline int align_to_x(int num, int x) { return (num + x - 1) / x * x; }
// static inline int align_to_x(int num, int x) { return (num + x - 1) / x * x;
// }
static inline uint32_t align_to_x(int64_t num, int64_t x) {
return ((uint32_t)(num + x) - 1) / (uint32_t)x * (uint32_t)x;
}
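// Quick sanity check of the rewritten helper (illustrative):
//   align_to_x(20, 16) == (20 + 16 - 1) / 16 * 16 == 32
//   align_to_x(32, 16) == 32   (multiples are unchanged)
//   align_to_x(1, 8)   == 8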
int16_t fp32_2_fp16(float fp32_num);
float fp16_2_fp32(int16_t fp16_num);
......
......@@ -28,6 +28,10 @@ extern _PaddleMobile__Framework__Proto__VarType__Type ToDataType(
extern std::type_index ToTypeIndex(
_PaddleMobile__Framework__Proto__VarType__Type type);
inline _PaddleMobile__Framework__Proto__VarType__Type ToDataType(int type) {
return static_cast<_PaddleMobile__Framework__Proto__VarType__Type>(type);
}
template <typename Visitor>
inline void VisitDataType(_PaddleMobile__Framework__Proto__VarType__Type type,
Visitor visitor) {
......
This diff is collapsed.
......@@ -17,6 +17,7 @@ limitations under the License. */
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>
#include "common/types.h"
#include "common/util.h"
......@@ -28,41 +29,29 @@ limitations under the License. */
namespace paddle_mobile {
namespace framework {
template <typename Dtype = CPU, Precision P = Precision::FP32>
template <typename Device, typename T = float>
class Executor {
public:
typedef typename PrecisionTrait<P>::ptype Ptype;
// exector constructor
// @param program program converted from proto program in PaddlePaddle
// @param use_optimize bool whether use operator fusion to speed up or not
// @param loddable bool
Executor(const framework::Program<Dtype> program, int batch_size = 1,
const bool use_optimize = true, const bool loddable = false);
// predict with tensor input
// @param t input tensor to do prediction
// @return predicted tensor
std::shared_ptr<framework::Tensor> Predict(const framework::Tensor &t);
// predict with lod tensor input
// @param t input lod tensor to do prediction
// @return predicted lod tensor
std::shared_ptr<framework::LoDTensor> PredictLod(
const framework::LoDTensor &t);
// predict with vector input and dims
// @param input vector whose elements will be formed
// @param input lod tensor to do prediction
// @param dims vector whose elements will be formed
// @param input tensor shape
// @return vector which is flatted from predicted tensor
std::vector<Ptype> Predict(const std::vector<Ptype> &input,
const std::vector<int64_t> &dims);
Executor(const Program<Device> &program, int batch_size = 1,
const bool use_optimize = true, const bool lod_mode = false);
PMStatus Predict(const std::vector<std::pair<std::string, Tensor>> &inputs);
PMStatus Predict(
const std::vector<std::pair<std::string, LoDTensor>> &inputs);
std::vector<T> Predict(const std::vector<T> &input,
const std::vector<int64_t> &dims);
PMStatus Predict();
void SetInput(const Tensor &input, const std::string &var_name);
void SetInput(const LoDTensor &input, const std::string &var_name);
std::shared_ptr<LoDTensor> GetOutput(const std::string &var_name);
#ifdef PADDLE_MOBILE_FPGA
void InjectVariable(const framework::Tensor &t, std::string var_name);
void FeedData(const framework::Tensor &t);
std::shared_ptr<framework::Tensor> FetchResult(int id = -1);
void InjectVariable(const Tensor &t, std::string var_name);
void FeedData(const Tensor &t);
std::shared_ptr<Tensor> FetchResult(int id = -1);
void Predict_From_To(int start = 0, int end = -1);
void Predict_From(int start);
void Predict_To(int end);
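For orientation, a hedged usage sketch of the reworked interface above; the variable names "image" and "fetch", the input shape, and the already-loaded `program` are assumptions, not taken from this diff:

```cpp
using paddle_mobile::CPU;
using namespace paddle_mobile::framework;

// program: a Program<CPU> previously produced by Loader (assumption)
Executor<CPU, float> executor(program, /*batch_size=*/1,
                              /*use_optimize=*/true, /*lod_mode=*/false);

Tensor input;
input.Resize(make_ddim({1, 3, 224, 224}));
input.mutable_data<float>();                // fill with real image data

executor.SetInput(input, "image");          // bind the input by variable name
executor.Predict();                         // run the whole program once
auto output = executor.GetOutput("fetch");  // std::shared_ptr<LoDTensor>
```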
......@@ -70,26 +59,28 @@ class Executor {
protected:
Executor() = default;
std::shared_ptr<framework::Tensor> Predict(const framework::Tensor &t,
int block_id);
bool varInputMemory(const std::shared_ptr<framework::VarDesc> &var_desc,
framework::Variable *var,
framework::LoDTensor *tensor) const;
bool varInputMemory(const std::shared_ptr<VarDesc> &var_desc, Variable *var,
LoDTensor *tensor) const;
void InitMemory();
void InitCombineMemory();
void LoadMemory(void **data,
const std::shared_ptr<framework::VarDesc> var_desc,
framework::LoDTensor *tensor);
void LoadMemory(void **data, const std::shared_ptr<VarDesc> var_desc,
LoDTensor *tensor);
#ifdef PADDLE_MOBILE_CL
void LoadMemory(const framework::VarDesc var_desc, float *tensorInput,
char **data);
void LoadMemory(const VarDesc var_desc, float *tensorInput, char **data);
#endif
framework::Program<Dtype> program_;
int batch_size_ = 1;
std::shared_ptr<framework::ProgramDesc> to_predict_program_;
std::map<framework::BlockDesc,
std::vector<std::shared_ptr<framework::OperatorBase<Dtype>>>>
ops_of_block_;
int batch_size_;
bool use_optimize_;
bool lod_mode_;
Program<Device> program_;
std::shared_ptr<ProgramDesc> program_desc_;
typedef std::shared_ptr<OperatorBase<Device>> OperatorBasePtr;
std::vector<std::vector<OperatorBasePtr>> ops_of_block_;
// operators list
std::vector<OperatorBasePtr> ops_list_;
#ifdef PADDLE_MOBILE_PROFILE
struct ProfInfo {
int tid = 0;
......@@ -97,8 +88,6 @@ class Executor {
uint64_t runEnd = 0UL;
};
#endif
bool use_optimize_ = false;
bool loddable_ = false;
};
} // namespace framework
......
......@@ -191,6 +191,7 @@ LOAD_OP2(mul, CPU, MALI_GPU);
#endif
#ifdef RELU_OP
LOAD_OP2(relu, CPU, MALI_GPU);
LOAD_OP1(relu6, CPU);
#endif
#ifdef IM2SEQUENCE_OP
LOAD_OP1(im2sequence, CPU);
......@@ -227,12 +228,22 @@ LOAD_FUSION_MATCHER(fusion_conv_bn);
#ifdef ELEMENTWISESUB_OP
LOAD_OP1(elementwise_sub, CPU)
#endif
#ifdef TOP_K_OP
LOAD_OP1(top_k, CPU)
#endif
#ifdef CAST_OP
LOAD_OP1(cast, CPU)
#endif
#ifdef QUANT_OP
LOAD_OP1(quantize, CPU);
#endif
#ifdef DEQUANT_OP
LOAD_OP1(dequantize, CPU);
#endif
#ifdef FUSION_DEQUANT_BN_OP
LOAD_OP1(fusion_dequant_bn, CPU);
LOAD_FUSION_MATCHER(fusion_dequant_bn);
#endif
#ifdef FUSION_DEQUANT_ADD_BN_OP
LOAD_OP1(fusion_dequant_add_bn, CPU);
LOAD_FUSION_MATCHER(fusion_dequant_add_bn);
......@@ -245,3 +256,11 @@ LOAD_FUSION_MATCHER(fusion_dequant_bn_relu);
LOAD_OP1(fusion_dequant_add_bn_relu, CPU);
LOAD_FUSION_MATCHER(fusion_dequant_add_bn_relu);
#endif
#ifdef FUSION_DEQUANT_ADD_BN_QUANT_OP
LOAD_OP1(fusion_dequant_add_bn_quant, CPU);
LOAD_FUSION_MATCHER(fusion_dequant_add_bn_quant);
#endif
#ifdef FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP
LOAD_OP1(fusion_dequant_add_bn_relu_quant, CPU);
LOAD_FUSION_MATCHER(fusion_dequant_add_bn_relu_quant);
#endif
......@@ -23,14 +23,8 @@ limitations under the License. */
namespace paddle_mobile {
namespace framework {
/**
* muteandresize tensor as originProgramDesc and scope in loadParams
*
* @param originProgramDesc
* @param scope
*/
template <typename Dtype, Precision P>
void Loader<Dtype, P>::InitMemoryFromProgram(
template <typename Device, typename T>
void Loader<Device, T>::InitMemoryFromProgram(
const std::shared_ptr<ProgramDesc> &originProgramDesc,
const std::shared_ptr<Scope> &scope) {
for (const auto &block : originProgramDesc.get()->Blocks()) {
......@@ -43,8 +37,6 @@ void Loader<Dtype, P>::InitMemoryFromProgram(
tensor->Resize(make_ddim(dim));
} else {
auto dim = var_desc->Tensor_desc().Dims();
// PADDLE_MOBILE_ENFORCE(dim.size() > 0, "dim size is 0");
// dim[0] = 1;
if (dim.size() == 0) {
auto tensor = var->GetMutable<LoDTensor>();
framework::DDim dDim = {0};
......@@ -60,7 +52,7 @@ void Loader<Dtype, P>::InitMemoryFromProgram(
}
}
} else {
// TODO(codeWorm): some.
// TODO(codeWorm)
}
}
}
......@@ -68,7 +60,7 @@ void Loader<Dtype, P>::InitMemoryFromProgram(
#ifdef PADDLE_MOBILE_CL
template <>
void Loader<GPU_CL, Precision::FP32>::InitMemoryFromProgram(
void Loader<GPU_CL, float>::InitMemoryFromProgram(
const std::shared_ptr<ProgramDesc> &originProgramDesc,
const std::shared_ptr<Scope> &scope) {
for (const auto &block : originProgramDesc.get()->Blocks()) {
......@@ -77,7 +69,6 @@ void Loader<GPU_CL, Precision::FP32>::InitMemoryFromProgram(
if (var_desc->Type() == VARTYPE_TYPE_LOD_TENSOR) {
if (var_desc->Persistable()) {
auto dim = var_desc->Tensor_desc().Dims();
// auto tensor = var->GetMutable<LoDTensor>();
auto cl_image = var->GetMutable<framework::CLImage>();
cl_image->Resize(make_ddim(dim));
} else {
......@@ -88,14 +79,13 @@ void Loader<GPU_CL, Precision::FP32>::InitMemoryFromProgram(
cl_image->Resize(make_ddim(dim));
}
} else {
// TODO(codeWorm): some.
// TODO(codeWorm)
}
}
}
}
template <>
const Program<GPU_CL, Precision::FP32>
Loader<GPU_CL, Precision::FP32>::LoadCombinedMemory(
const Program<GPU_CL, float> Loader<GPU_CL, float>::LoadCombinedMemory(
size_t read_size, const uint8_t *buf, size_t combined_params_len,
uint8_t *combined_params_buf, bool optimize, bool quantification) {
bool can_add_split = false;
......@@ -113,7 +103,7 @@ Loader<GPU_CL, Precision::FP32>::LoadCombinedMemory(
auto originProgramDesc = std::make_shared<ProgramDesc>(c_program);
Program<GPU_CL, Precision::FP32> program;
Program<GPU_CL, float> program;
program.combined = true;
program.originProgram = originProgramDesc;
program.quantification = quantification;
......@@ -145,16 +135,16 @@ Loader<GPU_CL, Precision::FP32>::LoadCombinedMemory(
/**
* fusion and print some info
* @tparam Dtype
* @tparam Device
* @tparam P
* @param optimize
* @param can_add_split
* @param program
* @param originProgramDesc
*/
template <typename Dtype, Precision P>
template <typename Device, typename T>
void FusionAndPrintInfos(
bool optimize, bool can_add_split, Program<Dtype, P> *program,
bool optimize, bool can_add_split, Program<Device, T> *program,
const std::shared_ptr<ProgramDesc> &originProgramDesc) {
if (optimize) {
ProgramOptimize program_optimize;
......@@ -193,22 +183,22 @@ static size_t ReadBuffer(const char *file_name, uint8_t **out) {
return cur_len;
}
template <typename Dtype, Precision P>
const Program<Dtype, P> Loader<Dtype, P>::Load(const std::string &dirname,
bool optimize,
bool quantification,
bool can_add_split) {
template <typename Device, typename T>
const Program<Device, T> Loader<Device, T>::Load(const std::string &dirname,
bool optimize,
bool quantification,
bool can_add_split) {
auto program = this->LoadProgram(dirname + "/__model__", optimize,
quantification, can_add_split);
program.model_path = dirname;
return program;
}
template <typename Dtype, Precision P>
const Program<Dtype, P> Loader<Dtype, P>::Load(const std::string &model_path,
const std::string &para_path,
bool optimize,
bool quantification) {
template <typename Device, typename T>
const Program<Device, T> Loader<Device, T>::Load(const std::string &model_path,
const std::string &para_path,
bool optimize,
bool quantification) {
auto program = this->LoadProgram(model_path, optimize, quantification);
program.para_path = para_path;
......@@ -217,8 +207,8 @@ const Program<Dtype, P> Loader<Dtype, P>::Load(const std::string &model_path,
return program;
}
template <typename Dtype, Precision P>
const Program<Dtype, P> Loader<Dtype, P>::LoadProgram(
template <typename Device, typename T>
const Program<Device, T> Loader<Device, T>::LoadProgram(
const std::string &model_path, bool optimize, bool quantification,
bool can_add_split) {
std::string model_filename = model_path;
......@@ -237,7 +227,7 @@ const Program<Dtype, P> Loader<Dtype, P>::LoadProgram(
//
auto originProgramDesc = std::make_shared<ProgramDesc>(c_program);
Program<Dtype, P> program;
Program<Device, T> program;
program.originProgram = originProgramDesc;
program.quantification = quantification;
program.combined_params_len = 0;
......@@ -254,8 +244,8 @@ const Program<Dtype, P> Loader<Dtype, P>::LoadProgram(
return program;
}
template <typename Dtype, Precision P>
const Program<Dtype, P> Loader<Dtype, P>::LoadCombinedMemory(
template <typename Device, typename T>
const Program<Device, T> Loader<Device, T>::LoadCombinedMemory(
size_t read_size, const uint8_t *buf, size_t combined_params_len,
uint8_t *combined_params_buf, bool optimize, bool quantification) {
bool can_add_split = false;
......@@ -273,7 +263,7 @@ const Program<Dtype, P> Loader<Dtype, P>::LoadCombinedMemory(
auto originProgramDesc = std::make_shared<ProgramDesc>(c_program);
Program<Dtype, P> program;
Program<Device, T> program;
program.combined = true;
program.originProgram = originProgramDesc;
program.quantification = quantification;
......@@ -289,13 +279,13 @@ const Program<Dtype, P> Loader<Dtype, P>::LoadCombinedMemory(
return program;
}
template class Loader<CPU, Precision::FP32>;
template class Loader<CPU, float>;
template class Loader<FPGA, Precision::FP32>;
template class Loader<FPGA, float>;
template class Loader<GPU_MALI, Precision::FP32>;
template class Loader<GPU_MALI, float>;
template class Loader<GPU_CL, Precision::FP32>;
template class Loader<GPU_CL, float>;
} // namespace framework
} // namespace paddle_mobile
......@@ -22,39 +22,39 @@ limitations under the License. */
namespace paddle_mobile {
namespace framework {
template <typename Dtype = CPU, Precision P = Precision::FP32>
template <typename Device = CPU, typename T = float>
class Loader {
public:
/*
* @b load separate format fluid model
* @b load the fluid model stored in separate form
* @b load a fluid model stored as separate files
* */
const Program<Dtype, P> Load(const std::string &dirname,
bool optimize = false,
bool quantification = false,
bool can_add_split = false);
const Program<Device, T> Load(const std::string &dirname,
bool optimize = false,
bool quantification = false,
bool can_add_split = false);
/*
* @b load combined format fluid model
* @b load a model in combined format
* @b load a fluid model stored in a single file
* */
const Program<Dtype, P> Load(const std::string &model_path,
const std::string &para_path,
bool optimize = false,
bool quantification = false);
const Program<Device, T> Load(const std::string &model_path,
const std::string &para_path,
bool optimize = false,
bool quantification = false);
const Program<Dtype, P> LoadCombinedMemory(size_t model_len,
const uint8_t *model_buf,
size_t combined_params_len,
uint8_t *combined_params_buf,
bool optimize = false,
bool quantification = false);
const Program<Device, T> LoadCombinedMemory(size_t model_len,
const uint8_t *model_buf,
size_t combined_params_len,
uint8_t *combined_params_buf,
bool optimize = false,
bool quantification = false);
private:
const Program<Dtype, P> LoadProgram(const std::string &model_path,
bool optimize = false,
bool quantification = false,
bool can_add_split = false);
const Program<Device, T> LoadProgram(const std::string &model_path,
bool optimize = false,
bool quantification = false,
bool can_add_split = false);
void InitMemoryFromProgram(
const std::shared_ptr<ProgramDesc> &originProgramDesc,
......
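A hedged sketch of the two loading paths declared above; both model paths are hypothetical:

```cpp
paddle_mobile::framework::Loader<paddle_mobile::CPU, float> loader;

// Separate-files format: a directory holding __model__ plus one file per
// parameter.
auto separate = loader.Load("./mobilenet", /*optimize=*/true);

// Combined format: a single model file plus a single parameter file.
auto combined = loader.Load("./model", "./params", /*optimize=*/true);
```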
......@@ -16,12 +16,12 @@ limitations under the License. */
#include <memory>
#include <string>
#include <utility>
#include <vector>
#include "tensor.h"
#include "tensor_util.h"
#include "framework/tensor.h"
#include "framework/tensor_util.h"
namespace paddle_mobile {
namespace framework {
/*
......@@ -202,5 +202,29 @@ void SerializeToStream(std::ostream &os, const LoDTensor &tensor);
void DeserializeFromStream(std::istream &is, LoDTensor *tensor);
#ifdef PADDLE_MOBILE_DEBUG
inline Print &operator<<(Print &printer, const LoDTensor &tensor) {
printer << " dims: " << tensor.dims() << "\n";
int stride = tensor.numel() / 20;
stride = stride > 0 ? stride : 1;
#ifndef PADDLE_MOBILE_FPGA
for (int i = 0; i < tensor.numel(); i += stride) {
if (tensor.type() == typeid(float)) {
printer << tensor.data<float>()[i] << " ";
} else if (tensor.type() == typeid(int32_t)) {
printer << tensor.data<int32_t>()[i] << " ";
} else if (tensor.type() == typeid(int64_t)) {
printer << tensor.data<int64_t>()[i] << " ";
} else if (tensor.type() == typeid(int8_t)) {
printer << static_cast<int>(tensor.data<int8_t>()[i]) << " ";
}
}
#endif // PADDLE_MOBILE_FPGA
return printer;
}
#endif // PADDLE_MOBILE_DEBUG
} // namespace framework
} // namespace paddle_mobile
......@@ -98,24 +98,6 @@ class OpRegistry {
}
};
#define REGISTER_OPERATOR_INT8(op_type, op_class, device_name, device_type) \
template class op_class<device_type, int8_t>; \
template <typename Dtype, typename T> \
class _OpClass_##op_type##_##device_name : public op_class<Dtype, T> { \
public: \
DEFINE_OP_CONSTRUCTOR(_OpClass_##op_type##_##device_name, op_class); \
}; \
static paddle_mobile::framework::OperatorRegistrar< \
device_type, _OpClass_##op_type##_##device_name<device_type, int8_t>> \
__op_registrar_##op_type##_##device_name(#op_type); \
int TouchOpRegistrar_##op_type##_##device_name() { \
__op_registrar_##op_type##_##device_name.Touch(); \
return 0; \
}
#define REGISTER_OPERATOR_CPU_INT8(op_type, op_class) \
REGISTER_OPERATOR_INT8(op_type, op_class, cpu, paddle_mobile::CPU);
#define REGISTER_OPERATOR(op_type, op_class, device_name, device_type) \
template class op_class<device_type, float>; \
template <typename Dtype, typename T> \
......
......@@ -220,7 +220,16 @@ void Node::Folder(
}
} else {
for (auto &op_output : this->op_desc_->outputs_) {
op_desc->outputs_.emplace(op_output.first, op_output.second);
auto output_key = op_output.first;
if (change->find(this->type_) != change->end()) {
const auto change_pairs = (*change)[this->type_];
for (const auto &target : change_pairs) {
if (target.first == output_key) {
output_key = target.second;
}
}
}
op_desc->outputs_.emplace(output_key, op_output.second);
}
for (auto &output : this->outputs_) {
......
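The loop above reads the `change` table as a map from op type to (old output key, new output key) pairs; a hedged sketch of such a table, inferred from that usage (the batch_norm entry is purely illustrative):

```cpp
std::unordered_map<std::string,
                   std::vector<std::pair<std::string, std::string>>>
    change = {
        // when batch_norm is folded away, rename its "Y" output to "Out"
        {"batch_norm", {{"Y", "Out"}}},
    };
```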
......@@ -14,16 +14,15 @@ limitations under the License. */
#pragma once
#include <string>
#include "common/types.h"
#include "framework/program/program_desc.h"
#include "framework/scope.h"
#include <string>
namespace paddle_mobile {
namespace framework {
template <typename Dtype, Precision P = Precision::FP32>
template <typename Device, typename T = float>
class Program {
public:
std::shared_ptr<ProgramDesc> originProgram;
......
......@@ -26,6 +26,7 @@ limitations under the License. */
namespace paddle_mobile {
namespace framework {
class Scope {
public:
Scope() = default;
......
......@@ -148,8 +148,8 @@ class Tensor : public TensorBase {
PADDLE_MOBILE_ENFORCE(
(std::is_same<T, void>::value ||
holder_->type().hash_code() == typeid(T).hash_code()),
"Tensor holds the wrong type, it holds %s",
this->holder_->type().name());
"Tensor holds the wrong type, it holds %s, requested %s",
this->holder_->type().name(), typeid(T).name());
return reinterpret_cast<T *>(reinterpret_cast<uintptr_t>(holder_->ptr()) +
offset_);
......@@ -162,7 +162,7 @@ class Tensor : public TensorBase {
PADDLE_MOBILE_ENFORCE(
(std::is_same<T, void>::value ||
holder_->type().hash_code() == typeid(T).hash_code()),
"Tensor holds the wrong type, it holds %s ,requested:%s",
"Tensor holds the wrong type, it holds %s, requested %s",
this->holder_->type().name(), typeid(T).name());
return reinterpret_cast<const T *>(
......@@ -226,7 +226,6 @@ inline Print &operator<<(Print &printer, const Tensor &tensor) {
}
}
#endif
return printer;
}
......
......@@ -18,17 +18,17 @@
namespace paddle_mobile {
template <typename Dtype, Precision P>
PaddleMobilePredictor<Dtype, P>::PaddleMobilePredictor(
template <typename Device, typename T>
PaddleMobilePredictor<Device, T>::PaddleMobilePredictor(
const PaddleMobileConfig &config) {
PADDLE_MOBILE_ENFORCE(Init(config) == true,
"paddle mobile predictor init failed!");
config_ = config;
}
template <typename Dtype, Precision P>
bool PaddleMobilePredictor<Dtype, P>::Init(const PaddleMobileConfig &config) {
paddle_mobile_.reset(new PaddleMobile<Dtype, P>());
template <typename Device, typename T>
bool PaddleMobilePredictor<Device, T>::Init(const PaddleMobileConfig &config) {
paddle_mobile_.reset(new PaddleMobile<Device, T>());
#ifdef PADDLE_MOBILE_CL
paddle_mobile_->SetCLPath(config.cl_path);
#endif
......@@ -52,8 +52,8 @@ bool PaddleMobilePredictor<Dtype, P>::Init(const PaddleMobileConfig &config) {
paddle_mobile_->SetThreadNum(config.thread_num);
return true;
}
template <typename Dtype, Precision P>
bool PaddleMobilePredictor<Dtype, P>::Run(
template <typename Device, typename T>
bool PaddleMobilePredictor<Device, T>::Run(
const std::vector<PaddleTensor> &inputs,
std::vector<PaddleTensor> *output_data, int batch_size) {
if (inputs.empty()) {
......@@ -78,12 +78,12 @@ bool PaddleMobilePredictor<Dtype, P>::Run(
framework::Tensor input_tensor;
input_tensor.Resize(ddim);
int input_length = framework::product(ddim);
typedef typename PrecisionTrait<P>::ptype PType;
auto input_ptr = input_tensor.mutable_data<PType>();
auto input_ptr = input_tensor.mutable_data<T>();
memcpy(input_ptr, static_cast<PType *>(input.data.data()),
input_length * sizeof(PType));
auto output_tensor = paddle_mobile_->Predict(input_tensor);
memcpy(input_ptr, static_cast<T *>(input.data.data()),
input_length * sizeof(T));
paddle_mobile_->Predict(input_tensor);
auto output_tensor = paddle_mobile_->Fetch();
if (output_data->empty()) {
LOG(kLOG_ERROR) << "At least one output should be set with tensors' names.";
......@@ -99,18 +99,18 @@ bool PaddleMobilePredictor<Dtype, P>::Run(
output.shape.push_back(static_cast<int>(d));
}
if (output.data.length() < output_length * sizeof(PType)) {
output.data.Resize(output_length * sizeof(PType));
if (output.data.length() < output_length * sizeof(T)) {
output.data.Resize(output_length * sizeof(T));
}
memcpy(output.data.data(), output_tensor->template data<PType>(),
output_length * sizeof(PType));
memcpy(output.data.data(), output_tensor->template data<T>(),
output_length * sizeof(T));
return true;
}
template <typename Dtype, Precision P>
PaddleMobilePredictor<Dtype, P>::~PaddleMobilePredictor() {
template <typename Device, typename T>
PaddleMobilePredictor<Device, T>::~PaddleMobilePredictor() {
paddle_mobile_->Clear();
}
......@@ -122,13 +122,13 @@ CreatePaddlePredictor<PaddleMobileConfig, PaddleEngineKind::kPaddleMobile>(
std::unique_ptr<PaddlePredictor> x;
if (config.precision == PaddleMobileConfig::FP32) {
if (config.device == PaddleMobileConfig::kCPU) {
x.reset(new PaddleMobilePredictor<CPU, Precision::FP32>(config));
x.reset(new PaddleMobilePredictor<CPU, float>(config));
} else if (config.device == PaddleMobileConfig::kFPGA) {
x.reset(new PaddleMobilePredictor<FPGA, Precision::FP32>(config));
x.reset(new PaddleMobilePredictor<FPGA, float>(config));
} else if (config.device == PaddleMobileConfig::kGPU_MALI) {
x.reset(new PaddleMobilePredictor<GPU_MALI, Precision::FP32>(config));
x.reset(new PaddleMobilePredictor<GPU_MALI, float>(config));
} else if (config.device == PaddleMobileConfig::kGPU_CL) {
x.reset(new PaddleMobilePredictor<GPU_CL, Precision::FP32>(config));
x.reset(new PaddleMobilePredictor<GPU_CL, float>(config));
} else {
LOG(kLOG_ERROR) << "unsupport device type!";
return nullptr;
......
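A minimal usage sketch for the predictor API above. The header path, the model_dir field, and the input shape are assumptions for illustration; only Run()'s contract (at least one named output slot) comes from the code itself.
#include <vector>
#include "io/paddle_inference_api.h"  // assumed public API header
int main() {
  paddle_mobile::PaddleMobileConfig config;
  config.precision = paddle_mobile::PaddleMobileConfig::FP32;
  config.device = paddle_mobile::PaddleMobileConfig::kCPU;
  config.model_dir = "./mobilenet";  // hypothetical model directory
  config.thread_num = 4;
  auto predictor = paddle_mobile::CreatePaddlePredictor<
      paddle_mobile::PaddleMobileConfig,
      paddle_mobile::PaddleEngineKind::kPaddleMobile>(config);
  std::vector<float> buf(1 * 3 * 224 * 224, 0.f);  // fill with real data
  paddle_mobile::PaddleTensor input;
  input.shape = {1, 3, 224, 224};
  input.data = paddle_mobile::PaddleBuf(buf.data(), buf.size() * sizeof(float));
  input.dtype = paddle_mobile::PaddleDType::FLOAT32;
  paddle_mobile::PaddleTensor output;
  output.name = "fetch";  // Run() requires at least one named output slot
  std::vector<paddle_mobile::PaddleTensor> outputs = {output};
  predictor->Run({input}, &outputs);  // results land in outputs[0].data
  return 0;
}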
......@@ -29,7 +29,7 @@ limitations under the License. */
namespace paddle_mobile {
template <typename Dtype = CPU, Precision P = Precision::FP32>
template <typename Device = CPU, typename T = float>
class PaddleMobilePredictor : public PaddlePredictor {
public:
PaddleMobilePredictor() = delete;
......@@ -43,7 +43,7 @@ class PaddleMobilePredictor : public PaddlePredictor {
~PaddleMobilePredictor() override;
private:
std::unique_ptr<PaddleMobile<Dtype, P>> paddle_mobile_;
std::unique_ptr<PaddleMobile<Device, T>> paddle_mobile_;
bool Init(const PaddleMobileConfig& config);
PaddleMobileConfig config_;
......
......@@ -59,7 +59,7 @@
@interface PaddleMobileCPU()
{
paddle_mobile::PaddleMobile<paddle_mobile::CPU, paddle_mobile::Precision::FP32> *pam_;
paddle_mobile::PaddleMobile<paddle_mobile::CPU, float> *pam_;
BOOL loaded_;
}
......@@ -73,7 +73,7 @@ static std::mutex shared_mutex;
- (instancetype)initWithConfig:(PaddleMobileCPUConfig *)config {
if (self = [super init]) {
pam_ = new paddle_mobile::PaddleMobile<paddle_mobile::CPU, paddle_mobile::Precision::FP32>();
pam_ = new paddle_mobile::PaddleMobile<paddle_mobile::CPU, float>();
_config = config;
}
return self;
......@@ -82,6 +82,7 @@ static std::mutex shared_mutex;
-(instancetype)init {
if (self = [super init]) {
_config = [[PaddleMobileCPUConfig alloc] init];
pam_ = new paddle_mobile::PaddleMobile<paddle_mobile::CPU, float>();
}
return self;
}
......@@ -246,7 +247,8 @@ static std::mutex shared_mutex;
memcpy(input_ptr, input,
numel * sizeof(float));
std::shared_ptr<paddle_mobile::framework::Tensor> output = pam_->Predict(input_tensor);
pam_->Predict(input_tensor);
std::shared_ptr<paddle_mobile::framework::Tensor> output = pam_->Fetch();
float *output_pointer = new float[output->numel()];
......
......@@ -16,21 +16,23 @@ limitations under the License. */
#include "paddle_mobile_jni.h"
#include <cmath>
#include <string>
#include <vector>
#include "common/log.h"
#include "framework/tensor.h"
#include "io/paddle_mobile.h"
#ifdef ENABLE_EXCEPTION
#include "common/enforce.h"
#endif
#ifdef __cplusplus
extern "C" {
#endif
namespace paddle_mobile {
namespace jni {
using framework::DDim;
using framework::Program;
using framework::Tensor;
......@@ -200,7 +202,8 @@ JNIEXPORT jfloatArray JNICALL Java_com_baidu_paddle_PML_predictImage(
for (int i = 0; i < length; i++) {
input_ptr[i] = dataPointer[i];
}
auto output = getPaddleMobileInstance()->Predict(input);
getPaddleMobileInstance()->Predict(input);
auto output = getPaddleMobileInstance()->Fetch();
count = output->numel();
result = env->NewFloatArray(count);
env->SetFloatArrayRegion(result, 0, count, output->data<float>());
......@@ -233,7 +236,8 @@ JNIEXPORT jfloatArray JNICALL Java_com_baidu_paddle_PML_predictImage(
for (int i = 0; i < length; i++) {
input_ptr[i] = dataPointer[i];
}
auto output = getPaddleMobileInstance()->Predict(input);
getPaddleMobileInstance()->Predict(input);
auto output = getPaddleMobileInstance()->Fetch();
count = output->numel();
result = env->NewFloatArray(count);
env->SetFloatArrayRegion(result, 0, count, output->data<float>());
......@@ -328,7 +332,8 @@ JNIEXPORT jfloatArray JNICALL Java_com_baidu_paddle_PML_predictYuv(
for (int i = 0; i < length; i++) {
input_ptr[i] = matrix[i];
}
auto output = getPaddleMobileInstance()->Predict(input);
getPaddleMobileInstance()->Predict(input);
auto output = getPaddleMobileInstance()->Fetch();
count = output->numel();
result = env->NewFloatArray(count);
env->SetFloatArrayRegion(result, 0, count, output->data<float>());
......@@ -363,7 +368,8 @@ JNIEXPORT jfloatArray JNICALL Java_com_baidu_paddle_PML_predictYuv(
for (int i = 0; i < length; i++) {
input_ptr[i] = matrix[i];
}
auto output = getPaddleMobileInstance()->Predict(input);
getPaddleMobileInstance()->Predict(input);
auto output = getPaddleMobileInstance()->Fetch();
count = output->numel();
result = env->NewFloatArray(count);
env->SetFloatArrayRegion(result, 0, count, output->data<float>());
......@@ -399,7 +405,8 @@ Java_com_baidu_paddle_PML_predictLod(JNIEnv *env, jclass thiz, jlongArray buf) {
auto *pdata = words.mutable_data<int64_t>();
size_t n = words.numel() * sizeof(int64_t);
memcpy(pdata, ids.data(), n);
auto vec_result = paddle_mobile.PredictLod(words);
paddle_mobile.Predict(words);
auto vec_result = paddle_mobile.Fetch();
int count = vec_result->numel();
jlongArray result = NULL;
ANDROIDLOGE("predict nlp size %d", count);
......
......@@ -13,66 +13,68 @@ See the License for the specific language governing permissions and
limitations under the License. */
#include "io/paddle_mobile.h"
#include <utility>
#include "common/common.h"
#ifdef PADDLE_MOBILE_CL
#include <CL/cl.h>
#include "framework/cl/cl_tensor.h"
#endif
#include "common/common.h"
#include "operators/math/gemm.h"
namespace paddle_mobile {
template <typename Dtype, Precision P>
void PaddleMobile<Dtype, P>::SetThreadNum(int num) {
template <typename Device, typename T>
void PaddleMobile<Device, T>::SetThreadNum(int num) {
#ifdef _OPENMP
omp_set_num_threads(num);
#endif
}
template <typename Dtype, Precision P>
bool PaddleMobile<Dtype, P>::Load(const std::string &dirname, bool optimize,
bool quantification, int batch_size,
bool loddable) {
template <typename Device, typename T>
PMStatus PaddleMobile<Device, T>::Load(const std::string &dirname,
bool optimize, bool quantification,
int batch_size, bool loddable) {
if (loader_.get() == nullptr) {
loader_ = std::make_shared<framework::Loader<Dtype, P>>();
loader_ = std::make_shared<framework::Loader<Device, T>>();
} else {
LOG(kLOG_INFO) << "loader inited";
}
if (executor_.get() == nullptr) {
executor_ = std::make_shared<framework::Executor<Dtype, P>>(
executor_ = std::make_shared<framework::Executor<Device, T>>(
loader_->Load(dirname, optimize, quantification), batch_size, optimize,
loddable);
} else {
LOG(kLOG_INFO) << "executor inited";
}
return true;
return PMSuccess;
}
template <typename Dtype, Precision P>
bool PaddleMobile<Dtype, P>::Load(const std::string &model_path,
const std::string &para_path, bool optimize,
bool quantification, int batch_size,
bool loddable) {
template <typename Device, typename T>
PMStatus PaddleMobile<Device, T>::Load(const std::string &model_path,
const std::string &para_path,
bool optimize, bool quantification,
int batch_size, bool loddable) {
if (loader_.get() == nullptr) {
loader_ = std::make_shared<framework::Loader<Dtype, P>>();
loader_ = std::make_shared<framework::Loader<Device, T>>();
} else {
LOG(kLOG_INFO) << "loader inited";
}
if (executor_.get() == nullptr) {
executor_ = std::make_shared<framework::Executor<Dtype, P>>(
executor_ = std::make_shared<framework::Executor<Device, T>>(
loader_->Load(model_path, para_path, optimize, quantification),
batch_size, optimize, loddable);
} else {
LOG(kLOG_INFO) << "executor inited";
}
return true;
return PMSuccess;
}
template <typename Dtype, Precision P>
bool PaddleMobile<Dtype, P>::LoadCombinedMemory(size_t model_len,
template <typename Device, typename T>
bool PaddleMobile<Device, T>::LoadCombinedMemory(size_t model_len,
const uint8_t *model_buf,
size_t combined_params_len,
uint8_t *combined_params_buf,
......@@ -80,13 +82,12 @@ bool PaddleMobile<Dtype, P>::LoadCombinedMemory(size_t model_len,
bool quantification, int batch_size,
bool loddable) {
if (loader_.get() == nullptr) {
loader_ = std::make_shared<framework::Loader<Dtype, P>>();
loader_ = std::make_shared<framework::Loader<Device, T>>();
} else {
LOG(kLOG_INFO) << "loader inited";
}
if (executor_.get() == nullptr) {
executor_ = std::make_shared<framework::Executor<Dtype, P>>(
executor_ = std::make_shared<framework::Executor<Device, T>>(
loader_->LoadCombinedMemory(model_len, model_buf, combined_params_len,
combined_params_buf, optimize,
quantification),
......@@ -95,38 +96,76 @@ bool PaddleMobile<Dtype, P>::LoadCombinedMemory(size_t model_len,
LOG(kLOG_INFO) << "executor inited";
}
return true;
return PMSuccess;
}
template <typename Dtype, Precision P>
std::shared_ptr<framework::Tensor> PaddleMobile<Dtype, P>::Predict(
const framework::Tensor &t) {
return executor_->Predict(t);
template <typename Device, typename T>
PMStatus PaddleMobile<Device, T>::Predict(const framework::Tensor &input) {
std::vector<std::pair<std::string, framework::Tensor>> inputs;
inputs.push_back(std::make_pair("feed", input));
return this->Predict(inputs);
}
template <typename Device, typename T>
PMStatus PaddleMobile<Device, T>::Predict(const framework::LoDTensor &input) {
std::vector<std::pair<std::string, framework::LoDTensor>> inputs;
inputs.push_back(std::make_pair("feed", input));
return this->Predict(inputs);
}
template <typename Dtype, Precision P>
std::shared_ptr<framework::Tensor> PaddleMobile<Dtype, P>::PredictLod(
const framework::LoDTensor &t) {
return executor_->PredictLod(t);
template <typename Device, typename T>
PMStatus PaddleMobile<Device, T>::Predict(
const std::vector<std::pair<std::string, framework::Tensor>> &inputs) {
return executor_->Predict(inputs);
}
template <typename Dtype, Precision P>
std::vector<typename PaddleMobile<Dtype, P>::Ptype>
PaddleMobile<Dtype, P>::Predict(const std::vector<Ptype> &input,
const std::vector<int64_t> &dims) {
template <typename Device, typename T>
PMStatus PaddleMobile<Device, T>::Predict(
const std::vector<std::pair<std::string, framework::LoDTensor>> &inputs) {
return executor_->Predict(inputs);
}
template <typename Device, typename T>
std::vector<T> PaddleMobile<Device, T>::Predict(
const std::vector<T> &input, const std::vector<int64_t> &dims) {
return executor_->Predict(input, dims);
}
template <typename Dtype, Precision P>
void PaddleMobile<Dtype, P>::Clear() {
template <typename Device, typename T>
PMStatus PaddleMobile<Device, T>::Predict() {
return executor_->Predict();
}
template <typename Device, typename T>
void PaddleMobile<Device, T>::Feed(const framework::Tensor &input,
const std::string &var_name) {
executor_->SetInput(input, var_name);
}
template <typename Device, typename T>
void PaddleMobile<Device, T>::Feed(const framework::LoDTensor &input,
const std::string &var_name) {
executor_->SetInput(input, var_name);
}
typedef std::shared_ptr<framework::LoDTensor> LoDTensorPtr;
template <typename Device, typename T>
LoDTensorPtr PaddleMobile<Device, T>::Fetch(const std::string &var_name) {
return executor_->GetOutput(var_name);
}
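Taken together, a minimal sketch of the new Feed/Predict/Fetch flow (model directory and input shape are placeholders):
paddle_mobile::PaddleMobile<paddle_mobile::CPU, float> engine;
if (engine.Load("./model_dir", /*optimize=*/true) == paddle_mobile::PMSuccess) {
  paddle_mobile::framework::Tensor input;
  input.Resize(paddle_mobile::framework::make_ddim({1, 3, 224, 224}));
  input.mutable_data<float>();   // fill with real input data in practice
  engine.Feed(input, "feed");    // Predict(input) would feed and run in one call
  engine.Predict();              // run the whole program
  auto output = engine.Fetch();  // shared_ptr<framework::LoDTensor> of "fetch"
  const float *result = output->data<float>();
}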
template <typename Device, typename T>
void PaddleMobile<Device, T>::Clear() {
executor_ = nullptr;
loader_ = nullptr;
}
template <typename Dtype, Precision P>
double PaddleMobile<Dtype, P>::GetPredictTime() {}
template <typename Device, typename T>
double PaddleMobile<Device, T>::GetPredictTime() { return 0.0; }  // specialized for CPU and GPU_CL below
#ifdef PADDLE_MOBILE_CPU
template <>
double PaddleMobile<CPU, Precision::FP32>::GetPredictTime() {
double PaddleMobile<CPU, float>::GetPredictTime() {
int m = 32;
int n = 224 * 224;
int k = 27;
......@@ -142,14 +181,13 @@ double PaddleMobile<CPU, Precision::FP32>::GetPredictTime() {
int t1 = 1;
int t2 = 1;
for (int i = 0; i < m * k; ++i) {
unsigned int seed = 100;
a[i] = t1 + rand_r(&seed) % t2;
a[i] = t1 + rand() % t2; // NOLINT
}
for (int i = 0; i < k * n; ++i) {
unsigned int seed = 200;
b[i] = t1 + rand_r(&seed) % t2;
b[i] = t1 + rand() % t2; // NOLINT
}
paddle_mobile::operators::math::Gemm gemm;
operators::math::Gemm gemm;
auto time1 = paddle_mobile::time();
gemm.Sgemm(m, n, k, static_cast<float>(1), a, lda, b, ldb,
static_cast<float>(0), c, ldc, false,
......@@ -163,57 +201,51 @@ double PaddleMobile<CPU, Precision::FP32>::GetPredictTime() {
}
#endif
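For scale: the Sgemm timed above multiplies a 32x27 matrix by a 27x50176 one (m = 32, k = 27 = 3*3*3, n = 224*224), i.e. the GEMM-lowered first 3x3, 3-to-32-channel convolution over a 224x224 image, costing 2*m*n*k = 2*32*50176*27 = 86,704,128 (about 8.7e7) floating-point operations per call.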
template <typename Dtype, Precision P>
PaddleMobile<Dtype, P>::~PaddleMobile() {
executor_ = nullptr;
loader_ = nullptr;
}
#ifdef PADDLE_MOBILE_FPGA
template <typename Dtype, Precision P>
void PaddleMobile<Dtype, P>::InjectVariable(const framework::Tensor &t,
std::string var_name) {
template <typename Device, typename T>
void PaddleMobile<Device, T>::InjectVariable(const framework::Tensor &t,
std::string var_name) {
executor_->InjectVariable(t, var_name);
}
template <typename Dtype, Precision P>
void PaddleMobile<Dtype, P>::FeedData(const framework::Tensor &t) {
template <typename Device, typename T>
void PaddleMobile<Device, T>::FeedData(const framework::Tensor &t) {
executor_->FeedData(t);
}
template <typename Dtype, Precision P>
std::shared_ptr<framework::Tensor> PaddleMobile<Dtype, P>::FetchResult(int id) {
template <typename Device, typename T>
std::shared_ptr<framework::Tensor> PaddleMobile<Device, T>::FetchResult(
int id) {
return executor_->FetchResult(id);
}
template <typename Dtype, Precision P>
void PaddleMobile<Dtype, P>::Predict_From_To(int start, int end) {
template <typename Device, typename T>
void PaddleMobile<Device, T>::Predict_From_To(int start, int end) {
executor_->Predict_From_To(start, end);
}
template <typename Dtype, Precision P>
void PaddleMobile<Dtype, P>::Predict_From(int start) {
template <typename Device, typename T>
void PaddleMobile<Device, T>::Predict_From(int start) {
executor_->Predict_From(start);
}
template <typename Dtype, Precision P>
void PaddleMobile<Dtype, P>::Predict_To(int end) {
template <typename Device, typename T>
void PaddleMobile<Device, T>::Predict_To(int end) {
executor_->Predict_To(end);
}
#endif
#ifdef PADDLE_MOBILE_CL
static std::mutex lc;
template <typename Dtype, Precision P>
void PaddleMobile<Dtype, P>::SetCLPath(std::string path) {
template <typename Device, typename T>
void PaddleMobile<Device, T>::SetCLPath(std::string path) {
std::lock_guard<std::mutex> lock(lc);
if (framework::CLEngine::Instance()->GetCLPath() == "") {
framework::CLEngine::Instance()->setClPath(path);
}
}
template <>
double PaddleMobile<GPU_CL, Precision::FP32>::GetPredictTime() {
double PaddleMobile<GPU_CL, float>::GetPredictTime() {
cl_int status;
cl_uint nPlatform;
clGetPlatformIDs(0, NULL, &nPlatform);
......@@ -411,8 +443,8 @@ double PaddleMobile<GPU_CL, Precision::FP32>::GetPredictTime() {
return -1;
}
}
template <typename Dtype, Precision P>
int PaddleMobile<Dtype, P>::readText(
template <typename Device, typename T>
int PaddleMobile<Device, T>::readText(
const char *kernelPath,
char **pcode) {  // reads a text file into pcode; returns the string length
FILE *fp;
......@@ -441,13 +473,11 @@ int PaddleMobile<Dtype, P>::readText(
fclose(fp);
return size + 1;
}
#endif
template class PaddleMobile<CPU, Precision::FP32>;
template class PaddleMobile<FPGA, Precision::FP32>;
template class PaddleMobile<GPU_MALI, Precision::FP32>;
template class PaddleMobile<GPU_CL, Precision::FP32>;
template class PaddleMobile<CPU, float>;
template class PaddleMobile<FPGA, float>;
template class PaddleMobile<GPU_MALI, float>;
template class PaddleMobile<GPU_CL, float>;
} // namespace paddle_mobile
......@@ -16,6 +16,7 @@ limitations under the License. */
#include <memory>
#include <string>
#include <utility>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
......@@ -32,44 +33,53 @@ limitations under the License. */
namespace paddle_mobile {
template <typename Dtype = CPU, Precision P = Precision::FP32>
template <typename Device, typename T = float>
class PaddleMobile {
typedef typename PrecisionTrait<P>::ptype Ptype;
public:
PaddleMobile() {
#ifndef PADDLE_MOBILE_CL
bool is_gpu = std::is_same<DeviceType<kGPU_CL>, Dtype>::value;
PADDLE_MOBILE_ENFORCE(!is_gpu,
"Not Enable GPU in CmakeList but run gpu codes ");
bool is_gpu = std::is_same<DeviceType<kGPU_CL>, Device>::value;
PADDLE_MOBILE_ENFORCE(!is_gpu, "Please recompile with GPU_CL is on");
#endif
}
bool Load(const std::string &dirname, bool optimize = false,
bool quantification = false, int batch_size = 1,
bool loddable = false);
~PaddleMobile() {}
PMStatus Load(const std::string &dirname, const bool optimize = false,
const bool quantification = false, const int batch_size = 1,
const bool lod = false);
PMStatus Load(const std::string &model_path, const std::string &para_path,
const bool optimize = false, const bool quantification = false,
const int batch_size = 1, const bool lod = false);
PMStatus Predict(const framework::Tensor &input);
PMStatus Predict(const framework::LoDTensor &input);
bool Load(const std::string &model_path, const std::string &para_path,
bool optimize = false, bool quantification = false,
int batch_size = 1, bool loddable = false);
PMStatus Predict(
const std::vector<std::pair<std::string, framework::Tensor>> &inputs);
PMStatus Predict(
const std::vector<std::pair<std::string, framework::LoDTensor>> &inputs);
std::shared_ptr<framework::Tensor> Predict(const framework::Tensor &t);
std::vector<T> Predict(const std::vector<T> &input,
const std::vector<int64_t> &dims);
PMStatus Predict();
std::shared_ptr<framework::Tensor> PredictLod(const framework::LoDTensor &t);
void Feed(const framework::LoDTensor &input, const std::string &var_name);
void Feed(const framework::Tensor &input, const std::string &var_name);
std::vector<Ptype> Predict(const std::vector<Ptype> &input,
const std::vector<int64_t> &dims);
typedef std::shared_ptr<framework::LoDTensor> LoDTensorPtr;
LoDTensorPtr Fetch(const std::string &var_name);
LoDTensorPtr Fetch() { return Fetch("fetch"); }
bool LoadCombinedMemory(size_t model_len, const uint8_t *model_buf,
size_t combined_params_len,
uint8_t *combined_params_buf, bool optimize = false, bool quantification = false,
int batch_size = 1, bool loddable = false);
void SetThreadNum(int num);
void SetThreadNum(int count);
void Clear();
double GetPredictTime();
~PaddleMobile();
#ifdef PADDLE_MOBILE_FPGA
void InjectVariable(const framework::Tensor &t, std::string var_name);
void FeedData(const framework::Tensor &t);
......@@ -80,15 +90,15 @@ class PaddleMobile {
#endif
#ifdef PADDLE_MOBILE_CL
public:
public: // NOLINT
void SetCLPath(std::string cl_path);
int readText(const char *kernelPath,
char **pcode);  // reads a text file into pcode; returns the string length
#endif
private:
std::shared_ptr<framework::Loader<Dtype, P>> loader_;
std::shared_ptr<framework::Executor<Dtype, P>> executor_;
std::shared_ptr<framework::Loader<Device, T>> loader_;
std::shared_ptr<framework::Executor<Device, T>> executor_;
};
} // namespace paddle_mobile
......@@ -14,10 +14,12 @@ limitations under the License. */
#include "io/paddle_test_inference_api.h"
#include "io/paddle_mobile.h"
namespace paddle_mobile {
template <typename Dtype, Precision P>
double PaddleTester<Dtype, P>::CaculatePredictTime(std::string *cl_path) {
PaddleMobile<Dtype, P> paddle_mobile;
template <typename Device, typename T>
double PaddleTester<Device, T>::CaculatePredictTime(std::string *cl_path) {
PaddleMobile<Device, T> paddle_mobile;
#ifdef PADDLE_MOBILE_CL
if (cl_path) {
paddle_mobile.SetCLPath(*cl_path);
......@@ -26,10 +28,10 @@ double PaddleTester<Dtype, P>::CaculatePredictTime(std::string *cl_path) {
#endif
return paddle_mobile.GetPredictTime();
}
template class PaddleTester<CPU, Precision::FP32>;
template class PaddleTester<FPGA, Precision::FP32>;
template class PaddleTester<GPU_MALI, Precision::FP32>;
template class PaddleTester<CPU, float>;
template class PaddleTester<FPGA, float>;
template class PaddleTester<GPU_MALI, float>;
template class PaddleTester<GPU_CL, Precision::FP32>;
template class PaddleTester<GPU_CL, float>;
} // namespace paddle_mobile
......@@ -20,10 +20,13 @@ limitations under the License. */
*/
#pragma once
#include "common/types.h"
#include "string"
namespace paddle_mobile {
template <typename Dtype, Precision P = Precision::FP32>
template <typename Device, typename T = float>
class PaddleTester {
public:
double CaculatePredictTime(std::string *cl_path = nullptr);
......
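A usage sketch for the tester above (the OpenCL kernel path is a placeholder):
paddle_mobile::PaddleTester<paddle_mobile::CPU, float> tester;
double cpu_cost = tester.CaculatePredictTime();  // CPU: times the reference Sgemm
// OpenCL builds should pass the kernel path first:
// std::string cl_path = "/data/local/tmp";  // hypothetical
// double cl_cost = paddle_mobile::PaddleTester<paddle_mobile::GPU_CL, float>()
//                      .CaculatePredictTime(&cl_path);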
......@@ -12,26 +12,25 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
#ifdef CAST_OP
#ifdef FUSION_DEQUANT_ADD_BN_OP
#include "framework/operator.h"
#include "operators/op_param.h"
#include "operators/cast_op.h"
namespace paddle_mobile {
namespace operators {
template <typename DeviceType, typename T>
class FusionDequantAddBNKernel
: public framework::OpKernelBase<DeviceType,
FusionDequantAddBNParam<DeviceType>> {
public:
void Compute(const FusionDequantAddBNParam<DeviceType> &param);
bool Init(FusionDequantAddBNParam<DeviceType> *param);
};
void CastOp<DeviceType, T>::InferShape() const {
const auto &dims = this->param_.input_->dims();
this->param_.output_->Resize(dims);
}
} // namespace operators
} // namespace paddle_mobile
namespace ops = paddle_mobile::operators;
#ifdef PADDLE_MOBILE_CPU
REGISTER_OPERATOR_CPU(cast, ops::CastOp);
#endif
#endif // CAST_OP
......@@ -12,39 +12,34 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef FUSION_FC_INT8_OP
#ifdef CAST_OP
#pragma once
#include <string>
#include <vector>
#include "framework/operator.h"
#include "framework/program/program-optimize/fusion_op_register.h"
#include "operators/kernel/fusion_fc_kernel.h"
#include "operators/kernel/kernels.h"
#include "operators/op_param.h"
namespace paddle_mobile {
namespace operators {
template <typename DeviceType, typename T>
class FusionFcInt8Op
: public framework::OperatorWithKernel<DeviceType,
FusionFcParam<DeviceType>,
FusionFcKernel<DeviceType, T>> {
class CastOp : public framework::OperatorWithKernel<
DeviceType, CastParam<DeviceType>,
operators::CastKernel<DeviceType, T>> {
public:
FusionFcInt8Op(const std::string &type, const VariableNameMap &inputs,
const VariableNameMap &outputs,
const framework::AttributeMap &attrs,
std::shared_ptr<framework::Scope> scope)
: framework::OperatorWithKernel<DeviceType, FusionFcParam<DeviceType>,
FusionFcKernel<DeviceType, T>>(
CastOp(const std::string &type, const VariableNameMap &inputs,
const VariableNameMap &outputs, const framework::AttributeMap &attrs,
std::shared_ptr<framework::Scope> scope)
: framework::OperatorWithKernel<DeviceType, CastParam<DeviceType>,
operators::CastKernel<DeviceType, T>>(
type, inputs, outputs, attrs, scope) {}
// inference output shape
void InferShape() const override;
};
} // namespace operators
} // namespace paddle_mobile
#endif // FUSION_FC_INT8_OP
#endif // CAST_OP
......@@ -33,4 +33,4 @@ namespace ops = paddle_mobile::operators;
REGISTER_OPERATOR_CPU(dequantize, ops::DequantizeOp);
#endif
#endif
#endif // DEQUANT_OP
......@@ -44,4 +44,4 @@ class DequantizeOp
} // namespace operators
} // namespace paddle_mobile
#endif
#endif // DEQUANT_OP
......@@ -25,12 +25,11 @@ limitations under the License. */
namespace paddle_mobile {
namespace operators {
using std::string;
template <typename DeviceType, typename T>
class FillConstantOp : public framework::OperatorBase<DeviceType> {
public:
FillConstantOp(const string &type, const VariableNameMap &inputs,
FillConstantOp(const std::string &type, const VariableNameMap &inputs,
const VariableNameMap &outputs,
const framework::AttributeMap attrs,
std::shared_ptr<framework::Scope> scope)
......@@ -58,7 +57,7 @@ class FillConstantOp : public framework::OperatorBase<DeviceType> {
tensor->Resize(framework::make_ddim(param_.Shape()));
tensor->mutable_data(framework::ToTypeIndex(data_type));
math::set_constant(tensor, value);
math::SetConstant(tensor, value);
}
void Init() {}
......
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef FUSION_CONVADDRELU_INT8_OP
#include "operators/fusion_conv_add_relu_int8_op.h"
#include <vector>
#include "operators/math/conv_func.h"
namespace paddle_mobile {
namespace operators {
template <typename Dtype, typename T>
void FusionConvAddReluInt8Op<Dtype, T>::InferShape() const {
auto in_dims = this->param_.Input()->dims();
auto filter_dims = this->param_.Filter()->dims();
const std::vector<int> &strides = this->param_.Strides();
std::vector<int> paddings = this->param_.Paddings();
int groups = this->param_.Groups();
std::vector<int> dilations = this->param_.Dilations();
PADDLE_MOBILE_ENFORCE((in_dims.size() == filter_dims.size() &&
dilations.size() == paddings.size() &&
paddings.size() == strides.size()),
"ConvParam is not suitable");
std::vector<int64_t> output_shape({in_dims[0], filter_dims[0]});
for (size_t i = 0; i < strides.size(); ++i) {
output_shape.push_back(
math::ConvOutputSize(in_dims[i + 2], filter_dims[i + 2], dilations[i],
paddings[i], strides[i]));
}
framework::DDim ddim = framework::make_ddim(output_shape);
this->param_.Output()->Resize(ddim);
}
} // namespace operators
} // namespace paddle_mobile
namespace ops = paddle_mobile::operators;
#ifdef PADDLE_MOBILE_CPU
REGISTER_OPERATOR_CPU_INT8(fusion_conv_add_relu_int8,
ops::FusionConvAddReluInt8Op);
#endif
#endif // FUSION_CONVADDRELU_INT8_OP
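The per-dimension output size above follows the standard dilated-convolution formula; a self-contained sketch of what math::ConvOutputSize is assumed to compute:
// Effective kernel extent grows with dilation; padding is symmetric.
inline int ConvOutputSize(int input_size, int filter_size, int dilation,
                          int padding, int stride) {
  const int dkernel = dilation * (filter_size - 1) + 1;
  return (input_size + 2 * padding - dkernel) / stride + 1;
}
// e.g. ConvOutputSize(224, 3, /*dilation=*/1, /*padding=*/1, /*stride=*/2) == 112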
......@@ -20,7 +20,7 @@ limitations under the License. */
#include <vector>
#include "framework/operator.h"
#include "framework/program/program-optimize/fusion_op_register.h"
#include "operators/kernel/dequant_add_bn_kernel.h"
#include "operators/kernel/dequant_bn_kernel.h"
#include "operators/op_param.h"
namespace paddle_mobile {
......@@ -43,7 +43,8 @@ class FusionDequantAddBNMatcher : public framework::FusionOpMatcher {
{{"Scale", "BNScale"},
{"Mean", "BNMean"},
{"Bias", "BNBias"},
{"Variance", "BNVariance"}}}},
{"Variance", "BNVariance"},
{"Y", "Out"}}}},
removed_nodes);
}
......
......@@ -20,7 +20,7 @@ limitations under the License. */
#include <vector>
#include "framework/operator.h"
#include "framework/program/program-optimize/fusion_op_register.h"
#include "operators/kernel/dequant_bn_relu_kernel.h"
#include "operators/kernel/dequant_bn_kernel.h"
#include "operators/op_param.h"
namespace paddle_mobile {
......@@ -44,7 +44,8 @@ class FusionDequantAddBNReluMatcher : public framework::FusionOpMatcher {
{{"Scale", "BNScale"},
{"Mean", "BNMean"},
{"Bias", "BNBias"},
{"Variance", "BNVariance"}}}},
{"Variance", "BNVariance"},
{"Y", "Out"}}}},
removed_nodes);
}
......@@ -54,7 +55,7 @@ class FusionDequantAddBNReluMatcher : public framework::FusionOpMatcher {
template <typename DeviceType, typename T>
class FusionDequantAddBNReluOp
: public framework::OperatorWithKernel<
DeviceType, FusionDequantAddBNReluParam<DeviceType>,
DeviceType, FusionDequantAddBNParam<DeviceType>,
operators::FusionDequantAddBNReluKernel<DeviceType, T>> {
public:
FusionDequantAddBNReluOp(const std::string &type,
......@@ -63,7 +64,7 @@ class FusionDequantAddBNReluOp
const framework::AttributeMap &attrs,
std::shared_ptr<framework::Scope> scope)
: framework::OperatorWithKernel<
DeviceType, FusionDequantAddBNReluParam<DeviceType>,
DeviceType, FusionDequantAddBNParam<DeviceType>,
operators::FusionDequantAddBNReluKernel<DeviceType, T>>(
type, inputs, outputs, attrs, scope) {}
// inference output shape
......
......@@ -12,50 +12,51 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef FUSION_FC_INT8_OP
#include "operators/fusion_fc_int8_op.h"
#include "operators/fusion_dequant_add_bn_relu_quant_op.h"
#ifdef FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP
namespace paddle_mobile {
namespace operators {
template <typename Dtype, typename T>
void FusionFcInt8Op<Dtype, T>::InferShape() const {
auto x_dims = this->param_.InputX()->dims();
auto y_dims = this->param_.InputY()->dims();
int x_num_col_dims = this->param_.XNumColDims();
int y_num_col_dims = this->param_.YNumColDims();
assert(x_dims.size() > x_num_col_dims);
assert(y_dims.size() > y_num_col_dims);
/// (1,2,3,4) , x_num_col_dims = 2 -> (2,12)
auto x_mat_dims = framework::flatten_to_2d(x_dims, x_num_col_dims);
auto y_mat_dims = framework::flatten_to_2d(y_dims, y_num_col_dims);
void FusionDequantAddBNReluQuantOp<Dtype, T>::InferShape() const {
const auto& input_dims = this->param_.input_->dims();
this->param_.output_->Resize(input_dims);
}
assert(x_mat_dims[1] == y_mat_dims[0]);
} // namespace operators
} // namespace paddle_mobile
std::vector<int64_t> output_dims;
output_dims.reserve(
static_cast<size_t>(x_num_col_dims + y_dims.size() - y_num_col_dims));
namespace ops = paddle_mobile::operators;
REGISTER_FUSION_MATCHER(fusion_dequant_add_bn_relu_quant,
ops::FusionDequantAddBNReluQuantMatcher);
for (int i = 0; i < x_num_col_dims; ++i) {
output_dims.push_back(x_dims[i]);
}
#ifdef PADDLE_MOBILE_CPU
REGISTER_OPERATOR_CPU(fusion_dequant_add_bn_relu_quant,
ops::FusionDequantAddBNReluQuantOp);
#endif
#endif // FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP
for (int i = y_num_col_dims; i < y_dims.size(); ++i) {
output_dims.push_back(y_dims[i]);
}
#ifdef FUSION_DEQUANT_ADD_BN_QUANT_OP
namespace paddle_mobile {
namespace operators {
framework::DDim ddim = framework::make_ddim(output_dims);
this->param_.Out()->Resize(ddim);
template <typename Dtype, typename T>
void FusionDequantAddBNQuantOp<Dtype, T>::InferShape() const {
const auto& input_dims = this->param_.input_->dims();
this->param_.output_->Resize(input_dims);
}
} // namespace operators
} // namespace paddle_mobile
namespace ops = paddle_mobile::operators;
REGISTER_FUSION_MATCHER(fusion_dequant_add_bn_quant,
ops::FusionDequantAddBNQuantMatcher);
#ifdef PADDLE_MOBILE_CPU
REGISTER_OPERATOR_CPU_INT8(fusion_fc_int8, ops::FusionFcInt8Op);
REGISTER_OPERATOR_CPU(fusion_dequant_add_bn_quant,
ops::FusionDequantAddBNQuantOp);
#endif
#endif // FUSION_FC_INT8_OP
#endif // FUSION_DEQUANT_ADD_BN_QUANT_OP
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
#include <string>
#include <vector>
#include "framework/operator.h"
#include "framework/program/program-optimize/fusion_op_register.h"
#include "operators/kernel/dequant_bn_kernel.h"
#include "operators/op_param.h"
namespace paddle_mobile {
namespace operators {
#ifdef FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP
class FusionDequantAddBNReluQuantMatcher : public framework::FusionOpMatcher {
public:
FusionDequantAddBNReluQuantMatcher() {
node_ = framework::Node(G_OP_TYPE_DEQUANTIZE);
node_ > std::make_shared<framework::Node>(G_OP_TYPE_ELEMENTWISE_ADD) >
std::make_shared<framework::Node>(G_OP_TYPE_BATCHNORM) >
std::make_shared<framework::Node>(G_OP_TYPE_RELU) >
std::make_shared<framework::Node>(G_OP_TYPE_QUANTIZE);
}
void FolderNodes(
framework::Node *node,
std::vector<std::shared_ptr<framework::Node>> *removed_nodes) {
node->Folder(node_.Depth(), Type(),
{{G_OP_TYPE_ELEMENTWISE_ADD, {{"Y", "Y"}}},
{G_OP_TYPE_BATCHNORM,
{{"Scale", "BNScale"},
{"Mean", "BNMean"},
{"Bias", "BNBias"},
{"Variance", "BNVariance"},
{"Y", "Out"}}}},
removed_nodes);
}
std::string Type() { return G_OP_TYPE_FUSION_DEQUANT_ADD_BN_RELU_QUANT; }
};
template <typename DeviceType, typename T>
class FusionDequantAddBNReluQuantOp
: public framework::OperatorWithKernel<
DeviceType, FusionDequantAddBNReluQuantParam<DeviceType>,
operators::FusionDequantAddBNReluQuantKernel<DeviceType, T>> {
public:
FusionDequantAddBNReluQuantOp(const std::string &type,
const VariableNameMap &inputs,
const VariableNameMap &outputs,
const framework::AttributeMap &attrs,
std::shared_ptr<framework::Scope> scope)
: framework::OperatorWithKernel<
DeviceType, FusionDequantAddBNReluQuantParam<DeviceType>,
operators::FusionDequantAddBNReluQuantKernel<DeviceType, T>>(
type, inputs, outputs, attrs, scope) {}
// inference output shape
void InferShape() const override;
};
#endif // FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP
#ifdef FUSION_DEQUANT_ADD_BN_QUANT_OP
class FusionDequantAddBNQuantMatcher : public framework::FusionOpMatcher {
public:
FusionDequantAddBNQuantMatcher() {
node_ = framework::Node(G_OP_TYPE_DEQUANTIZE);
node_ > std::make_shared<framework::Node>(G_OP_TYPE_ELEMENTWISE_ADD) >
std::make_shared<framework::Node>(G_OP_TYPE_BATCHNORM) >
std::make_shared<framework::Node>(G_OP_TYPE_QUANTIZE);
}
void FolderNodes(
framework::Node *node,
std::vector<std::shared_ptr<framework::Node>> *removed_nodes) {
node->Folder(node_.Depth(), Type(),
{{G_OP_TYPE_ELEMENTWISE_ADD, {{"Y", "Y"}}},
{G_OP_TYPE_BATCHNORM,
{{"Scale", "BNScale"},
{"Mean", "BNMean"},
{"Bias", "BNBias"},
{"Variance", "BNVariance"},
{"Y", "Out"}}}},
removed_nodes);
}
std::string Type() { return G_OP_TYPE_FUSION_DEQUANT_ADD_BN_QUANT; }
};
template <typename DeviceType, typename T>
class FusionDequantAddBNQuantOp
: public framework::OperatorWithKernel<
DeviceType, FusionDequantAddBNQuantParam<DeviceType>,
operators::FusionDequantAddBNQuantKernel<DeviceType, T>> {
public:
FusionDequantAddBNQuantOp(const std::string &type,
const VariableNameMap &inputs,
const VariableNameMap &outputs,
const framework::AttributeMap &attrs,
std::shared_ptr<framework::Scope> scope)
: framework::OperatorWithKernel<
DeviceType, FusionDequantAddBNQuantParam<DeviceType>,
operators::FusionDequantAddBNQuantKernel<DeviceType, T>>(
type, inputs, outputs, attrs, scope) {}
// inference output shape
void InferShape() const override;
};
#endif // FUSION_DEQUANT_ADD_BN_QUANT_OP
} // namespace operators
} // namespace paddle_mobile
......@@ -12,28 +12,43 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef FUSION_DEQUANT_BN_RELU_OP
#include "operators/fusion_dequant_bn_relu_op.h"
#include "operators/fusion_dequant_bn_op.h"
namespace paddle_mobile {
namespace operators {
#ifdef FUSION_DEQUANT_BN_OP
template <typename Dtype, typename T>
void FusionDequantBNOp<Dtype, T>::InferShape() const {
const auto& input_dims = this->param_.input_->dims();
this->param_.output_->Resize(input_dims);
}
#endif // FUSION_DEQUANT_BN_OP
#ifdef FUSION_DEQUANT_BN_RELU_OP
template <typename Dtype, typename T>
void FusionDequantBNReluOp<Dtype, T>::InferShape() const {
const auto& input_dims = this->param_.input_->dims();
this->param_.output_->Resize(input_dims);
}
#endif // FUSION_DEQUANT_BN_RELU_OP
} // namespace operators
} // namespace paddle_mobile
namespace ops = paddle_mobile::operators;
#ifdef FUSION_DEQUANT_BN_OP
REGISTER_FUSION_MATCHER(fusion_dequant_bn, ops::FusionDequantBNMatcher);
#ifdef PADDLE_MOBILE_CPU
REGISTER_OPERATOR_CPU(fusion_dequant_bn, ops::FusionDequantBNOp);
#endif // PADDLE_MOBILE_CPU
#endif // FUSION_DEQUANT_BN_OP
#ifdef FUSION_DEQUANT_BN_RELU_OP
REGISTER_FUSION_MATCHER(fusion_dequant_bn_relu,
ops::FusionDequantBNReluMatcher);
#ifdef PADDLE_MOBILE_CPU
REGISTER_OPERATOR_CPU(fusion_dequant_bn_relu, ops::FusionDequantBNReluOp);
#endif
#endif
#endif // PADDLE_MOBILE_CPU
#endif // FUSION_DEQUANT_BN_RELU_OP
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
#include <string>
#include <vector>
#include "framework/operator.h"
#include "framework/program/program-optimize/fusion_op_register.h"
#include "operators/kernel/dequant_bn_kernel.h"
#include "operators/op_param.h"
namespace paddle_mobile {
namespace operators {
#if defined(FUSION_DEQUANT_BN_OP) || defined(FUSION_DEQUANT_BN_RELU_OP)
class FusionDequantBNMatcher : public framework::FusionOpMatcher {
public:
FusionDequantBNMatcher() {
node_ = framework::Node(G_OP_TYPE_DEQUANTIZE);
node_ > std::make_shared<framework::Node>(G_OP_TYPE_BATCHNORM);
}
virtual void FolderNodes(
framework::Node *node,
std::vector<std::shared_ptr<framework::Node>> *removed_nodes) {
node->Folder(node_.Depth(), Type(),
{{G_OP_TYPE_BATCHNORM,
{{"Scale", "BNScale"},
{"Mean", "BNMean"},
{"Bias", "BNBias"},
{"Variance", "BNVariance"},
{"Y", "Out"}}}},
removed_nodes);
}
std::string Type() override { return G_OP_TYPE_FUSION_DEQUANT_BN; }
};
#endif // FUSION_DEQUANT_BN_OP || FUSION_DEQUANT_BN_RELU_OP
#ifdef FUSION_DEQUANT_BN_OP
template <typename DeviceType, typename T>
class FusionDequantBNOp : public framework::OperatorWithKernel<
DeviceType, FusionDequantBNParam<DeviceType>,
operators::FusionDequantBNKernel<DeviceType, T>> {
public:
FusionDequantBNOp(const std::string &type, const VariableNameMap &inputs,
const VariableNameMap &outputs,
const framework::AttributeMap &attrs,
std::shared_ptr<framework::Scope> scope)
: framework::OperatorWithKernel<
DeviceType, FusionDequantBNParam<DeviceType>,
operators::FusionDequantBNKernel<DeviceType, T>>(
type, inputs, outputs, attrs, scope) {}
// inference output shape
void InferShape() const override;
};
#endif // FUSION_DEQUANT_BN_OP
#ifdef FUSION_DEQUANT_BN_RELU_OP
class FusionDequantBNReluMatcher : public FusionDequantBNMatcher {
public:
FusionDequantBNReluMatcher() : FusionDequantBNMatcher() {
node_ > std::make_shared<framework::Node>(G_OP_TYPE_RELU);
}
virtual std::string Type() { return G_OP_TYPE_FUSION_DEQUANT_BN_RELU; }
};
template <typename DeviceType, typename T>
class FusionDequantBNReluOp
: public framework::OperatorWithKernel<
DeviceType, FusionDequantBNParam<DeviceType>,
operators::FusionDequantBNReluKernel<DeviceType, T>> {
public:
FusionDequantBNReluOp(const std::string &type, const VariableNameMap &inputs,
const VariableNameMap &outputs,
const framework::AttributeMap &attrs,
std::shared_ptr<framework::Scope> scope)
: framework::OperatorWithKernel<
DeviceType, FusionDequantBNParam<DeviceType>,
operators::FusionDequantBNReluKernel<DeviceType, T>>(
type, inputs, outputs, attrs, scope) {}
void InferShape() const override;
};
#endif // FUSION_DEQUANT_BN_RELU_OP
} // namespace operators
} // namespace paddle_mobile
......@@ -42,7 +42,8 @@ class FusionDequantBNReluMatcher : public framework::FusionOpMatcher {
{{"Scale", "BNScale"},
{"Mean", "BNMean"},
{"Bias", "BNBias"},
{"Variance", "BNVariance"}}}},
{"Variance", "BNVariance"},
{"Y", "Out"}}}},
removed_nodes);
}
......
......@@ -14,19 +14,15 @@ limitations under the License. */
#ifdef GRU_OP
#include "operators/gru_op.h"
#include <vector>
#include "common/enforce.h"
#include "operators/gru_op.h"
namespace paddle_mobile {
namespace operators {
template <typename Dtype, typename T>
void GruOp<Dtype, T>::InferShape() const {
auto lod_size = this->param_.InputInput()->lod().size();
PADDLE_MOBILE_ENFORCE((lod_size == 1),
"Current LoD only supports one dimension.");
auto input_dims = this->param_.InputInput()->dims();
auto weight_dims = this->param_.InputWeight()->dims();
int input_size = input_dims[1];
......
......@@ -15,6 +15,7 @@ limitations under the License. */
#ifdef IM2SEQUENCE_OP
#include "operators/im2sequence_op.h"
#include <vector>
namespace paddle_mobile {
namespace operators {
......@@ -29,20 +30,16 @@ int Im2SequenceOutputSize(int input_size, int kernel, int padding_1,
template <typename Dtype, typename T>
void Im2SequenceOp<Dtype, T>::InferShape() const {
auto in_x_dims = this->param_.Input()->dims();
const std::vector<int> &kernels = this->param_.Kernels();
const std::vector<int> &strides = this->param_.Strides();
std::vector<int> paddings = this->param_.Paddings();
std::vector<int64_t> output_shape({in_x_dims[0], in_x_dims[1]});
for (size_t i = 0; i < strides.size(); ++i) {
output_shape.push_back(Im2SequenceOutputSize(in_x_dims[i + 2], kernels[i],
paddings[i], paddings[i + 2],
strides[i]));
}
framework::DDim ddim = framework::make_ddim(output_shape);
this->param_.Output()->Resize(ddim);
}
......@@ -54,9 +51,5 @@ namespace ops = paddle_mobile::operators;
#ifdef PADDLE_MOBILE_CPU
REGISTER_OPERATOR_CPU(im2sequence, ops::Im2SequenceOp);
#endif
#ifdef PADDLE_MOBILE_MALI_GPU
#endif
#ifdef PADDLE_MOBILE_FPGA
#endif
#endif
#endif // IM2SEQUENCE_OP
......@@ -12,39 +12,63 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef POOL_OP
#ifdef CAST_OP
#pragma once
#ifdef _OPENMP
#include <omp.h>
#endif
#include <algorithm>
#include <vector>
#include "framework/tensor.h"
#if __ARM_NEON
#include <arm_neon.h>
#endif // __ARM_NEON
#include "framework/data_type.h"
#include "operators/kernel/kernels.h"
namespace paddle_mobile {
namespace operators {
namespace math {
void Pool3x3Avgs1p1(const framework::Tensor *input, framework::Tensor *output);
void Pool3x3Maxs1p1(const framework::Tensor *input, framework::Tensor *output);
void Pool3x3Max(std::vector<int> strides, std::vector<int> paddings,
const framework::Tensor *input, framework::Tensor *output);
void Pool3x3Avg(std::vector<int> strides, std::vector<int> paddings,
const framework::Tensor *in_x, framework::Tensor *out);
void Pool3x3Maxs1_int8(const framework::Tensor *input,
framework::Tensor *output, int32_t pad_h, int32_t pad_w);
void Pool3x3Maxs2_int8(const framework::Tensor *input,
framework::Tensor *output, int32_t pad_h, int32_t pad_w);
void Pool3x3Max_int8(const std::vector<int> &strides,
const std::vector<int> &paddings,
const framework::Tensor *input, framework::Tensor *output);
} // namespace math
template <typename InT>
struct CastOutOpFunctor {
const framework::Tensor* in_;
framework::Tensor* out_;
CastOutOpFunctor(const framework::Tensor* in, framework::Tensor* out)
: in_(in), out_(out) {}
template <typename OutT>
void apply() const {
const InT* input = in_->data<InT>();
OutT* output = out_->mutable_data<OutT>();
size_t numel = in_->numel();
for (size_t i = 0; i < numel; ++i) {
output[i] = static_cast<OutT>(input[i]);
}
}
};
struct CastOpFunctor {
const framework::Tensor* in_;
framework::Tensor* out_;
int output_type_;
CastOpFunctor(const framework::Tensor* in, framework::Tensor* out,
const int output_type)
: in_(in), out_(out), output_type_(output_type) {}
template <typename InT>
void apply() const {
framework::VisitDataType(framework::ToDataType(output_type_),
CastOutOpFunctor<InT>(in_, out_));
}
};
template <>
bool CastKernel<CPU, float>::Init(CastParam<CPU>* param) {
return true;
}
template <>
void CastKernel<CPU, float>::Compute(const CastParam<CPU>& param) {
const Tensor* input = param.input_;
Tensor* output = param.output_;
framework::VisitDataType(framework::ToDataType(param.input_type_),
CastOpFunctor(input, output, param.output_type_));
}
} // namespace operators
} // namespace paddle_mobile
#endif
#endif // CAST_OP
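The kernel above resolves the (input type, output type) pair with two nested VisitDataType calls: the outer visit fixes InT through CastOpFunctor, the inner one fixes OutT through CastOutOpFunctor<InT>. A self-contained model of that double dispatch (toy type enum, not the framework's proto types):
#include <cstdio>
#include <vector>
enum class DType { kInt32, kFloat32 };
template <typename Functor>
void VisitDataType(DType type, Functor functor) {
  switch (type) {
    case DType::kInt32:   functor.template apply<int>();   break;
    case DType::kFloat32: functor.template apply<float>(); break;
  }
}
template <typename InT>
struct CastTo {  // plays the role of CastOutOpFunctor<InT>
  const InT *in;
  void *out;
  int n;
  template <typename OutT>
  void apply() const {
    OutT *dst = static_cast<OutT *>(out);
    for (int i = 0; i < n; ++i) dst[i] = static_cast<OutT>(in[i]);
  }
};
struct CastFrom {  // plays the role of CastOpFunctor
  const void *in;
  void *out;
  int n;
  DType out_type;
  template <typename InT>
  void apply() const {
    VisitDataType(out_type, CastTo<InT>{static_cast<const InT *>(in), out, n});
  }
};
int main() {
  std::vector<int> src = {1, 2, 3};
  std::vector<float> dst(3);
  VisitDataType(DType::kInt32, CastFrom{src.data(), dst.data(), 3, DType::kFloat32});
  std::printf("%.1f %.1f %.1f\n", dst[0], dst[1], dst[2]);  // 1.0 2.0 3.0
  return 0;
}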
......@@ -15,6 +15,7 @@ limitations under the License. */
#ifdef FUSION_CONVADDBNRELU_OP
#include "operators/kernel/conv_add_bn_relu_kernel.h"
#include <math.h>
#include "operators/kernel/central-arm-func/conv_add_bn_relu_arm_func.h"
namespace paddle_mobile {
......
......@@ -32,20 +32,6 @@ void ConvAddReluKernel<CPU, float>::Compute(
}
template class ConvAddReluKernel<CPU, float>;
#ifdef FUSION_CONVADDRELU_INT8_OP
template <>
bool ConvAddReluKernel<CPU, int8_t>::Init(FusionConvAddReluParam<CPU> *param) {
return true;
}
template <>
void ConvAddReluKernel<CPU, int8_t>::Compute(
const FusionConvAddReluParam<CPU> &param) {
ConvAddReluCompute<int8_t, int32_t>(param);
}
template class ConvAddReluKernel<CPU, int8_t>;
#endif
} // namespace operators
} // namespace paddle_mobile
......
......@@ -55,10 +55,9 @@ bool ConvKernel<CPU, float>::Init(ConvParam<CPU> *param) {
param->Input()->dims()[2] <= 140 /* referred from ncnn */) {
param->ExecMode() = ConvParam<CPU>::EXEC_WINOGRAD3X3_FLOAT;
// transform weight
framework::Tensor transformed_weight;
operators::math::winograd_transform_weight<8, 3>(*param->Filter(),
&transformed_weight);
framework::TensorCopy(transformed_weight, param->Filter());
param->transformed_filter_ = new framework::Tensor;
operators::math::winograd_transform_weight<8, 3>(
*param->Filter(), param->transformed_filter_);
#endif
} else {
param->ExecMode() = ConvParam<CPU>::EXEC_GEMM_FLOAT;
......
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef FUSION_DEQUANT_ADD_BN_OP
#include "operators/kernel/dequant_add_bn_kernel.h"
#include <cmath>
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>
#endif
namespace paddle_mobile {
namespace operators {
template <>
bool FusionDequantAddBNKernel<CPU, float>::Init(
FusionDequantAddBNParam<CPU> *param) {
// elementwise add params
const Tensor *bias = param->bias_;
// batch norm params
const Tensor *bn_mean = param->bn_mean_;
const Tensor *bn_variance = param->bn_variance_;
Tensor *bn_scale = param->bn_scale_;
Tensor *bn_bias = param->bn_bias_;
const float epsilon = param->epsilon_;
const float *bias_ptr = bias->data<float>();
const float *mean_ptr = bn_mean->data<float>();
const float *var_ptr = bn_variance->data<float>();
float *bn_scale_ptr = bn_scale->mutable_data<float>();
float *bn_bias_ptr = bn_bias->mutable_data<float>();
for (int c = 0; c < bn_scale->numel(); ++c) {
float inv_scale = bn_scale_ptr[c] / (std::sqrt(var_ptr[c] + epsilon));
bn_scale_ptr[c] = inv_scale;
bn_bias_ptr[c] = inv_scale * (bias_ptr[c] - mean_ptr[c]) + bn_bias_ptr[c];
}
return true;
}
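The channel-wise folding in Init() comes from pushing the elementwise bias b through batch norm; with per-channel statistics mu and sigma^2, learned gamma and beta, and stabilizer epsilon:
$$\mathrm{bn}(x+b)=\gamma\,\frac{(x+b)-\mu}{\sqrt{\sigma^{2}+\epsilon}}+\beta=\frac{\gamma}{\sqrt{\sigma^{2}+\epsilon}}\,x+\Big(\frac{\gamma\,(b-\mu)}{\sqrt{\sigma^{2}+\epsilon}}+\beta\Big)$$
so bn_scale_ptr[c] ends up holding the folded scale and bn_bias_ptr[c] the folded bias, leaving Compute() a single multiply-add per element.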
template <>
void FusionDequantAddBNKernel<CPU, float>::Compute(
const FusionDequantAddBNParam<CPU> &param) {
const int32_t *input = param.input_->data<int32_t>();
const float *bn_scale = param.bn_scale_->data<float>();
const float *bn_bias = param.bn_bias_->data<float>();
// dequantize params
const float activation_scale = param.activation_scale_->data<float>()[0];
const float weight_scale = param.weight_scale_;
const float dequant_scale = activation_scale / weight_scale;
float *output = param.output_->mutable_data<float>();
int batch_size = param.input_->dims()[0];
int channels = param.input_->dims()[1];
size_t spatial_size = param.input_->dims()[2] * param.input_->dims()[3];
#pragma omp parallel for collapse(2)
for (int batch = 0; batch < batch_size; ++batch) {
for (int c = 0; c < channels; ++c) {
float scale = bn_scale[c] * dequant_scale;
float bias = bn_bias[c];
size_t offset = (batch * channels + c) * spatial_size;
const int32_t *x = input + offset;
float *y = output + offset;
size_t remain = spatial_size;
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
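      // NEON fast path: handle 16 int32 values per iteration (four float32x4
      // registers), convert to float, and apply y = scale * x + bias with
      // vmlaq; the scalar loop after the #endif covers the remaining elements.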
int loop = spatial_size >> 4;
remain = spatial_size & 0xF;
float32x4_t __scale = vdupq_n_f32(scale);
float32x4_t __bias = vdupq_n_f32(bias);
for (int k = 0; k < loop; ++k, x += 16, y += 16) {
int32x4_t r0 = vld1q_s32(x);
int32x4_t r1 = vld1q_s32(x + 4);
int32x4_t r2 = vld1q_s32(x + 8);
int32x4_t r3 = vld1q_s32(x + 12);
float32x4_t f0 = vcvtq_f32_s32(r0);
float32x4_t f1 = vcvtq_f32_s32(r1);
float32x4_t f2 = vcvtq_f32_s32(r2);
float32x4_t f3 = vcvtq_f32_s32(r3);
f0 = vmlaq_f32(__bias, __scale, f0);
f1 = vmlaq_f32(__bias, __scale, f1);
f2 = vmlaq_f32(__bias, __scale, f2);
f3 = vmlaq_f32(__bias, __scale, f3);
vst1q_f32(y, f0);
vst1q_f32(y + 4, f1);
vst1q_f32(y + 8, f2);
vst1q_f32(y + 12, f3);
}
#endif // __ARM_NEON__
for (int k = 0; k < remain; ++k) {
y[k] = scale * x[k] + bias;
}
}
}
}
} // namespace operators
} // namespace paddle_mobile
#endif // FUSION_DEQUANT_ADD_BN_OP
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "operators/kernel/dequant_bn_relu_kernel.h"
#include <cmath>
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>
#endif
namespace paddle_mobile {
namespace operators {
#if defined(FUSION_DEQUANT_BN_RELU_OP) || defined(FUSION_DEQUANT_ADD_BN_RELU_OP)
void DequantBNReluCompute(const FusionDequantBNParam<CPU> *param) {
const int32_t *input = param->input_->data<int32_t>();
const float *bn_scale = param->bn_scale_->data<float>();
const float *bn_bias = param->bn_bias_->data<float>();
// dequantize params
const float activation_scale = param->activation_scale_->data<float>()[0];
const float weight_scale = param->weight_scale_;
const float dequant_scale = activation_scale / weight_scale;
float *output = param->output_->mutable_data<float>();
int batch_size = param->input_->dims()[0];
int channels = param->input_->dims()[1];
size_t spatial_size = param->input_->dims()[2] * param->input_->dims()[3];
#pragma omp parallel for collapse(2)
for (int batch = 0; batch < batch_size; ++batch) {
for (int c = 0; c < channels; ++c) {
float scale = bn_scale[c] * dequant_scale;
float bias = bn_bias[c];
size_t offset = (batch * channels + c) * spatial_size;
const int32_t *x = input + offset;
float *y = output + offset;
size_t remain = spatial_size;
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
int loop = spatial_size >> 4;
remain = spatial_size & 0xF;
float32x4_t __scale = vdupq_n_f32(scale);
float32x4_t __bias = vdupq_n_f32(bias);
float32x4_t __zero = vdupq_n_f32(0.f);
for (int k = 0; k < loop; ++k, x += 16, y += 16) {
int32x4_t r0 = vld1q_s32(x);
int32x4_t r1 = vld1q_s32(x + 4);
int32x4_t r2 = vld1q_s32(x + 8);
int32x4_t r3 = vld1q_s32(x + 12);
float32x4_t f0 = vcvtq_f32_s32(r0);
float32x4_t f1 = vcvtq_f32_s32(r1);
float32x4_t f2 = vcvtq_f32_s32(r2);
float32x4_t f3 = vcvtq_f32_s32(r3);
f0 = vmlaq_f32(__bias, __scale, f0);
f1 = vmlaq_f32(__bias, __scale, f1);
f2 = vmlaq_f32(__bias, __scale, f2);
f3 = vmlaq_f32(__bias, __scale, f3);
f0 = vmaxq_f32(__zero, f0);
f1 = vmaxq_f32(__zero, f1);
f2 = vmaxq_f32(__zero, f2);
f3 = vmaxq_f32(__zero, f3);
vst1q_f32(y, f0);
vst1q_f32(y + 4, f1);
vst1q_f32(y + 8, f2);
vst1q_f32(y + 12, f3);
}
#endif // __ARM_NEON__
for (int k = 0; k < remain; ++k) {
y[k] = std::max(scale * x[k] + bias, 0.f);
}
}
}
}
#endif
#ifdef FUSION_DEQUANT_BN_RELU_OP
template <>
bool FusionDequantBNReluKernel<CPU, float>::Init(
FusionDequantBNReluParam<CPU> *param) {
// batch norm params
const Tensor *bn_mean = param->bn_mean_;
const Tensor *bn_variance = param->bn_variance_;
Tensor *bn_scale = param->bn_scale_;
Tensor *bn_bias = param->bn_bias_;
const float epsilon = param->epsilon_;
const float *mean_ptr = bn_mean->data<float>();
const float *var_ptr = bn_variance->data<float>();
float *bn_scale_ptr = bn_scale->mutable_data<float>();
float *bn_bias_ptr = bn_bias->mutable_data<float>();
for (int c = 0; c < bn_scale->numel(); ++c) {
float inv_scale = bn_scale_ptr[c] / (std::sqrt(var_ptr[c] + epsilon));
bn_scale_ptr[c] = inv_scale;
bn_bias_ptr[c] = bn_bias_ptr[c] - inv_scale * mean_ptr[c];
}
return true;
}
template <>
void FusionDequantBNReluKernel<CPU, float>::Compute(
const FusionDequantBNReluParam<CPU> &param) {
DequantBNReluCompute(&param);
}
#endif // FUSION_DEQUANT_BN_RELU_OP
#ifdef FUSION_DEQUANT_ADD_BN_RELU_OP
template <>
bool FusionDequantAddBNReluKernel<CPU, float>::Init(
FusionDequantAddBNReluParam<CPU> *param) {
// elementwise add params
const Tensor *bias = param->bias_;
// batch norm params
const Tensor *bn_mean = param->bn_mean_;
const Tensor *bn_variance = param->bn_variance_;
Tensor *bn_scale = param->bn_scale_;
Tensor *bn_bias = param->bn_bias_;
const float epsilon = param->epsilon_;
const float *bias_ptr = bias->data<float>();
const float *mean_ptr = bn_mean->data<float>();
const float *var_ptr = bn_variance->data<float>();
float *bn_scale_ptr = bn_scale->mutable_data<float>();
float *bn_bias_ptr = bn_bias->mutable_data<float>();
for (int c = 0; c < bn_scale->numel(); ++c) {
float inv_scale = bn_scale_ptr[c] / (std::sqrt(var_ptr[c] + epsilon));
bn_scale_ptr[c] = inv_scale;
bn_bias_ptr[c] = inv_scale * (bias_ptr[c] - mean_ptr[c]) + bn_bias_ptr[c];
}
return true;
}
template <>
void FusionDequantAddBNReluKernel<CPU, float>::Compute(
const FusionDequantAddBNReluParam<CPU> &param) {
DequantBNReluCompute(&param);
}
#endif // FUSION_DEQUANT_ADD_BN_RELU_OP
} // namespace operators
} // namespace paddle_mobile
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <cmath>
#include "operators/kernel/dequant_bn_kernel.h"
#include "operators/math/activation.h"
#include "operators/math/quantize.h"
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>
#endif
namespace paddle_mobile {
namespace operators {
#if defined(FUSION_DEQUANT_BN_OP) || defined(FUSION_DEQUANT_ADD_BN_OP) || \
defined(FUSION_DEQUANT_BN_RELU_OP) || \
defined(FUSION_DEQUANT_ADD_BN_RELU_OP) || \
defined(FUSION_DEQUANT_ADD_BN_QUANT_OP) || \
defined(FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP)
void PublicFusionDequantBNInitParam(FusionDequantBNParam<CPU> *param,
const framework::Tensor *bias) {
// batch norm params
const Tensor *bn_mean = param->bn_mean_;
const Tensor *bn_variance = param->bn_variance_;
Tensor *bn_scale = param->bn_scale_;
Tensor *bn_bias = param->bn_bias_;
const float epsilon = param->epsilon_;
const float *mean_ptr = bn_mean->data<float>();
const float *var_ptr = bn_variance->data<float>();
float *bn_scale_ptr = bn_scale->mutable_data<float>();
float *bn_bias_ptr = bn_bias->mutable_data<float>();
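// Shared folding helper: with an optional elementwise-add bias b
// (b == 0 when `bias` is null),
//   scale' = scale / sqrt(var + eps),  bias' = scale' * (b - mean) + bn_bias,
// which covers both the plain BN fusions and the add+BN fusions.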
for (int c = 0; c < bn_scale->numel(); ++c) {
float inv_scale = 1.f / (std::sqrt(var_ptr[c] + epsilon));
float val = bias ? bias->data<float>()[c] : 0;
bn_bias_ptr[c] =
inv_scale * bn_scale_ptr[c] * (val - mean_ptr[c]) + bn_bias_ptr[c];
bn_scale_ptr[c] = inv_scale * bn_scale_ptr[c];
}
}
#endif
#if defined(FUSION_DEQUANT_BN_OP) || defined(FUSION_DEQUANT_ADD_BN_OP) || \
defined(FUSION_DEQUANT_BN_RELU_OP) || \
defined(FUSION_DEQUANT_ADD_BN_RELU_OP)
template <ActivationType Act>
void DequantBNCompute(const FusionDequantBNParam<CPU> *param) {
const int32_t *input = param->input_->data<int32_t>();
const float *bn_scale = param->bn_scale_->data<float>();
const float *bn_bias = param->bn_bias_->data<float>();
// dequantize params
const float activation_scale = param->activation_scale_->data<float>()[0];
const float weight_scale = param->weight_scale_;
const float dequant_scale = activation_scale / weight_scale;
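// int32 conv outputs are restored to float by multiplying with
// activation_scale / weight_scale, the same formula DequantizeKernel uses.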
float *output = param->output_->mutable_data<float>();
int batch_size = param->input_->dims()[0];
int channels = param->input_->dims()[1];
size_t spatial_size = param->input_->dims()[2] * param->input_->dims()[3];
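// NCHW layout: each (batch, channel) pair owns a contiguous slab of
// spatial_size elements, which lets the two outer loops be collapsed
// for OpenMP.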
#pragma omp parallel for collapse(2)
for (int batch = 0; batch < batch_size; ++batch) {
for (int c = 0; c < channels; ++c) {
// not fuse bn and dequant scale to minimize precision difference
// float scale = bn_scale[c] * dequant_scale;
float scale = bn_scale[c];
float bias = bn_bias[c];
size_t offset = (batch * channels + c) * spatial_size;
const int32_t *x = input + offset;
float *y = output + offset;
size_t remain = spatial_size;
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
int loop = spatial_size >> 4;
remain = spatial_size & 0xF;
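// Vectorized path handles 16 elements (4 x float32x4) per iteration;
// the remaining spatial_size % 16 elements fall through to the scalar
// loop below.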
float32x4_t __dequant_scale = vdupq_n_f32(dequant_scale);
float32x4_t __scale = vdupq_n_f32(scale);
float32x4_t __bias = vdupq_n_f32(bias);
for (int k = 0; k < loop; ++k, x += 16, y += 16) {
int32x4_t r0 = vld1q_s32(x);
int32x4_t r1 = vld1q_s32(x + 4);
int32x4_t r2 = vld1q_s32(x + 8);
int32x4_t r3 = vld1q_s32(x + 12);
float32x4_t f0 = vcvtq_f32_s32(r0);
float32x4_t f1 = vcvtq_f32_s32(r1);
float32x4_t f2 = vcvtq_f32_s32(r2);
float32x4_t f3 = vcvtq_f32_s32(r3);
f0 = vmulq_f32(__dequant_scale, f0);
f1 = vmulq_f32(__dequant_scale, f1);
f2 = vmulq_f32(__dequant_scale, f2);
f3 = vmulq_f32(__dequant_scale, f3);
f0 = vmlaq_f32(__bias, __scale, f0);
f1 = vmlaq_f32(__bias, __scale, f1);
f2 = vmlaq_f32(__bias, __scale, f2);
f3 = vmlaq_f32(__bias, __scale, f3);
f0 = math::vActiveq_f32<Act>(f0);
f1 = math::vActiveq_f32<Act>(f1);
f2 = math::vActiveq_f32<Act>(f2);
f3 = math::vActiveq_f32<Act>(f3);
vst1q_f32(y, f0);
vst1q_f32(y + 4, f1);
vst1q_f32(y + 8, f2);
vst1q_f32(y + 12, f3);
}
#endif // __ARM_NEON__
for (int k = 0; k < remain; ++k) {
y[k] = math::Active<Act>(scale * (dequant_scale * x[k]) + bias);
}
}
}
}
#endif
#ifdef FUSION_DEQUANT_BN_OP
template <>
bool FusionDequantBNKernel<CPU, float>::Init(FusionDequantBNParam<CPU> *param) {
PublicFusionDequantBNInitParam(param, nullptr);
return true;
}
template <>
void FusionDequantBNKernel<CPU, float>::Compute(
const FusionDequantBNParam<CPU> &param) {
DequantBNCompute<IDENTITY>(&param);
}
#endif // FUSION_DEQUANT_BN_OP
#ifdef FUSION_DEQUANT_BN_RELU_OP
template <>
bool FusionDequantBNReluKernel<CPU, float>::Init(
FusionDequantBNParam<CPU> *param) {
PublicFusionDequantBNInitParam(param, nullptr);
return true;
}
template <>
void FusionDequantBNReluKernel<CPU, float>::Compute(
const FusionDequantBNParam<CPU> &param) {
DequantBNCompute<RELU>(&param);
}
#endif // FUSION_DEQUANT_BN_RELU_OP
#ifdef FUSION_DEQUANT_ADD_BN_OP
template <>
bool FusionDequantAddBNKernel<CPU, float>::Init(
FusionDequantAddBNParam<CPU> *param) {
const framework::Tensor *bias = param->bias_;
PublicFusionDequantBNInitParam(param, bias);
return true;
}
template <>
void FusionDequantAddBNKernel<CPU, float>::Compute(
const FusionDequantAddBNParam<CPU> &param) {
DequantBNCompute<IDENTITY>(&param);
}
#endif // FUSION_DEQUANT_ADD_BN_OP
#ifdef FUSION_DEQUANT_ADD_BN_RELU_OP
template <>
bool FusionDequantAddBNReluKernel<CPU, float>::Init(
FusionDequantAddBNParam<CPU> *param) {
const framework::Tensor *bias = param->bias_;
PublicFusionDequantBNInitParam(param, bias);
return true;
}
template <>
void FusionDequantAddBNReluKernel<CPU, float>::Compute(
const FusionDequantAddBNParam<CPU> &param) {
DequantBNCompute<RELU>(&param);
}
#endif // FUSION_DEQUANT_ADD_BN_RELU_OP
#if defined(FUSION_DEQUANT_ADD_BN_QUANT_OP) || \
defined(FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP)
template <ActivationType Act, RoundType R>
void DequantBNQuantCompute(const FusionDequantAddBNQuantParam<CPU> *param) {
const int32_t *input = param->input_->data<int32_t>();
const float *bn_scale = param->bn_scale_->data<float>();
const float *bn_bias = param->bn_bias_->data<float>();
// dequantize params
const float activation_scale = param->activation_scale_->data<float>()[0];
const float weight_scale = param->weight_scale_;
const float dequant_scale = activation_scale / weight_scale;
// quantize params
Tensor *output_scale = param->online_scale_;
float max_abs = 0.f;
int8_t *output = param->output_->mutable_data<int8_t>();
int batch_size = param->input_->dims()[0];
int channels = param->input_->dims()[1];
size_t spatial_size = param->input_->dims()[2] * param->input_->dims()[3];
// Dynamic (online) scale is not implemented yet (see the TODO below),
// so the static-scale branch is always taken.
// if (param->is_static_) {
if (true) {
max_abs = param->static_scale_;
float quant_scale = 127.f / max_abs;
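// Map the float range [-max_abs, max_abs] onto int8 [-127, 127].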
#pragma omp parallel for collapse(2)
for (int batch = 0; batch < batch_size; ++batch) {
for (int c = 0; c < channels; ++c) {
// not fuse bn and dequant scale to minimize precision difference
// float scale = bn_scale[c] * dequant_scale;
float scale = bn_scale[c];
float bias = bn_bias[c];
size_t offset = (batch * channels + c) * spatial_size;
const int32_t *x = input + offset;
int8_t *y = output + offset;
size_t remain = spatial_size;
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
int loop = spatial_size >> 4;
remain = spatial_size & 0xF;
float32x4_t __dequant_scale = vdupq_n_f32(dequant_scale);
float32x4_t __scale = vdupq_n_f32(scale);
float32x4_t __bias = vdupq_n_f32(bias);
float32x4_t __quant_scale = vdupq_n_f32(quant_scale);
for (int k = 0; k < loop; ++k, x += 16, y += 16) {
int32x4_t r0 = vld1q_s32(x);
int32x4_t r1 = vld1q_s32(x + 4);
int32x4_t r2 = vld1q_s32(x + 8);
int32x4_t r3 = vld1q_s32(x + 12);
float32x4_t f0 = vcvtq_f32_s32(r0);
float32x4_t f1 = vcvtq_f32_s32(r1);
float32x4_t f2 = vcvtq_f32_s32(r2);
float32x4_t f3 = vcvtq_f32_s32(r3);
f0 = vmulq_f32(__dequant_scale, f0);
f1 = vmulq_f32(__dequant_scale, f1);
f2 = vmulq_f32(__dequant_scale, f2);
f3 = vmulq_f32(__dequant_scale, f3);
f0 = vmlaq_f32(__bias, __scale, f0);
f1 = vmlaq_f32(__bias, __scale, f1);
f2 = vmlaq_f32(__bias, __scale, f2);
f3 = vmlaq_f32(__bias, __scale, f3);
f0 = math::vActiveq_f32<Act>(f0);
f1 = math::vActiveq_f32<Act>(f1);
f2 = math::vActiveq_f32<Act>(f2);
f3 = math::vActiveq_f32<Act>(f3);
f0 = vmulq_f32(__quant_scale, f0);
f1 = vmulq_f32(__quant_scale, f1);
f2 = vmulq_f32(__quant_scale, f2);
f3 = vmulq_f32(__quant_scale, f3);
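// Round to int32, then narrow s32 -> s16 -> s8. vmovn truncates rather
// than saturates, so this assumes the static scale really bounds |x|
// and the rounded values fit in int8.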
int32x4_t q0 = math::vRoundq_f32<R>(f0);
int32x4_t q1 = math::vRoundq_f32<R>(f1);
int32x4_t q2 = math::vRoundq_f32<R>(f2);
int32x4_t q3 = math::vRoundq_f32<R>(f3);
int16x4_t d0 = vmovn_s32(q0);
int16x4_t d1 = vmovn_s32(q1);
int16x4_t d2 = vmovn_s32(q2);
int16x4_t d3 = vmovn_s32(q3);
int16x8_t q5 = vcombine_s16(d0, d1);
int16x8_t q6 = vcombine_s16(d2, d3);
int8x8_t d5 = vmovn_s16(q5);
int8x8_t d6 = vmovn_s16(q6);
vst1_s8(y, d5);
vst1_s8(y + 8, d6);
}
#endif // __ARM_NEON__
for (int k = 0; k < remain; ++k) {
float x_temp =
math::Active<Act>(scale * (dequant_scale * x[k]) + bias);
y[k] = math::Round<R>(x_temp * quant_scale);
}
}
}
} else {
// TODO(hjchen2)
max_abs = std::max(max_abs, 1e-6f);
}
param->online_scale_->mutable_data<float>()[0] = max_abs;
}
template <>
bool FusionDequantAddBNQuantKernel<CPU, float>::Init(
FusionDequantAddBNQuantParam<CPU> *param) {
const framework::Tensor *bias = param->bias_;
PublicFusionDequantBNInitParam(param, bias);
return true;
}
template <>
void FusionDequantAddBNQuantKernel<CPU, float>::Compute(
const FusionDequantAddBNQuantParam<CPU> &param) {
switch (param.round_type_) {
case ROUND_NEAREST_TO_EVEN:
DequantBNQuantCompute<IDENTITY, ROUND_NEAREST_TO_EVEN>(&param);
break;
case ROUND_NEAREST_TOWARDS_ZERO:
DequantBNQuantCompute<IDENTITY, ROUND_NEAREST_TOWARDS_ZERO>(&param);
break;
case ROUND_NEAREST_AWAY_ZERO:
DequantBNQuantCompute<IDENTITY, ROUND_NEAREST_AWAY_ZERO>(&param);
break;
default:
LOG(kLOG_ERROR) << "round type is not supported.";
break;
}
}
#endif // FUSION_DEQUANT_ADD_BN_QUANT_OP
#ifdef FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP
template <>
bool FusionDequantAddBNReluQuantKernel<CPU, float>::Init(
FusionDequantAddBNQuantParam<CPU> *param) {
const framework::Tensor *bias = param->bias_;
PublicFusionDequantBNInitParam(param, bias);
return true;
}
template <>
void FusionDequantAddBNReluQuantKernel<CPU, float>::Compute(
const FusionDequantAddBNQuantParam<CPU> &param) {
switch (param.round_type_) {
case ROUND_NEAREST_TO_EVEN:
DequantBNQuantCompute<RELU, ROUND_NEAREST_TO_EVEN>(&param);
break;
case ROUND_NEAREST_TOWARDS_ZERO:
DequantBNQuantCompute<RELU, ROUND_NEAREST_TOWARDS_ZERO>(&param);
break;
case ROUND_NEAREST_AWAY_ZERO:
DequantBNQuantCompute<RELU, ROUND_NEAREST_AWAY_ZERO>(&param);
break;
default:
LOG(kLOG_ERROR) << "round type is not supported.";
break;
}
}
#endif // FUSION_DEQUANT_ADD_BN_RELU_QUANT_OP
} // namespace operators
} // namespace paddle_mobile
......@@ -30,8 +30,8 @@ bool DequantizeKernel<CPU, float>::Init(DequantizeParam<CPU> *param) {
template <>
void DequantizeKernel<CPU, float>::Compute(const DequantizeParam<CPU> &param) {
const Tensor *input = param.input_;
Tensor *output = param.output_;
const LoDTensor *input = param.input_;
LoDTensor *output = param.output_;
float activation_scale = param.activation_scale_->data<float>()[0];
float weight_scale = param.weight_scale_;
const int32_t *x = input->data<const int32_t>();
......@@ -72,6 +72,7 @@ void DequantizeKernel<CPU, float>::Compute(const DequantizeParam<CPU> &param) {
for (size_t i = 0; i < size; ++i) {
y[i] = x[i] * scale;
}
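// Propagate sequence (LoD) information so downstream sequence ops keep
// working on the dequantized output.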
output->set_lod(input->lod());
}
} // namespace operators
......
......@@ -29,12 +29,6 @@ template <>
void GruKernel<CPU, float>::Compute(const GruParam<CPU> &param) {
GruCompute<float>(param);
param.OutHidden()->set_lod(param.InputInput()->lod());
// DLOG << "________________" << param.OutHidden()->dims();
// DLOG << "________________" << param.OutHidden()->numel();
// auto *hiden_data = param.OutHidden()->data<float>();
// for (int64_t i = 0; i < 10; i++) {
// DLOG << "****************" << hiden_data[i];
// }
}
template class GruKernel<CPU, float>;
......
......@@ -15,11 +15,56 @@ limitations under the License. */
#ifdef RELU_OP
#include "operators/kernel/relu_kernel.h"
#include "operators/kernel/central-arm-func/relu_arm_func.h"
#include "common/types.h"
#include "operators/math/activation.h"
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>
#endif
namespace paddle_mobile {
namespace operators {
template <typename Dtype, ActivationType Act>
struct ReluCompute {
void operator()(const Tensor *input, Tensor *output) {}
};
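// The primary template is a no-op for unsupported element types; the
// float specialization below provides the NEON-accelerated path.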
template <ActivationType Act>
struct ReluCompute<float, Act> {
void operator()(const Tensor *input, Tensor *output) {
const float *x = input->data<float>();
float *y = output->mutable_data<float>();
size_t remain = input->numel();
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
size_t loop = remain >> 4;
remain = remain & 0xF;
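// 16 floats per iteration; the tail (numel % 16) is handled by the
// scalar loop after the NEON block.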
#pragma omp parallel for
for (size_t i = 0; i < loop; ++i) {
const float *local_x = x + (i << 4);
float *local_y = y + (i << 4);
float32x4_t r0 = vld1q_f32(local_x);
float32x4_t r1 = vld1q_f32(local_x + 4);
float32x4_t r2 = vld1q_f32(local_x + 8);
float32x4_t r3 = vld1q_f32(local_x + 12);
r0 = math::vActiveq_f32<Act>(r0);
r1 = math::vActiveq_f32<Act>(r1);
r2 = math::vActiveq_f32<Act>(r2);
r3 = math::vActiveq_f32<Act>(r3);
vst1q_f32(local_y, r0);
vst1q_f32(local_y + 4, r1);
vst1q_f32(local_y + 8, r2);
vst1q_f32(local_y + 12, r3);
}
x += (loop << 4);
y += (loop << 4);
#endif
for (size_t i = 0; i < remain; ++i) {
y[i] = math::Active<Act>(x[i]);
}
}
};
template <>
bool ReluKernel<CPU, float>::Init(ReluParam<CPU> *param) {
return true;
......@@ -27,7 +72,21 @@ bool ReluKernel<CPU, float>::Init(ReluParam<CPU> *param) {
template <>
void ReluKernel<CPU, float>::Compute(const ReluParam<CPU> &param) {
ReluCompute<float>(param);
const Tensor *input = param.InputX();
Tensor *output = param.Out();
ReluCompute<float, RELU>()(input, output);
}
template <>
bool Relu6Kernel<CPU, float>::Init(ReluParam<CPU> *param) {
return true;
}
template <>
void Relu6Kernel<CPU, float>::Compute(const ReluParam<CPU> &param) {
const Tensor *input = param.InputX();
Tensor *output = param.Out();
ReluCompute<float, RELU6>()(input, output);
}
} // namespace operators
......
......@@ -13,8 +13,9 @@ See the License for the specific language governing permissions and
limitations under the License. */
#ifdef FUSION_CONVADDADDPRELU_OP
#pragma once
#include <string>
#include <vector>
#include "operators/math/conv_func.h"
#include "operators/math/im2col.h"
......@@ -115,20 +116,7 @@ void ConvAddAddPReluCompute(const FusionConvAddAddPReluParam<CPU> &param) {
Tensor filter_slice = filter.Slice(g * out_step, (g + 1) * out_step);
Tensor bias1_slice = bias1_batch.Slice(g * out_step, (g + 1) * out_step);
float *biase_data1 = bias1_slice.data<float>();
// int n = bias1_slice.dims()[0];
// int m = bias1_slice.dims()[1];
// for(int i=0;i<n*m;i++){
// if(biase_data1[i]!=0)
// DLOG<<biase_data1[i]<<",yangfei";
// }
// math::matmul<float>(filter_slice, false, col_matrix,
// false,
// static_cast<float>(1),
// &out_slice,
// static_cast<float>(1), true,
// biase_data);
math::matmulWithPRelu(filter_slice, false, col_matrix, false, &out_slice,
math::MatMulWithPRelu(filter_slice, false, col_matrix, false, &out_slice,
p, mode, biase_data, biase_data1);
}
}
......@@ -137,4 +125,4 @@ void ConvAddAddPReluCompute(const FusionConvAddAddPReluParam<CPU> &param) {
} // namespace operators
} // namespace paddle_mobile
#endif
#endif // FUSION_CONVADDADDPRELU_OP