Commit 7a79a9cf authored by 李寅

Merge branch 'add-onnx' into 'master'

add onnx

See merge request !902
......@@ -82,7 +82,8 @@ the following projects during the development:
[Caffe](https://github.com/BVLC/caffe),
[SNPE](https://developer.qualcomm.com/software/snapdragon-neural-processing-engine-ai),
[ARM ComputeLibrary](https://github.com/ARM-software/ComputeLibrary),
[ncnn](https://github.com/Tencent/ncnn),
[ONNX](https://github.com/onnx/onnx) and many others: we learned many best
practices from these projects.
Finally, we also thank the Qualcomm, Pinecone and MediaTek engineering teams for
......
......@@ -64,6 +64,9 @@ Optional dependencies
* - FileLock
- pip install -I filelock==3.0.0
- Required when running on Android
* - ONNX
- pip install onnx
- Required for converting ONNX models
.. note::
......
......@@ -72,3 +72,9 @@ Install Caffe (Optional)
-------------------------
Please follow the installation instructions of `Caffe <http://caffe.berkeleyvision.org/installation.html>`__.
Install ONNX (Optional)
-------------------------
Please follow the installation instructions of `ONNX <https://github.com/onnx/onnx#source>`__.
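After installing, a quick sanity check confirms the package imports and can validate a model (a minimal sketch; ``model.onnx`` is a placeholder path):

.. code:: python

    import onnx

    print(onnx.__version__)

    # Load a model and run ONNX's structural checker; raises if the graph is malformed.
    model = onnx.load("model.onnx")
    onnx.checker.check_model(model)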
......@@ -18,8 +18,7 @@ MACE Model
~~~~~~~~~~
MACE defines a customized model format which is similar to
Caffe2. The MACE model can be converted from models exported by TensorFlow, Caffe or ONNX.
MACE Interpreter
~~~~~~~~~~~~~~~~~
......@@ -50,7 +49,7 @@ Build MACE dynamic or static libraries.
3. Convert model
~~~~~~~~~~~~~~~~~~
Convert a TensorFlow, Caffe or ONNX model to a MACE model.
4.1. Deploy
~~~~~~~~~~~~~~~~~~
......@@ -86,7 +85,7 @@ MACE covers common mobile computing devices (CPU, GPU and DSP) and provides
MACE Model
~~~~~~~~~~~~~~~~~~
MACE defines its own model format (similar to Caffe2). Models in Caffe/TensorFlow/ONNX format
can be converted into MACE models with the tools MACE provides.
MACE Interpreter
......@@ -118,7 +117,7 @@ The CPU/GPU/DSP Runtimes hold the operator implementations for each computing device.
3. Convert model
~~~~~~~~~~~~~~~~~~
Convert TensorFlow, Caffe or ONNX models to MACE models.
4.1. Deploy
~~~~~~~~~~~~~~~~~~
......
......@@ -78,6 +78,8 @@ in one deployment file.
- [optional] Specify Numpy validation inputs. When not provided, [-1, 1] random values will be used.
* - validation_threshold
- [optional] Specify the similarity threshold for validation. A dict with key in 'CPU', 'GPU' and/or 'HEXAGON' and value <= 1.0.
* - backend
- [optional] The ONNX backend framework used for validation, one of [tensorflow, caffe2, pytorch]. Defaults to tensorflow.
* - runtime
- The running device, one of [cpu, gpu, dsp, cpu_gpu]. cpu_gpu contains CPU and GPU model definition so you can run the model on both CPU and GPU.
* - data_type
......
......@@ -114,6 +114,19 @@ MACE now supports models from TensorFlow and Caffe (more frameworks will be supp
# Upgrade caffemodel
$CAFFE_ROOT/build/tools/upgrade_net_proto_binary MODEL.caffemodel MODEL.new.caffemodel
- ONNX
Prepare your ONNX model.onnx file.
Use the `ONNX Optimizer Tool <https://github.com/XiaoMi/mace/tree/master/tools/onnx_optimizer.py>`__ to optimize your model for inference.
This tool improves inference efficiency in the same way the `Graph Transform Tool <https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/graph_transforms/README.md>`__
does for TensorFlow.
.. code:: bash
# Optimize your model
python MACE_ROOT/tools/onnx_optimizer.py model.onnx model_opt.onnx
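For reference, the tool is essentially a thin wrapper around ``onnx.optimizer``; a minimal sketch of the same flow (only a subset of the passes is shown here, the full list lives in the script):

.. code:: python

    import onnx
    from onnx import optimizer

    model = onnx.load("model.onnx")
    # Two of the inference-oriented passes the tool applies.
    optimized = optimizer.optimize(model, ["eliminate_identity",
                                           "fuse_bn_into_conv"])
    onnx.save_model(optimized, "model_opt.onnx")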
===========================================
2. Create a deployment file for your model
......@@ -137,6 +150,12 @@ Modify one of them and use it for your own case.
.. literalinclude:: models/demo_models_caffe.yml
:language: yaml
- ONNX
.. literalinclude:: models/demo_models_onnx.yml
:language: yaml
More details about model deployment file are in :doc:`advanced_usage`.
======================
......
# The name of library
library_name: mobilenet
target_abis: [arm64-v8a]
model_graph_format: file
model_data_format: file
models:
mobilenet_v1: # model tag, which will be used in model loading and must be specific.
platform: onnx
# path to your onnx model file. Support local path, http:// and https://
model_file_path: https://cnbj1.fds.api.xiaomi.com/mace/miai-models/mobilenet-v1/mobilenet-v1-1.0.pb
# sha256_checksum of your model's onnx file.
# use this command to get the sha256_checksum: sha256sum path/to/your/model/file
model_sha256_checksum: 71b10f540ece33c49a7b51f5d4095fc9bd78ce46ebf0300487b2ee23d71294e6
# define your model's interface
# if there are multiple inputs or outputs, write them like below:
# subgraphs:
# - input_tensors:
# - input0
# - input1
# input_shapes:
# - 1,224,224,3
# - 1,224,224,3
# output_tensors:
# - output0
# - output1
# output_shapes:
# - 1,1001
# - 1,1001
subgraphs:
- input_tensors:
- input
input_shapes:
- 1,224,224,3
output_tensors:
- MobilenetV1/Predictions/Reshape_1
output_shapes:
- 1,1001
# onnx backend framework for validation. Supports tensorflow/caffe2/pytorch. Default is tensorflow.
backend: tensorflow
# cpu, gpu or cpu+gpu
runtime: cpu+gpu
winograd: 0
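The ``model_sha256_checksum`` above can be produced with ``sha256sum``, or with a short Python equivalent if that command is unavailable (a sketch; the path is a placeholder):

.. code:: python

    import hashlib

    def file_sha256(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        return h.hexdigest()

    print(file_sha256("path/to/your/model/file"))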
......@@ -32,7 +32,8 @@ enum ActivationType {
RELUX = 2,
PRELU = 3,
TANH = 4,
SIGMOID = 5
SIGMOID = 5,
LEAKYRELU = 6,
};
inline ActivationType StringToActivationType(const std::string type) {
......@@ -48,6 +49,8 @@ inline ActivationType StringToActivationType(const std::string type) {
return ActivationType::SIGMOID;
} else if (type == "NOOP") {
return ActivationType::NOOP;
} else if (type == "LEAKYRELU") {
return ActivationType::LEAKYRELU;
} else {
LOG(FATAL) << "Unknown activation type: " << type;
}
......@@ -90,6 +93,13 @@ void DoActivation(const T *input_ptr,
output_ptr[i] = 1 / (1 + std::exp(-input_ptr[i]));
}
break;
case LEAKYRELU:
#pragma omp parallel for schedule(runtime)
  for (index_t i = 0; i < size; ++i) {
    // relux_max_limit doubles as the leaky slope (alpha) here:
    // leaky_relu(x) = max(x, 0) + alpha * min(x, 0)
    output_ptr[i] = std::max(input_ptr[i], static_cast<T>(0))
        + relux_max_limit * std::min(input_ptr[i], static_cast<T>(0));
  }
  break;
default:
LOG(FATAL) << "Unknown activation type: " << type;
}
......@@ -122,6 +132,9 @@ inline void DoActivation(const float *input_ptr,
output_ptr[i] = 1 / (1 + std::exp(-input_ptr[i]));
}
break;
case LEAKYRELU:
LeakyReluNeon(input_ptr, relux_max_limit, size, output_ptr);
break;
default:
LOG(FATAL) << "Unknown activation type: " << type;
}
......
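For reference, the leaky ReLU semantics the kernels above implement can be cross-checked against a small NumPy sketch; ``alpha`` plays the role MACE overloads onto ``relux_max_limit``:

.. code:: python

    import numpy as np

    def leaky_relu(x, alpha=0.1):
        # Positive values pass through; negative values are scaled by alpha.
        return np.maximum(x, 0.0) + alpha * np.minimum(x, 0.0)

    x = np.array([-2.0, -0.5, 0.0, 1.5])
    print(leaky_relu(x))  # [-0.2  -0.05  0.    1.5 ]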
......@@ -27,18 +27,29 @@ template <DeviceType D, class T>
class ArgMaxOp : public Operation {
public:
explicit ArgMaxOp(OpConstructContext *context)
: Operation(context) {}
: Operation(context),
axis_(Operation::GetOptionalArg<int>("axis", 0)),
keep_dims_(Operation::GetOptionalArg<bool>("keepdims", true)),
argmin_(Operation::GetOptionalArg<bool>("argmin", false)) {}
MaceStatus Run(OpContext *context) override {
MACE_UNUSED(context);
const Tensor *input = this->Input(0);
const Tensor *axis = this->InputSize() == 2 ?
this->Input(1) : nullptr;
Tensor *output = this->Output(0);
MACE_CHECK(keep_dims_, "Mace only supports keep_dims ArgMax.");
MACE_CHECK(input->dim_size() > 0, "ArgMax input should not be a scalar");
int axis_value = 0;
if (axis != nullptr) {
MACE_CHECK(axis->dim_size() == 0,
"Mace argmax only supports scalar axis");
Tensor::MappingGuard axis_guard(axis);
axis_value = axis->data<int32_t>()[0];
} else {
axis_value = axis_;
}
if (axis_value < 0) {
axis_value += input->dim_size();
}
......@@ -59,22 +70,43 @@ class ArgMaxOp : public Operation {
index_t outer_size = output->size();
index_t inner_size = input->dim(axis_value);
if (argmin_) {
#pragma omp parallel for schedule(runtime)
  for (index_t i = 0; i < outer_size; ++i) {
    int idx = 0;
    T min_value = std::numeric_limits<T>::max();
    const T *input_ptr = input_data + i * inner_size;
    for (index_t j = 0; j < inner_size; ++j) {
      if (input_ptr[j] < min_value) {
        min_value = input_ptr[j];
        idx = j;
      }
    }
    output_data[i] = idx;
  }
} else {
#pragma omp parallel for schedule(runtime)
  for (index_t i = 0; i < outer_size; ++i) {
    int idx = 0;
    T max_value = std::numeric_limits<T>::lowest();
    const T *input_ptr = input_data + i * inner_size;
    for (index_t j = 0; j < inner_size; ++j) {
      if (input_ptr[j] > max_value) {
        max_value = input_ptr[j];
        idx = j;
      }
    }
    output_data[i] = idx;
  }
}
return MaceStatus::MACE_SUCCESS;
}
protected:
const int axis_;
bool keep_dims_;
bool argmin_;
};
......
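The semantics of the extended op (optional ``axis`` input or attribute, negative-axis wrapping, and the new ``argmin`` flag) match this NumPy sketch:

.. code:: python

    import numpy as np

    def arg_max_op(x, axis=0, argmin=False):
        # Negative axes wrap, mirroring `axis_value += input->dim_size()`.
        if axis < 0:
            axis += x.ndim
        return np.argmin(x, axis=axis) if argmin else np.argmax(x, axis=axis)

    x = np.array([[3, 1, 2], [0, 5, 4]])
    print(arg_max_op(x, axis=1))               # [0 1]
    print(arg_max_op(x, axis=1, argmin=True))  # [1 0]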
......@@ -67,5 +67,29 @@ void ReluxNeon(const float *input, const float limit,
#endif
}
void LeakyReluNeon(const float *input, const float alpha,
                   const index_t size, float *output) {
  // leaky_relu(x) = max(x, 0) + alpha * min(x, 0)
#if defined(MACE_ENABLE_NEON)
  float32x4_t vzero = vdupq_n_f32(0.f);
  float32x4_t valpha = vdupq_n_f32(alpha);
#pragma omp parallel for schedule(runtime)
  for (index_t i = 0; i <= size - 4; i += 4) {
    float32x4_t v = vld1q_f32(input + i);
    float32x4_t pos = vmaxq_f32(v, vzero);
    float32x4_t neg = vminq_f32(v, vzero);
    v = vmlaq_f32(pos, neg, valpha);  // pos + neg * alpha
    vst1q_f32(output + i, v);
  }
  // remain
  for (index_t i = (size >> 2) << 2; i < size; ++i) {
    output[i] = std::max(input[i], 0.f) + alpha * std::min(input[i], 0.f);
  }
#else
#pragma omp parallel for schedule(runtime)
  for (index_t i = 0; i < size; ++i) {
    output[i] = std::max(input[i], 0.f) + alpha * std::min(input[i], 0.f);
  }
#endif
}
} // namespace ops
} // namespace mace
......@@ -25,6 +25,9 @@ void ReluNeon(const float *input, const index_t size, float *output);
void ReluxNeon(const float *input, const float limit,
const index_t size, float *output);
void LeakyReluNeon(const float *input, const float alpha,
const index_t size, float *output);
} // namespace ops
} // namespace mace
......
......@@ -43,6 +43,7 @@ class PoolingKernel : public OpenCLPoolingKernel {
const Padding &padding_type,
const std::vector<int> &padding_data,
const int *dilations,
const RoundType round_type,
Tensor *output) override;
private:
......@@ -62,6 +63,7 @@ MaceStatus PoolingKernel<T>::Compute(
const Padding &padding_type,
const std::vector<int> &padding_data,
const int *dilations,
const RoundType round_type,
Tensor *output) {
MACE_CHECK(dilations[0] == 1 && dilations[1] == 1)
<< "Pooling opencl kernel not support dilation yet";
......@@ -82,7 +84,7 @@ MaceStatus PoolingKernel<T>::Compute(
} else {
paddings = padding_data;
CalcOutputSize(input->shape().data(), filter_shape.data(),
padding_data.data(), dilations, strides, round_type,
output_shape.data());
}
......
......@@ -102,6 +102,9 @@ inline DATA_TYPE4 do_activation(DATA_TYPE4 in,
#endif
#ifdef USE_SIGMOID
out = do_sigmoid(in);
#endif
#ifdef USE_LEAKYRELU
out = fmax(in, (DATA_TYPE)0) + relux_max_limit * fmin(in, (DATA_TYPE)0);
#endif
return out;
}
......
#include <common.h>
__kernel void reduce(OUT_OF_RANGE_PARAMS
GLOBAL_WORK_GROUP_SIZE_DIM3
__read_only image2d_t input,
__local float4 *group_sum,
__private const int group_size,
__private const int partial_len,
__private const int remain_index,
__private const int batch,
__private const int in_height,
__private const int in_width,
__private const float image_size_reciprocal,
__private const int channel_blocks,
__write_only image2d_t output) {
const int i = get_local_id(0);
const int j = get_local_id(1);
const int k = get_global_id(2);
......@@ -22,12 +22,22 @@ __kernel void reduce_mean(OUT_OF_RANGE_PARAMS
return;
#endif
const int dim0_size = get_local_size(0);
const int index = mad24(j, dim0_size, i);
const int b = k / channel_blocks;
const int ch = mad24(b, -channel_blocks, k);
DATA_TYPE4 in;
#if REDUCE_TYPE == 1
float4 tmp = (float4){MAXFLOAT, MAXFLOAT, MAXFLOAT, MAXFLOAT};
#elif REDUCE_TYPE == 2
float4 tmp = (float4){-MAXFLOAT, -MAXFLOAT, -MAXFLOAT, -MAXFLOAT};
#elif REDUCE_TYPE == 3
float4 tmp = (float4){1, 1, 1, 1};
#else
float4 tmp = (float4){0, 0, 0, 0};
#endif
const int valid_part_len = select(partial_len,
partial_len - 1,
remain_index > 0 && index >= remain_index);
......@@ -43,19 +53,51 @@ __kernel void reduce_mean(OUT_OF_RANGE_PARAMS
int pos_x = mad24(ch, in_width, w_id);
int pos_y = mad24(b, in_height, h_id);
in = READ_IMAGET(input, SAMPLER, (int2)(pos_x, pos_y));
// MIN
#if REDUCE_TYPE == 1
tmp = fmin(tmp, in);
// MAX
#elif REDUCE_TYPE == 2
tmp = fmax(tmp, in);
// PROD
#elif REDUCE_TYPE == 3
tmp = tmp * in;
// MEAN
#else
tmp = tmp + in;
#endif
}
#if REDUCE_TYPE == 0
tmp = tmp * image_size_reciprocal;
#endif
group_sum[index] = tmp;
#ifdef NON_QUALCOMM_ADRENO
barrier(CLK_LOCAL_MEM_FENCE);
#endif
if (i == 0 && j == 0) {
#if REDUCE_TYPE == 1
DATA_TYPE4 out = (DATA_TYPE4){MAXFLOAT, MAXFLOAT, MAXFLOAT, MAXFLOAT};
#elif REDUCE_TYPE == 2
DATA_TYPE4 out = (DATA_TYPE4){-MAXFLOAT, -MAXFLOAT, -MAXFLOAT, -MAXFLOAT};
#elif REDUCE_TYPE == 3
DATA_TYPE4 out = (DATA_TYPE4){1, 1, 1, 1};
#else
DATA_TYPE4 out = (DATA_TYPE4){0, 0, 0, 0};
#endif
#pragma unroll
for (int l = 0; l < group_size; ++l) {
#if REDUCE_TYPE == 1
out = fmin(out, group_sum[l]);
#elif REDUCE_TYPE == 2
out = fmax(out, group_sum[l]);
#elif REDUCE_TYPE == 3
out = out * group_sum[l];
#else
out = out + group_sum[l];
#endif
}
WRITE_IMAGET(output, (int2)(ch, b), out);
}
......
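The identity value chosen for ``tmp`` matches each reduction (+inf for MIN, -inf for MAX, 1 for PROD, 0 for MEAN), and the kernel combines per-thread partials in a second stage. A Python sketch of the same scheme, assuming the ``REDUCE_TYPE`` codes used above (0=MEAN, 1=MIN, 2=MAX, 3=PROD):

.. code:: python

    import numpy as np

    IDENTITY = {0: 0.0, 1: np.inf, 2: -np.inf, 3: 1.0}
    COMBINE = {0: np.add, 1: np.minimum, 2: np.maximum, 3: np.multiply}

    def two_stage_reduce(values, reduce_type, num_groups=4):
        combine = COMBINE[reduce_type]
        partials = np.full(num_groups, IDENTITY[reduce_type])
        for i, v in enumerate(values):       # stage 1: per-group partials
            g = i % num_groups
            partials[g] = combine(partials[g], v)
        out = IDENTITY[reduce_type]
        for p in partials:                   # stage 2: combine the partials
            out = combine(out, p)
        return out / len(values) if reduce_type == 0 else out

    vals = np.array([4.0, 1.0, 3.0, 2.0])
    print([two_stage_reduce(vals, t) for t in range(4)])  # [2.5, 1.0, 4.0, 24.0]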
......@@ -99,6 +99,10 @@ MaceStatus ActivationKernel<T>::Compute(
tuning_key_prefix_ = "sigmoid_opencl_kernel";
built_options.emplace("-DUSE_SIGMOID");
break;
case LEAKYRELU:
tuning_key_prefix_ = "leakyrelu_opencl_kernel";
built_options.emplace("-DUSE_LEAKYRELU");
break;
default:
LOG(FATAL) << "Unknown activation type: " << activation_;
}
......
......@@ -69,6 +69,7 @@ class PoolingKernel : public OpenCLPoolingKernel {
const Padding &padding_type,
const std::vector<int> &padding_data,
const int *dilations,
const RoundType round_type,
Tensor *output) override;
private:
......@@ -87,6 +88,7 @@ MaceStatus PoolingKernel<T>::Compute(
const Padding &padding_type,
const std::vector<int> &padding_data,
const int *dilations,
const RoundType round_type,
Tensor *output) {
MACE_CHECK(dilations[0] == 1 && dilations[1] == 1)
<< "Pooling opencl kernel not support dilation yet";
......@@ -103,7 +105,7 @@ MaceStatus PoolingKernel<T>::Compute(
} else {
paddings = padding_data;
CalcOutputSize(input->shape().data(), filter_shape.data(),
padding_data.data(), dilations, strides, round_type,
output_shape.data());
}
......
......@@ -11,10 +11,10 @@
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef MACE_OPS_OPENCL_IMAGE_REDUCE_H_
#define MACE_OPS_OPENCL_IMAGE_REDUCE_H_
#include "mace/ops/opencl/reduce_mean.h"
#include "mace/ops/opencl/reduce.h"
#include <memory>
#include <set>
......@@ -24,6 +24,7 @@
#include "mace/core/op_context.h"
#include "mace/core/tensor.h"
#include "mace/ops/opencl/helper.h"
#include "mace/ops/reduce.h"
namespace mace {
namespace ops {
......@@ -31,11 +32,12 @@ namespace opencl {
namespace image {
template <typename T>
class ReduceKernel : public OpenCLReduceKernel {
public:
ReduceKernel(ReduceType type,
const std::vector<int> axis,
const bool keep_dims)
: reduce_type_(type), axis_(axis), keep_dims_(keep_dims) {}
MaceStatus Compute(
OpContext *context,
......@@ -43,6 +45,7 @@ class ReduceMeanKernel : public OpenCLReduceMeanKernel {
Tensor *output) override;
private:
ReduceType reduce_type_;
const std::vector<int> axis_;
bool keep_dims_;
cl::Kernel kernel_;
......@@ -51,16 +54,16 @@ class ReduceMeanKernel : public OpenCLReduceMeanKernel {
};
template <typename T>
MaceStatus ReduceKernel<T>::Compute(
OpContext *context,
const Tensor *input,
Tensor *output) {
MACE_CHECK_NOTNULL(input);
MACE_CHECK(keep_dims_, "reduce mean gpu only support keep dims.");
MACE_CHECK(input->dim_size() == 4,
"reduce mean gpu only support 4-dim input");
"reduce gpu only support 4-dim input");
MACE_CHECK(axis_.size() == 2 && axis_[0] == 1 && axis_[1] == 2,
"reduce mean gpu only support 1,2-axis reduce");
"reduce gpu only support 1,2-axis reduce");
index_t batch = input->dim(0);
const index_t in_height = input->dim(1);
const index_t in_width = input->dim(2);
......@@ -84,14 +87,15 @@ MaceStatus ReduceMeanKernel<T>::Compute(
std::set<std::string> built_options;
MACE_OUT_OF_RANGE_CONFIG;
MACE_NON_UNIFORM_WG_CONFIG;
std::string kernel_name = MACE_OBFUSCATE_SYMBOL("reduce_mean");
built_options.emplace("-Dreduce_mean=" + kernel_name);
std::string kernel_name = MACE_OBFUSCATE_SYMBOL("reduce");
built_options.emplace("-Dreduce=" + kernel_name);
built_options.emplace("-DDATA_TYPE=" + DtToUpCompatibleCLDt(dt));
built_options.emplace("-DCMD_DATA_TYPE=" + DtToUpCompatibleCLCMDDt(dt));
built_options.emplace(MakeString("-DREDUCE_TYPE=", reduce_type_));
if (runtime->gpu_type() != GPUType::QUALCOMM_ADRENO) {
built_options.emplace("-DNON_QUALCOMM_ADRENO");
}
MACE_RETURN_IF_ERROR(runtime->BuildKernel("reduce_mean",
MACE_RETURN_IF_ERROR(runtime->BuildKernel("reduce",
kernel_name,
built_options,
&kernel_));
......@@ -170,4 +174,4 @@ MaceStatus ReduceMeanKernel<T>::Compute(
} // namespace ops
} // namespace mace
#endif // MACE_OPS_OPENCL_IMAGE_REDUCE_H_
......@@ -36,6 +36,7 @@ class OpenCLPoolingKernel {
const Padding &padding_type,
const std::vector<int> &padding_data,
const int *dilations,
const RoundType round_type,
Tensor *output) = 0;
MACE_EMPTY_VIRTUAL_DESTRUCTOR(OpenCLPoolingKernel);
};
......
......@@ -12,8 +12,8 @@
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef MACE_OPS_OPENCL_REDUCE_H_
#define MACE_OPS_OPENCL_REDUCE_H_
#include "mace/public/mace.h"
#include "mace/utils/utils.h"
......@@ -24,16 +24,16 @@ class OpContext;
class Tensor;
namespace ops {
class OpenCLReduceKernel {
public:
virtual MaceStatus Compute(
OpContext *context,
const Tensor *input,
Tensor *output) = 0;
MACE_EMPTY_VIRTUAL_DESTRUCTOR(OpenCLReduceKernel);
};
} // namespace ops
} // namespace mace
#endif // MACE_OPS_OPENCL_REDUCE_H_
......@@ -44,7 +44,7 @@ extern void RegisterLocalResponseNorm(OpRegistryBase *op_registry);
extern void RegisterMatMul(OpRegistryBase *op_registry);
extern void RegisterPad(OpRegistryBase *op_registry);
extern void RegisterPooling(OpRegistryBase *op_registry);
extern void RegisterReduce(OpRegistryBase *op_registry);
extern void RegisterReshape(OpRegistryBase *op_registry);
extern void RegisterResizeBicubic(OpRegistryBase *op_registry);
extern void RegisterResizeBilinear(OpRegistryBase *op_registry);
......@@ -102,7 +102,7 @@ OpRegistry::OpRegistry() : OpRegistryBase() {
ops::RegisterMatMul(this);
ops::RegisterPad(this);
ops::RegisterPooling(this);
ops::RegisterReduce(this);
ops::RegisterReshape(this);
ops::RegisterResizeBicubic(this);
ops::RegisterResizeBilinear(this);
......
......@@ -43,11 +43,14 @@ class PoolingOpBase : public ConvPool2dOpBase {
kernels_(Operation::GetRepeatedArgs<int>("kernels")),
pooling_type_(
static_cast<PoolingType>(Operation::GetOptionalArg<int>(
"pooling_type", static_cast<int>(AVG)))) {}
"pooling_type", static_cast<int>(AVG)))),
round_type_(static_cast<RoundType>(Operation::GetOptionalArg<int>(
"round_mode", static_cast<int>(CEIL)))) {}
protected:
std::vector<int> kernels_;
PoolingType pooling_type_;
RoundType round_type_;
MACE_OP_INPUT_TAGS(INPUT);
MACE_OP_OUTPUT_TAGS(OUTPUT);
......@@ -82,7 +85,7 @@ class PoolingOp<DeviceType::CPU, float> : public PoolingOpBase {
paddings_.data(),
dilations_.data(),
strides_.data(),
round_type_,
output_shape.data());
}
MACE_RETURN_IF_ERROR(output_tensor->Resize(output_shape));
......@@ -255,7 +258,7 @@ class PoolingOp<DeviceType::CPU, uint8_t> : public PoolingOpBase {
paddings_.data(),
dilations_.data(),
strides_.data(),
round_type_,
output_shape.data());
}
MACE_RETURN_IF_ERROR(output_tensor->Resize(output_shape));
......@@ -442,7 +445,7 @@ class PoolingOp<DeviceType::GPU, T> : public PoolingOpBase {
return kernel_->Compute(context, input, pooling_type_, kernels_.data(),
strides_.data(), padding_type_, paddings_,
dilations_.data(), round_type_, output);
}
private:
......
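The new ``round_type`` plumbing matters because frameworks disagree on pooling output sizes: TensorFlow-style CEIL rounding can yield one more output element than ONNX's default FLOOR. A sketch of the two rules (simplified to a single spatial dimension, with total padding folded into ``pad``):

.. code:: python

    import math

    def pooled_output_size(in_size, kernel, stride, pad, ceil_mode):
        extent = in_size + pad - kernel
        rounded = math.ceil(extent / stride) if ceil_mode \
            else math.floor(extent / stride)
        return rounded + 1

    print(pooled_output_size(7, 2, 2, 0, ceil_mode=True))   # 4 (CEIL)
    print(pooled_output_size(7, 2, 2, 0, ceil_mode=False))  # 3 (FLOOR)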
// Copyright 2018 Xiaomi, Inc. All rights reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef MACE_OPS_REDUCE_H_
#define MACE_OPS_REDUCE_H_
namespace mace {
enum ReduceType {
// SUM = 0,
MEAN = 0,
MIN = 1,
MAX = 2,
PROD = 3,
// SUM_SQR = 4,
// SQR_MEAN = 5,
};
} // namespace mace
#endif // MACE_OPS_REDUCE_H_
......@@ -21,7 +21,7 @@ namespace test {
namespace {
template <DeviceType D, typename T>
void Reduce(int iters, int batch, int channels,
int height, int width) {
mace::testing::StopTiming();
......@@ -34,7 +34,7 @@ void ReduceMean(int iters, int batch, int channels,
net.AddRandomInput<D, T>("Input", {batch, channels, height, width});
}
OpDefBuilder("ReduceMean", "ReduceMeanBM")
OpDefBuilder("Reduce", "ReduceBM")
.Input("Input")
.AddIntsArg("axis", axis)
.Output("OutputImage")
......@@ -55,30 +55,30 @@ void ReduceMean(int iters, int batch, int channels,
}
} // namespace
#define MACE_BM_REDUCE_MACRO(N, C, H, W, TYPE, DEVICE) \
static void \
MACE_BM_REDUCE_##N##_##C##_##H##_##W##_##TYPE##_##DEVICE(\
int iters) { \
const int64_t tot = static_cast<int64_t>(iters) * N * C * H * W; \
mace::testing::MaccProcessed(tot); \
mace::testing::BytesProcessed(tot *(sizeof(TYPE))); \
Reduce<DEVICE, TYPE>(iters, N, C, H, W); \
} \
MACE_BENCHMARK( \
MACE_BM_REDUCE_##N##_##C##_##H##_##W##_##TYPE##_##DEVICE)
#define MACE_BM_REDUCE(N, C, H, W) \
MACE_BM_REDUCE_MACRO(N, C, H, W, float, GPU); \
MACE_BM_REDUCE_MACRO(N, C, H, W, half, GPU); \
MACE_BM_REDUCE_MACRO(N, C, H, W, float, CPU);
MACE_BM_REDUCE(1, 1, 512, 512);
MACE_BM_REDUCE(4, 3, 128, 128);
MACE_BM_REDUCE(4, 1, 512, 512);
MACE_BM_REDUCE(16, 32, 112, 112);
MACE_BM_REDUCE(8, 64, 256, 256);
MACE_BM_REDUCE(1, 32, 480, 640);
} // namespace test
......
// Copyright 2018 Xiaomi, Inc. All rights reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <algorithm>
#include <memory>
#include <vector>
#include "mace/core/future.h"
#include "mace/core/operator.h"
#include "mace/core/tensor.h"
#ifdef MACE_ENABLE_OPENCL
#include "mace/ops/opencl/image/reduce_mean.h"
#endif // MACE_ENABLE_OPENCL
namespace mace {
namespace ops {
class ReduceMeanOpBase : public Operation {
public:
explicit ReduceMeanOpBase(OpConstructContext *context)
: Operation(context),
axis_(Operation::GetRepeatedArgs<int>("axis")),
keep_dims_(Operation::GetOptionalArg<bool>("keepdims", false)) {
}
protected:
inline void Validate() {
const Tensor *input = this->Input(0);
const int left = static_cast<int>(input->dim_size() * -1);
const int right = static_cast<int>(input->dim_size());
if (axis_.size()) {
for (unsigned int i = 0; i < axis_.size(); ++i) {
MACE_CHECK(axis_[i] > left && axis_[i] < right, "Axis out of range.");
}
}
}
protected:
std::vector<int> axis_;
bool keep_dims_;
};
template <DeviceType D, class T>
class ReduceMeanOp;
template <typename T>
class ReduceMeanOp<DeviceType::CPU, T> : public ReduceMeanOpBase {
public:
explicit ReduceMeanOp(OpConstructContext *context)
: ReduceMeanOpBase(context) {
}
MaceStatus Run(OpContext *context) override {
MACE_UNUSED(context);
Validate();
const Tensor *input = this->Input(0);
Tensor *output = this->Output(0);
Simplify(input);
output->Resize(out_shape_);
Compute(input, output);
return MaceStatus::MACE_SUCCESS;
}
private:
void Simplify(const Tensor *input) {
std::vector<bool> bitmap(static_cast<uint32_t>(input->dim_size()), false);
if (axis_.size() == 0) {
for (int i = 0; i < input->dim_size(); ++i) {
bitmap[i] = true;
}
} else {
for (unsigned int i = 0; i < axis_.size(); ++i) {
int index = axis_[i] >= 0 ?
axis_[i] :
axis_[i] + input->dim_size();
// axis format is NHWC
if (input->dim_size() == 4) {
if (index == 1) index = 2;
else if (index == 2) index = 3;
else if (index == 3) index = 1;
}
bitmap[index] = true;
}
}
out_shape_.clear();
for (unsigned int i = 0; i < input->dim_size(); ++i) {
if (!bitmap[i]) {
out_shape_.push_back(input->dim(i));
} else if (keep_dims_) {
out_shape_.push_back(1);
}
}
data_reshape_.clear();
unsigned int dim_index = 0;
for (; dim_index < input->dim_size(); ++dim_index) {
if (input->dim(dim_index) != 1) break;
}
if (dim_index >= input->dim_size()) {
reduce_first_axis_ = true;
} else {
reduce_first_axis_ = bitmap[dim_index];
data_reshape_.push_back(input->dim(dim_index));
++dim_index;
for (; dim_index < input->dim_size(); ++dim_index) {
const int n = input->dim(dim_index);
if (n == 1) {
bitmap[dim_index] = bitmap[dim_index - 1];
}
if (bitmap[dim_index-1] != bitmap[dim_index]) {
data_reshape_.push_back(n);
} else {
data_reshape_.back() *= n;
}
}
}
}
void Compute(const Tensor *input, Tensor *output) {
Tensor::MappingGuard input_mapper(input);
const T *input_ptr = input->data<T>();
Tensor::MappingGuard output_map(output);
T *output_ptr = output->mutable_data<T>();
memset(output_ptr, 0, output->size() * sizeof(T));
switch (data_reshape_.size()) {
case 1:
if (reduce_first_axis_) {
T sum = 0;
for (int i = 0; i < data_reshape_[0]; ++i) {
sum = sum + input_ptr[i];
}
output_ptr[0] = sum / data_reshape_[0];
} else {
#pragma omp parallel for schedule(runtime)
for (int i = 0; i < data_reshape_[0]; ++i) {
output_ptr[i] = input_ptr[i];
}
}
break;
case 2:
if (reduce_first_axis_) {
#pragma omp parallel for schedule(runtime)
for (int i = 0; i < data_reshape_[1]; ++i) {
for (int j = 0; j < data_reshape_[0]; ++j) {
output_ptr[i] += input_ptr[j * data_reshape_[1] + i];
}
output_ptr[i] /= data_reshape_[0];
}
} else {
#pragma omp parallel for schedule(runtime)
for (int i = 0; i < data_reshape_[0]; ++i) {
for (int j = 0; j < data_reshape_[1]; ++j) {
output_ptr[i] += input_ptr[i * data_reshape_[1] + j];
}
output_ptr[i] /= data_reshape_[1];
}
}
break;
case 3:
if (reduce_first_axis_) {
#pragma omp parallel for schedule(runtime)
for (int i = 0; i < data_reshape_[1]; ++i) {
for (int j = 0; j < data_reshape_[2]; ++j) {
for (int k = 0; k < data_reshape_[0]; ++k) {
output_ptr[i] +=
input_ptr[(k * data_reshape_[1] + i) * data_reshape_[2]
+ j];
}
}
output_ptr[i] /= (data_reshape_[0] * data_reshape_[2]);
}
} else {
#pragma omp parallel for collapse(2) schedule(runtime)
for (int i = 0; i < data_reshape_[0]; ++i) {
for (int j = 0; j < data_reshape_[2]; ++j) {
for (int k = 0; k < data_reshape_[1]; ++k) {
output_ptr[i * data_reshape_[2] + j] +=
input_ptr[(i * data_reshape_[1] + k) * data_reshape_[2]
+ j];
}
output_ptr[i * data_reshape_[2] + j] /= data_reshape_[1];
}
}
}
break;
case 4:
if (reduce_first_axis_) {
#pragma omp parallel for collapse(2) schedule(runtime)
for (int i = 0; i < data_reshape_[1]; ++i) {
for (int j = 0; j < data_reshape_[3]; ++j) {
for (int k = 0; k < data_reshape_[2]; ++k) {
for (int t = 0; t < data_reshape_[0]; ++t) {
output_ptr[i * data_reshape_[3] + j] +=
input_ptr[((t * data_reshape_[1] + i) *
data_reshape_[2] + k)*data_reshape_[3] + j];
}
}
output_ptr[i * data_reshape_[3] + j] /=
(data_reshape_[0] * data_reshape_[2]);
}
}
} else {
#pragma omp parallel for collapse(2) schedule(runtime)
for (int i = 0; i < data_reshape_[0]; ++i) {
for (int j = 0; j < data_reshape_[2]; ++j) {
for (int k = 0; k < data_reshape_[1]; ++k) {
for (int t = 0; t < data_reshape_[3]; ++t) {
output_ptr[i * data_reshape_[2] + j] +=
input_ptr[((i * data_reshape_[1] + k) *
data_reshape_[2] + j)*data_reshape_[3] + t];
}
}
output_ptr[i * data_reshape_[2] + j] /=
(data_reshape_[1] * data_reshape_[3]);
}
}
}
break;
default:
MACE_CHECK(false, "not implemented in mace")
<< "data reshape size" << data_reshape_.size()
<< "reduce first axis:" << reduce_first_axis_;
break;
}
}
private:
bool reduce_first_axis_;
std::vector<int> data_reshape_;
std::vector<index_t> out_shape_;
};
#ifdef MACE_ENABLE_OPENCL
template <typename T>
class ReduceMeanOp<DeviceType::GPU, T> : public ReduceMeanOpBase {
public:
explicit ReduceMeanOp(OpConstructContext *context)
: ReduceMeanOpBase(context) {
if (context->device()->gpu_runtime()->UseImageMemory()) {
kernel_.reset(new opencl::image::ReduceMeanKernel<T>(axis_, keep_dims_));
} else {
MACE_NOT_IMPLEMENTED;
}
}
MaceStatus Run(OpContext *context) override {
Validate();
const Tensor *input = this->Input(0);
Tensor *output = this->Output(0);
return kernel_->Compute(context, input, output);
}
private:
std::unique_ptr<OpenCLReduceMeanKernel> kernel_;
};
#endif // MACE_ENABLE_OPENCL
void RegisterReduceMean(OpRegistryBase *op_registry) {
MACE_REGISTER_OP(op_registry, "ReduceMean", ReduceMeanOp,
DeviceType::CPU, float);
#ifdef MACE_ENABLE_OPENCL
MACE_REGISTER_OP(op_registry, "ReduceMean", ReduceMeanOp,
DeviceType::GPU, float);
MACE_REGISTER_OP(op_registry, "ReduceMean", ReduceMeanOp,
DeviceType::GPU, half);
#endif // MACE_ENABLE_OPENCL
}
} // namespace ops
} // namespace mace
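The ``Simplify`` pass above reduces the general N-D mean to at most four collapsed dimensions by merging adjacent dims that are uniformly reduced or kept (size-1 dims inherit their neighbour's status). A simplified Python sketch of the bitmap-and-merge idea (it omits the leading size-1 handling and the NHWC axis remapping of the C++ code):

.. code:: python

    def simplify_reduce(shape, axes, keep_dims=False):
        bitmap = [False] * len(shape)
        for a in (axes or range(len(shape))):
            bitmap[a if a >= 0 else a + len(shape)] = True
        out_shape = [1 if bitmap[i] else d
                     for i, d in enumerate(shape)
                     if not bitmap[i] or keep_dims]
        data_reshape, flags = [], []
        for i, d in enumerate(shape):
            if d == 1 and flags:
                bitmap[i] = flags[-1]      # size-1 dims inherit the run's flag
            if flags and bitmap[i] == flags[-1]:
                data_reshape[-1] *= d      # merge into the previous run
            else:
                data_reshape.append(d)
                flags.append(bitmap[i])
        return out_shape, data_reshape, flags

    print(simplify_reduce([2, 3, 4, 5], [1, 2]))
    # ([2, 5], [2, 12, 5], [False, True, False])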
// Copyright 2018 Xiaomi, Inc. All rights reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include "mace/ops/ops_test_util.h"
namespace mace {
namespace ops {
namespace test {
class ReduceMeanOpTest : public OpsTestBase {};
namespace {
template <DeviceType D>
void Simple(const std::vector<index_t> &input_shape,
const std::vector<float> &input,
const std::vector<int> &axis,
const std::vector<index_t> &output_shape,
const std::vector<float> &output,
const bool keepdims = true) {
// Construct graph
OpsTestNet net;
// Add input data
net.AddInputFromArray<D, float>("Input", input_shape, input);
if (D == DeviceType::CPU) {
net.TransformDataFormat<D, float>("Input", NHWC, "InputNCHW", NCHW);
OpDefBuilder("ReduceMean", "ReduceMeanTest")
.Input("InputNCHW")
.AddIntsArg("axis", axis)
.AddIntArg("keepdims", keepdims ? 1 : 0)
.Output("OutputNCHW")
.Finalize(net.NewOperatorDef());
// Run
net.RunOp(D);
net.TransformDataFormat<D, float>("OutputNCHW", NCHW, "Output", NHWC);
} else {
OpDefBuilder("ReduceMean", "ReduceMeanTest")
.Input("Input")
.AddIntsArg("axis", axis)
.AddIntArg("keepdims", keepdims ? 1 : 0)
.Output("Output")
.Finalize(net.NewOperatorDef());
// Run
net.RunOp(D);
}
auto expected = net.CreateTensor<float>(output_shape, output);
ExpectTensorNear<float>(*expected, *net.GetOutput("Output"), 1e-5, 1e-3);
}
template <DeviceType D>
void Simple3D(const std::vector<index_t> &input_shape,
const std::vector<float> &input,
const std::vector<int> &axis,
const std::vector<index_t> &output_shape,
const std::vector<float> &output,
const bool keepdims = true) {
// Construct graph
OpsTestNet net;
// Add input data
net.AddInputFromArray<D, float>("Input", input_shape, input);
OpDefBuilder("ReduceMean", "ReduceMeanTest")
.Input("Input")
.AddIntsArg("axis", axis)
.AddIntArg("keepdims", keepdims ? 1 : 0)
.Output("Output")
.Finalize(net.NewOperatorDef());
// Run
net.RunOp(D);
auto expected = net.CreateTensor<float>(output_shape, output);
ExpectTensorNear<float>(*expected, *net.GetOutput("Output"), 1e-5, 1e-3);
}
template <DeviceType D>
void Simple12Test() {
Simple<D>({2, 2, 3, 4},
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23},
{1, 2},
{2, 1, 1, 4},
{10, 11, 12, 13,
10, 11, 12, 13});
}
template <DeviceType D>
void Simple1Axis() {
Simple<D>({2, 2, 3, 4},
{0, 1, 2, 3,
4, 5, 6, 7,
8, 9, 10, 11,
12, 13, 14, 15,
16, 17, 18, 19,
20, 21, 22, 23,
0, 1, 2, 3,
4, 5, 6, 7,
8, 9, 10, 11,
12, 13, 14, 15,
16, 17, 18, 19,
20, 21, 22, 23},
{1},
{2, 1, 3, 4},
{6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17});
Simple<D>({1, 2, 3, 4},
{0, 1, 2, 3,
4, 5, 6, 7,
8, 9, 10, 11,
12, 13, 14, 15,
16, 17, 18, 19,
20, 21, 22, 23},
{-3},
{1, 1, 3, 4},
{6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17});
Simple<D>({1, 2, 3, 4},
{0, 1, 2, 3,
4, 5, 6, 7,
8, 9, 10, 11,
12, 13, 14, 15,
16, 17, 18, 19,
20, 21, 22, 23},
{2},
{1, 2, 1, 4},
{4, 5, 6, 7, 16, 17, 18, 19});
Simple<D>({1, 2, 3, 4},
{0, 1, 2, 3,
4, 5, 6, 7,
8, 9, 10, 11,
12, 13, 14, 15,
16, 17, 18, 19,
20, 21, 22, 23},
{-1},
{1, 2, 3, 1},
{1.5, 5.5, 9.5, 13.5, 17.5, 21.5});
Simple<D>({1, 3, 3, 3},
{0, 1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26},
{1},
{1, 1, 3, 3},
{9, 10, 11, 12, 13, 14, 15, 16, 17});
Simple<D>({1, 3, 3, 3},
{0, 1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26},
{-2},
{1, 3, 1, 3},
{3, 4, 5, 12, 13, 14, 21, 22, 23});
Simple<D>({1, 3, 3, 3},
{0, 1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26},
{3},
{1, 3, 3, 1},
{1, 4, 7, 10, 13, 16, 19, 22, 25});
}
template <DeviceType D>
void Simple2Axis() {
Simple<D>({1, 2, 3, 4},
{0, 1, 2, 3,
4, 5, 6, 7,
8, 9, 10, 11,
12, 13, 14, 15,
16, 17, 18, 19,
20, 21, 22, 23},
{0, 1},
{1, 1, 3, 4},
{6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17});
Simple<D>({1, 2, 3, 4},
{0, 1, 2, 3,
4, 5, 6, 7,
8, 9, 10, 11,
12, 13, 14, 15,
16, 17, 18, 19,
20, 21, 22, 23},
{0, 2},
{1, 2, 1, 4},
{4, 5, 6, 7, 16, 17, 18, 19});
Simple<D>({1, 2, 3, 4},
{0, 1, 2, 3,
4, 5, 6, 7,
8, 9, 10, 11,
12, 13, 14, 15,
16, 17, 18, 19,
20, 21, 22, 23},
{1, 3},
{1, 1, 3, 1},
{7.5, 11.5, 15.5});
Simple<D>({1, 3, 3, 3},
{0, 1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26},
{1, 2},
{1, 1, 1, 3},
{12, 13, 14});
Simple<D>({1, 3, 3, 3},
{0, 1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26},
{0, 1},
{1, 1, 3, 3},
{9, 10, 11, 12, 13, 14, 15, 16, 17});
Simple<D>({1, 3, 3, 3},
{0, 1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26},
{2, 3},
{1, 3, 1, 1},
{4, 13, 22});
}
template <DeviceType D>
void Simple2Axis3D() {
Simple3D<D>({2, 3, 4},
{0, 1, 2, 3,
4, 5, 6, 7,
8, 9, 10, 11,
12, 13, 14, 15,
16, 17, 18, 19,
20, 21, 22, 23},
{0, 1},
{1, 1, 4},
{10, 11, 12, 13});
Simple3D<D>({2, 3, 4},
{0, 1, 2, 3,
4, 5, 6, 7,
8, 9, 10, 11,
12, 13, 14, 15,
16, 17, 18, 19,
20, 21, 22, 23},
{1, 2},
{2, 1, 1},
{5.5, 17.5});
}
template <DeviceType D>
void Simple3Axis() {
Simple<D>({1, 2, 3, 4},
{0, 1, 2, 3,
4, 5, 6, 7,
8, 9, 10, 11,
12, 13, 14, 15,
16, 17, 18, 19,
20, 21, 22, 23},
{1, 2, 3},
{1, 1, 1, 1},
{11.5});
Simple<D>({1, 2, 3, 4},
{0, 1, 2, 3,
4, 5, 6, 7,
8, 9, 10, 11,
12, 13, 14, 15,
16, 17, 18, 19,
20, 21, 22, 23},
{0, 2, 3},
{1, 2, 1, 1},
{5.5, 17.5});
Simple<D>({1, 2, 3, 4},
{0, 1, 2, 3,
4, 5, 6, 7,
8, 9, 10, 11,
12, 13, 14, 15,
16, 17, 18, 19,
20, 21, 22, 23},
{0, 1, 3},
{1, 1, 3, 1},
{7.5, 11.5, 15.5});
Simple<D>({1, 2, 3, 4},
{0, 1, 2, 3,
4, 5, 6, 7,
8, 9, 10, 11,
12, 13, 14, 15,
16, 17, 18, 19,
20, 21, 22, 23},
{0, 1, 2},
{1, 1, 1, 4},
{10, 11, 12, 13});
Simple<D>({1, 3, 3, 3},
{0, 1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26},
{1, 2, 3},
{1, 1, 1, 1},
{13});
Simple<D>({1, 3, 3, 3},
{0, 1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26},
{0, 2, 3},
{1, 3, 1, 1},
{4, 13, 22});
Simple<D>({1, 3, 3, 3},
{0, 1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26},
{0, 1, 3},
{1, 1, 3, 1},
{10, 13, 16});
Simple<D>({1, 3, 3, 3},
{0, 1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26},
{0, 1, 2},
{1, 1, 1, 3},
{12, 13, 14});
}
} // namespace
TEST_F(ReduceMeanOpTest, CPUSimple12) {
Simple12Test<DeviceType::CPU>();
}
TEST_F(ReduceMeanOpTest, GPUSimple12) {
Simple12Test<DeviceType::GPU>();
}
TEST_F(ReduceMeanOpTest, CPUSimple1Axis) {
Simple1Axis<DeviceType::CPU>();
}
TEST_F(ReduceMeanOpTest, CPUSimple2Axis) {
Simple2Axis<DeviceType::CPU>();
}
TEST_F(ReduceMeanOpTest, CPUSimple2Axis3D) {
Simple2Axis3D<DeviceType::CPU>();
}
TEST_F(ReduceMeanOpTest, CPUSimple3Axis) {
Simple3Axis<DeviceType::CPU>();
}
TEST_F(ReduceMeanOpTest, CPUSimpleReduceDims) {
Simple3D<CPU>({2, 3, 4},
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23},
{0, 1},
{4},
{10, 11, 12, 13},
false);
}
namespace {
template <DeviceType D, typename T>
void RandomTest(const std::vector<index_t> &input_shape,
const std::vector<int> &axis) {
testing::internal::LogToStderr();
srand(time(NULL));
// Construct graph
OpsTestNet net;
// Add input data
net.AddRandomInput<D, float>("Input", input_shape);
net.TransformDataFormat<DeviceType::CPU, float>("Input", NHWC, "InputNCHW",
NCHW);
OpDefBuilder("ReduceMean", "ReduceMeanTest")
.Input("InputNCHW")
.AddIntsArg("axis", axis)
.AddIntArg("keepdims", 1)
.Output("OutputNCHW")
.Finalize(net.NewOperatorDef());
// Run
net.RunOp();
net.TransformDataFormat<DeviceType::CPU, float>("OutputNCHW", NCHW,
"Output", NHWC);
OpDefBuilder("ReduceMean", "ReduceMeanTest")
.Input("Input")
.AddIntsArg("axis", axis)
.AddIntArg("keepdims", 1)
.Output("OPENCLOutput")
.Finalize(net.NewOperatorDef());
// Run
net.RunOp(D);
if (DataTypeToEnum<T>::value == DT_FLOAT) {
ExpectTensorNear<float>(*net.GetTensor("Output"),
*net.GetOutput("OPENCLOutput"), 1e-5, 1e-4);
} else {
ExpectTensorNear<float>(*net.GetTensor("Output"),
*net.GetOutput("OPENCLOutput"), 1e-2, 1e-2);
}
}
} // namespace
TEST_F(ReduceMeanOpTest, GPURandomFloat) {
RandomTest<DeviceType::GPU, float>({4, 64, 64, 3}, {1, 2});
RandomTest<DeviceType::GPU, float>({2, 64, 64, 4}, {1, 2});
RandomTest<DeviceType::GPU, float>({8, 128, 128, 64}, {1, 2});
RandomTest<DeviceType::GPU, float>({1, 640, 480, 64}, {1, 2});
RandomTest<DeviceType::GPU, float>({1, 480, 640, 32}, {1, 2});
RandomTest<DeviceType::GPU, float>({1, 512, 512, 16}, {1, 2});
RandomTest<DeviceType::GPU, float>({8, 117, 87, 33}, {1, 2});
RandomTest<DeviceType::GPU, float>({1, 619, 450, 61}, {1, 2});
RandomTest<DeviceType::GPU, float>({1, 511, 561, 11}, {1, 2});
}
TEST_F(ReduceMeanOpTest, GPURandomHalf) {
RandomTest<DeviceType::GPU, half>({4, 64, 64, 3}, {1, 2});
RandomTest<DeviceType::GPU, half>({2, 64, 64, 4}, {1, 2});
RandomTest<DeviceType::GPU, half>({8, 128, 128, 64}, {1, 2});
RandomTest<DeviceType::GPU, half>({1, 640, 480, 64}, {1, 2});
RandomTest<DeviceType::GPU, half>({1, 480, 640, 32}, {1, 2});
RandomTest<DeviceType::GPU, half>({1, 512, 512, 16}, {1, 2});
RandomTest<DeviceType::GPU, half>({8, 117, 87, 33}, {1, 2});
RandomTest<DeviceType::GPU, half>({1, 619, 450, 61}, {1, 2});
RandomTest<DeviceType::GPU, half>({1, 511, 561, 11}, {1, 2});
}
} // namespace test
} // namespace ops
} // namespace mace
......@@ -36,6 +36,7 @@ class ReshapeOp : public Operation {
int unknown_idx = -1;
index_t product = 1;
std::vector<index_t> out_shape;
index_t n = 0;
for (int i = 0; i < num_dims; ++i) {
if (shape_data[i] == -1) {
......@@ -45,8 +46,15 @@ class ReshapeOp : public Operation {
} else {
MACE_CHECK(shape_data[i] >= 0, "Shape must be non-negative: ",
shape_data[i]);
out_shape.push_back(shape_data[i]);
product *= shape_data[i];
if (shape_data[i] == 0) {
MACE_CHECK(i < input->dim_size(),
           "Shape entry 0 at index ", i, " is out of the input dims' range.");
n = input->dim(i);
} else {
n = shape_data[i];
}
out_shape.push_back(n);
product *= n;
}
}
......
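The new zero branch follows ONNX Reshape semantics: a 0 in the target shape copies the input dimension at the same position, while -1 is still inferred from the remaining elements. A Python sketch of the resolution logic:

.. code:: python

    def resolve_reshape(in_shape, shape_data):
        out, unknown = [], -1
        for i, s in enumerate(shape_data):
            if s == -1:
                unknown = i
                out.append(1)            # placeholder, inferred below
            elif s == 0:
                out.append(in_shape[i])  # copy the input dim (ONNX semantics)
            else:
                out.append(s)
        if unknown >= 0:
            total = 1
            for d in in_shape:
                total *= d
            known = 1
            for d in out:
                known *= d
            out[unknown] = total // known
        return out

    print(resolve_reshape([2, 3, 4], [0, -1]))  # [2, 12]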
......@@ -13,6 +13,7 @@ py_library(
"converter_tool/base_converter.py",
"converter_tool/caffe_converter.py",
"converter_tool/hexagon_converter.py",
"converter_tool/onnx_converter.py",
"converter_tool/shape_inference.py",
"converter_tool/tensorflow_converter.py",
"converter_tool/tf_dsp_converter.py",
......
......@@ -101,7 +101,7 @@ def main(unused_args):
file=sys.stderr)
sys.exit(-1)
if FLAGS.platform not in ['tensorflow', 'caffe', 'onnx']:
six.print_("platform %s is not supported." % FLAGS.platform,
file=sys.stderr)
sys.exit(-1)
......@@ -188,6 +188,9 @@ def main(unused_args):
converter = caffe_converter.CaffeConverter(option,
FLAGS.model_file,
FLAGS.weight_file)
elif FLAGS.platform == 'onnx':
from mace.python.tools.converter_tool import onnx_converter
converter = onnx_converter.OnnxConverter(option, FLAGS.model_file)
else:
six.print_("Mace do not support platorm %s yet." % FLAGS.platform,
file=sys.stderr)
......@@ -231,6 +234,7 @@ def parse_args():
type=str,
default="",
help="TensorFlow \'GraphDef\' file to load, "
"Onnx model file .onnx to load, "
"Caffe prototxt file to load.")
parser.add_argument(
"--weight_file", type=str, default="", help="Caffe data file to load.")
......@@ -300,7 +304,10 @@ def parse_args():
parser.add_argument(
"--check_shape", type=str, default="", help="check shape.")
parser.add_argument(
"--platform", type=str, default="tensorflow", help="tensorflow/caffe")
"--platform",
type=str,
default="tensorflow",
help="tensorflow/caffe/onnx")
parser.add_argument(
"--embed_model_data",
type=str2bool,
......
......@@ -37,11 +37,14 @@ class FilterFormat(Enum):
OHWI = 103
# SAME_LOWER: if the amount of padding to add is odd,
# the extra padding goes to the right or bottom
class PaddingMode(Enum):
VALID = 0
SAME = 1
FULL = 2
SAME_LOWER = 3
NA = 4
class PoolingType(Enum):
......@@ -49,6 +52,11 @@ class PoolingType(Enum):
MAX = 2
class RoundMode(Enum):
FLOOR = 0
CEIL = 1
class ActivationType(Enum):
NOOP = 0
RELU = 1
......@@ -56,6 +64,7 @@ class ActivationType(Enum):
PRELU = 3
TANH = 4
SIGMOID = 5
LEAKYRELU = 6
class EltwiseType(Enum):
......@@ -72,9 +81,17 @@ class EltwiseType(Enum):
EQUAL = 10
class ReduceType(Enum):
MEAN = 0
MIN = 1
MAX = 2
PROD = 3
class FrameworkType(Enum):
TENSORFLOW = 0
CAFFE = 1
ONNX = 2
MaceSupportedOps = [
......@@ -108,7 +125,7 @@ MaceSupportedOps = [
'Pooling',
'Proposal',
'Quantize',
'Reduce',
'Reshape',
'ResizeBicubic',
'ResizeBilinear',
......@@ -184,6 +201,10 @@ class MaceKeyword(object):
mace_group_str = "group"
mace_wino_arg_str = "wino_block_size"
mace_quantize_flag_arg_str = "quantize_flag"
mace_epsilon_str = 'epsilon'
mace_reduce_type_str = 'reduce_type'
mace_argmin_str = 'argmin'
mace_round_mode_str = 'round_mode'
class TransformerRule(Enum):
......
......@@ -26,6 +26,7 @@ from mace.python.tools.converter_tool.base_converter import PaddingMode
from mace.python.tools.converter_tool.base_converter import ActivationType
from mace.python.tools.converter_tool.base_converter import EltwiseType
from mace.python.tools.converter_tool.base_converter import FrameworkType
from mace.python.tools.converter_tool.base_converter import ReduceType
from mace.python.tools.converter_tool.base_converter import DataFormat
from mace.python.tools.converter_tool.base_converter import FilterFormat
from mace.python.tools.converter_tool.base_converter import MaceOp
......@@ -465,15 +466,6 @@ class TensorflowConverter(base_converter.ConverterInterface):
"Mace only supports dilation == 1 conv2d_transpose.")
mace_check(len(tf_op.inputs) >= 3,
"deconv should have (>=) 3 inputs.")
output_shape_arg = op.arg.add()
output_shape_arg.name = MaceKeyword.mace_output_shape_str
# if tf_op.inputs[0].op.type == TFOpType.Const.name:
# output_shape_value = \
# tf_op.inputs[0].eval().astype(np.int32).flat
# output_shape_arg.ints.extend(output_shape_value)
# else:
# output_shape_value = {}
# output_shape_arg.ints.extend(output_shape_value)
del op.input[:]
op.input.extend([tf_op.inputs[2].name,
tf_op.inputs[1].name,
......@@ -810,7 +802,12 @@ class TensorflowConverter(base_converter.ConverterInterface):
op = self.convert_general_op(tf_op)
del op.input[1:]
op.type = MaceOp.Reduce.name
reduce_type_arg = op.arg.add()
reduce_type_arg.name = MaceKeyword.mace_reduce_type_str
reduce_type_arg.i = ReduceType.MEAN.value
axis_arg = op.arg.add()
axis_arg.name = MaceKeyword.mace_axis_str
if len(tf_op.inputs) > 1:
......
......@@ -352,21 +352,26 @@ class Transformer(base_converter.ConverterInterface):
if elttype == EltwiseType.SQR_DIFF.value and\
self.consumer_count(op.output[0]) == 1:
consumer_op = self._consumers[op.output[0]][0]
if consumer_op.type == MaceOp.Reduce.name:
axis = ConverterUtil.get_arg(
consumer_op,
MaceKeyword.mace_axis_str).ints
keep_dims = ConverterUtil.get_arg(
consumer_op,
MaceKeyword.mace_keepdims_str).i
reduce_type = ConverterUtil.get_arg(
consumer_op,
MaceKeyword.mace_reduce_type_str).i
if reduce_type == ReduceType.MEAN.value and\
len(consumer_op.input) == 1 and\
axis[0] == 1 and axis[1] == 2 and\
keep_dims > 0:
print("Fold SquaredDiff Reduce: %s" % op.name)
op.type = MaceOp.SqrDiffMean.name
op.output[0] = consumer_op.output[0]
self.replace_quantize_info(op, consumer_op)
self.safe_remove_node(consumer_op, op)
return True
return False
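For reference, the fused ``SqrDiffMean`` produced by this fold computes a per-channel mean squared difference over the spatial axes (NHWC axes 1 and 2, keeping dims), as in this NumPy sketch:

.. code:: python

    import numpy as np

    x = np.random.rand(2, 4, 4, 3).astype(np.float32)
    y = np.random.rand(2, 1, 1, 3).astype(np.float32)
    # SquaredDifference followed by Reduce(MEAN, axis=[1, 2], keepdims), in one op:
    out = np.mean(np.square(x - y), axis=(1, 2), keepdims=True)
    print(out.shape)  # (2, 1, 1, 3)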
......@@ -1005,13 +1010,13 @@ class Transformer(base_converter.ConverterInterface):
'only support squeeze at at [2, 3]')
arg.ints[:] = [1, 2]
elif op.type == MaceOp.Reduce.name:
for arg in op.arg:
if arg.name == MaceKeyword.mace_axis_str:
if ConverterUtil.data_format(
op) == DataFormat.NCHW \
and self._target_data_format == DataFormat.NHWC: # noqa
print("Transpose reduce mean args: %s(%s)"
print("Transpose reduce args: %s(%s)"
% (op.name, op.type))
reduce_axises = list(arg.ints)
new_axises = []
......
......@@ -48,7 +48,7 @@ def _opencl_encrypt_kernel_impl(repository_ctx):
unused_var = repository_ctx.path(Label("//:mace/ops/opencl/cl/pad.cl"))
unused_var = repository_ctx.path(Label("//:mace/ops/opencl/cl/pooling.cl"))
unused_var = repository_ctx.path(Label("//:mace/ops/opencl/cl/pooling_buffer.cl"))
unused_var = repository_ctx.path(Label("//:mace/ops/opencl/cl/reduce_mean.cl"))
unused_var = repository_ctx.path(Label("//:mace/ops/opencl/cl/reduce.cl"))
unused_var = repository_ctx.path(Label("//:mace/ops/opencl/cl/resize_bicubic.cl"))
unused_var = repository_ctx.path(Label("//:mace/ops/opencl/cl/resize_bilinear.cl"))
unused_var = repository_ctx.path(Label("//:mace/ops/opencl/cl/split.cl"))
......
......@@ -362,6 +362,7 @@ class YAMLKeyword(object):
validation_threshold = 'validation_threshold'
graph_optimize_options = 'graph_optimize_options' # internal use for now
cl_mem_type = 'cl_mem_type'
backend = 'backend'
################################
......
......@@ -55,6 +55,7 @@ ModelFormatStrs = [
PlatformTypeStrs = [
"tensorflow",
"caffe",
"onnx",
]
PlatformType = Enum('PlatformType', [(ele, ele) for ele in PlatformTypeStrs],
type=str)
......@@ -469,6 +470,10 @@ def format_model_config(flags):
else:
subgraph[YAMLKeyword.validation_inputs_data] = \
validation_inputs_data
onnx_backend = subgraph.get(
YAMLKeyword.backend, "tensorflow")
subgraph[YAMLKeyword.backend] = onnx_backend
input_ranges = subgraph.get(
YAMLKeyword.input_ranges, [])
if not isinstance(input_ranges, list):
......
......@@ -572,7 +572,8 @@ class DeviceWrapper:
YAMLKeyword.input_data_types],
caffe_env=flags.caffe_env,
validation_threshold=subgraphs[0][
YAMLKeyword.validation_threshold][validate_type],
backend=subgraphs[0][YAMLKeyword.backend]
)
if flags.report and flags.round > 0:
tuned = is_tuned and device_type == DeviceType.GPU
......
# Copyright 2018 Xiaomi, Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function

import sys

import onnx
from onnx import optimizer
# Usage: python onnx_optimizer.py model.onnx model_opt.onnx
def main():
if len(sys.argv) != 3:
print "Usage: python onnx_optimizer.py model.onnx model_opt.onnx"
sys.exit(0)
in_path = sys.argv[1]
out_path = sys.argv[2]
original_model = onnx.load(in_path)
print "Start optimize ONNX model for inference:"
passes = ['eliminate_identity',
'fuse_consecutive_squeezes',
'fuse_consecutive_transposes',
'eliminate_nop_pad',
'eliminate_nop_transpose',
'eliminate_unused_initializer',
'extract_constant_to_initializer',
'fuse_add_bias_into_conv',
'fuse_bn_into_conv',
'fuse_transpose_into_gemm']
for i, opt_pass in enumerate(passes):
    print(i, ".", opt_pass)
optimized_model = optimizer.optimize(original_model, passes)
onnx.save_model(optimized_model, out_path)
print "Optimize Finished!"
print "Please check new model in:", out_path
if __name__ == '__main__':
main()
......@@ -621,7 +621,8 @@ def validate_model(abi,
caffe_env,
input_file_name="model_input",
output_file_name="model_out",
validation_threshold=0.9,
backend="tensorflow"):
six.print_("* Validate with %s" % platform)
if abi != "host":
for output_name in output_nodes:
......@@ -638,7 +639,14 @@ def validate_model(abi,
"%s/%s" % (model_output_dir, output_file_name), device_type,
":".join(input_shapes), ":".join(output_shapes),
",".join(input_nodes), ",".join(output_nodes),
validation_threshold, ",".join(input_data_types))
validation_threshold, ",".join(input_data_types), backend)
elif platform == "onnx":
validate(platform, model_file_path, "",
"%s/%s" % (model_output_dir, input_file_name),
"%s/%s" % (model_output_dir, output_file_name), device_type,
":".join(input_shapes), ":".join(output_shapes),
",".join(input_nodes), ",".join(output_nodes),
validation_threshold, ",".join(input_data_types), backend)
elif platform == "caffe":
image_name = "mace-caffe:latest"
container_name = "mace_caffe_validator"
......@@ -654,7 +662,7 @@ def validate_model(abi,
device_type,
":".join(input_shapes), ":".join(output_shapes),
",".join(input_nodes), ",".join(output_nodes),
validation_threshold, ",".join(input_data_types))
validation_threshold, ",".join(input_data_types), backend)
elif caffe_env == common.CaffeEnvType.DOCKER:
docker_image_id = sh.docker("images", "-q", image_name)
if not docker_image_id:
......@@ -720,6 +728,7 @@ def validate_model(abi,
"--output_shape=%s" % ":".join(output_shapes),
"--validation_threshold=%f" % validation_threshold,
"--input_data_type=%s" % ",".join(input_data_types),
"--backend=%s" % ",".join(backend),
_fg=True)
six.print_("Validation done!\n")
......
......@@ -21,6 +21,10 @@ import re
import common
import onnx
from onnx import helper
from onnx import TensorProto
# Validation Flow:
# 1. Generate input data
# 2. Use mace_run to run model on phone.
......@@ -190,9 +194,64 @@ def validate_caffe_model(platform, device_type, model_file, input_file,
value, validation_threshold)
def validate_onnx_model(platform, device_type, model_file, input_file,
mace_out_file, input_names, input_shapes,
output_names, output_shapes, validation_threshold,
input_data_types, backend):
if backend == "tensorflow":
from onnx_tf.backend import prepare
print "valivate on onnx tensorflow backend."
elif backend == "caffe2" or backend == "pytorch":
from caffe2.python.onnx.backend import prepare
print "valivate on onnx caffe2 backend."
else:
common.MaceLogger.error(
VALIDATION_MODULE,
"onnx backend framwork '" + backend + "' is invalid.")
if not os.path.isfile(model_file):
common.MaceLogger.error(
VALIDATION_MODULE,
"Input graph file '" + model_file + "' does not exist!")
model = onnx.load(model_file)
input_dict = {}
for i in range(len(input_names)):
input_value = load_data(common.formatted_file_name(input_file,
input_names[i]),
input_data_types[i])
input_value = input_value.reshape(input_shapes[i]).transpose((0, 3, 1,
2))
input_dict[input_names[i]] = input_value
onnx_outputs = []
for i in range(len(output_names)):
out_shape = output_shapes[i]
if len(out_shape) == 4:
out_shape[1], out_shape[2], out_shape[3] = \
out_shape[3], out_shape[1], out_shape[2]
onnx_outputs.append(
helper.make_tensor_value_info(output_names[i],
TensorProto.FLOAT,
out_shape))
model.graph.output.extend(onnx_outputs)
rep = prepare(model)
output_values = rep.run(input_dict)
for i in range(len(output_names)):
out_name = output_names[i]
value = output_values[out_name].flatten()
out_shape = output_shapes[i]
if len(out_shape) == 4:
value = value.reshape(out_shape).transpose((0, 2, 3, 1))
output_file_name = common.formatted_file_name(mace_out_file,
output_names[i])
mace_out_value = load_data(output_file_name)
compare_output(platform, device_type, output_names[i],
mace_out_value, value,
validation_threshold)
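The transposes above bridge MACE's NHWC data files and the NCHW layout ONNX backends expect; a round-trip sanity check of the two permutations:

.. code:: python

    import numpy as np

    nhwc = np.arange(2 * 4 * 5 * 3, dtype=np.float32).reshape(2, 4, 5, 3)
    nchw = nhwc.transpose((0, 3, 1, 2))   # feed the ONNX backend
    back = nchw.transpose((0, 2, 3, 1))   # back to NHWC for comparison
    assert np.array_equal(nhwc, back)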
def validate(platform, model_file, weight_file, input_file, mace_out_file,
device_type, input_shape, output_shape, input_node, output_node,
validation_threshold, input_data_type, backend):
input_names = [name for name in input_node.split(',')]
input_shape_strs = [shape for shape in input_shape.split(':')]
input_shapes = [[int(x) for x in shape.split(',')]
......@@ -217,6 +276,15 @@ def validate(platform, model_file, weight_file, input_file, mace_out_file,
mace_out_file, weight_file, input_names,
input_shapes, output_names, output_shapes,
validation_threshold)
elif platform == 'onnx':
output_shape_strs = [shape for shape in output_shape.split(':')]
output_shapes = [[int(x) for x in shape.split(',')]
for shape in output_shape_strs]
validate_onnx_model(platform, device_type, model_file, input_file,
mace_out_file, input_names, input_shapes,
output_names, output_shapes,
validation_threshold,
input_data_types, backend)
def parse_args():
......@@ -259,6 +327,11 @@ def parse_args():
parser.add_argument(
"--validation_threshold", type=float, default=0.995,
help="validation similarity threshold")
parser.add_argument(
"--backend",
type=str,
default="tensorflow",
help="onnx backend framwork")
return parser.parse_known_args()
......@@ -276,4 +349,5 @@ if __name__ == '__main__':
FLAGS.input_node,
FLAGS.output_node,
FLAGS.validation_threshold,
FLAGS.input_data_type,
FLAGS.backend)