Commit 81865af1 authored by yejianwu

Merge branch 'master' of v9.git.n.xiaomi.com:deep-computing/mace into load_model_in_pb

@@ -12,12 +12,17 @@
mobile heterogeneous computing platforms. The design is focused on the following
targets:
* Performance
-  * The runtime is highly optimized with NEON, OpenCL and HVX. Except for the
-    inference speed, the initialization speed is also intensively optimized.
+  * The runtime is highly optimized with NEON, OpenCL and Hexagon, and the
+    [Winograd algorithm](https://arxiv.org/abs/1509.09308) is introduced to
+    speed up the convolution operations. Besides the inference speed, the
+    initialization speed is also intensively optimized.
* Power consumption
-  * Chip dependent power options are included as advanced APIs.
+  * Chip dependent power options like big.LITTLE scheduling and Adreno GPU
+    hints are included as advanced APIs.
* Memory usage and library footprint
  * Graph level memory allocation optimization and buffer reuse are supported.
+    The core library tries to keep minimal external dependencies to keep the
+    library footprint small.
* Model protection
  * Model protection is one of the highest priority features from the beginning
    of the design. Various techniques are introduced, like converting models to C++
@@ -28,31 +33,34 @@ targets:
architectures with limited performance.

## Getting Started
+* [Introduction](docs/getting_started/introduction.rst)
+* [How to build](docs/getting_started/how_to_build.rst)
+* [Create a model deployment file](docs/getting_started/create_a_model_deployment.rst)

## Performance
-[MiAI Model Zoo](http://v9.git.n.xiaomi.com/deep-computing/mace-models) contains
-several common neural networks models and built daily against several mobile
-phones. The benchmark result can be found in the CI result page.
+[MiAI Compute Engine Model Zoo](http://v9.git.n.xiaomi.com/deep-computing/mace-models)
+contains several common neural network models, which are built daily against a
+list of mobile phones. The benchmark results can be found in the CI result page.

## Communication
* GitHub issues: bug reports, usage issues, feature requests
-* Gitter or Slack:
-* QQ群:
+* Gitter:
+* QQ group: 756046893

## Contributing
Any kind of contribution is welcome. For bug reports and feature requests,
please just open an issue without any hesitation. For code contributions, it's
strongly suggested to open an issue for discussion first. For more details,
-please refer to [this guide](docs).
+please refer to [the contribution guide](docs/development/contributing.md).

## License
[Apache License 2.0](LICENSE).

## Acknowledgement
-*MiAI Compute Engine* depends on several open source projects located in
+MiAI Compute Engine depends on several open source projects located in the
[third_party](mace/third_party) directory. Particularly, we learned a lot from
the following projects during the development:
-* [nnlib](https://source.codeaurora.org/quic/hexagon_nn/nnlib): the DSP runtime
-  depends on this library.
+* [Qualcomm Hexagon NN Offload Framework](https://source.codeaurora.org/quic/hexagon_nn/nnlib):
+  the Hexagon DSP runtime depends on this library.
* [TensorFlow](https://github.com/tensorflow/tensorflow),
  [Caffe](https://github.com/BVLC/caffe),
......
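As background for the Winograd claim in the Performance bullet above: the cited paper's 1-D minimal filtering instance F(2, 3) computes two outputs of a 3-tap filter with four multiplications instead of six. A sketch of the standard transforms (notation from Lavin and Gray, arXiv:1509.09308; the 2-D variant used for 3x3 convolutions nests the same matrices):

$$ Y = A^T \left[ (G g) \odot (B^T d) \right], \qquad
B^T = \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}, \quad
G = \begin{bmatrix} 1 & 0 & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\ 0 & 0 & 1 \end{bmatrix}, \quad
A^T = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{bmatrix} $$

where $d$ is a 4-element input tile, $g$ is the 3-tap filter, and $\odot$ is element-wise multiplication.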
@@ -5,10 +5,12 @@ v0.6.0 (2018-04-04)
------
1. Change mace header interfaces, only including necessary methods.

+v0.6.3 (2018-05-21)
+------
+1. Support `float` data_type when running on GPU.
+
v0.7.0 (2018-05-18)
------
1. Change interfaces that report error types.
2. Improve CPU performance.
3. Merge the CPU/GPU engines into one.
\ No newline at end of file
FROM ubuntu:16.04

# Update source
+# Looks like mirrors.163.com does not work in the Xiaomi network
+# RUN sed -i 's/http:\/\/archive\.ubuntu\.com\/ubuntu\//http:\/\/mirrors\.163\.com\/ubuntu\//g' /etc/apt/sources.list
RUN apt-get update -y

## Basic tools
......
FROM cr.d.xiaomi.net/mace/mace-dev:latest

# Update source
+# Looks like mirrors.163.com does not work in the Xiaomi network
+# RUN sed -i 's/http:\/\/archive\.ubuntu\.com\/ubuntu\//http:\/\/mirrors\.163\.com\/ubuntu\//g' /etc/apt/sources.list
RUN apt-get update -y

# Install gitlab runner
......
docs/getting_started/mace-arch.png: image updated (18.2 KB -> 18.0 KB)
@@ -65,6 +65,18 @@ extern void Conv2dNeonK7x7S3(const float *input,
                             const index_t *out_shape,
                             float *output);
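// Note: in_shape/out_shape here are NCHW extents {N, C, H, W} (inferred from
// how the kernel definitions below index them: [1] = channels, [2] = height,
// [3] = width).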
extern void Conv2dNeonK1x15S1(const float *input,
const float *filter,
const index_t *in_shape,
const index_t *out_shape,
float *output);
extern void Conv2dNeonK15x1S1(const float *input,
const float *filter,
const index_t *in_shape,
const index_t *out_shape,
float *output);
} // namespace kernels
} // namespace mace
......
// Copyright 2018 Xiaomi, Inc. All rights reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#if defined(MACE_ENABLE_NEON)
#include <arm_neon.h>
#endif
#include "mace/kernels/arm/conv_2d_neon.h"
#include "mace/utils/utils.h"
namespace mace {
namespace kernels {
inline void Conv2dCPUK15x1Calc(const float *in_ptr,
const float *filter_ptr,
const index_t in_width,
const index_t in_channels,
const index_t out_height,
const index_t out_width,
const index_t w,
const index_t tile_width,
const index_t out_image_size,
float *out_ptr,
const index_t io,
const int stride) {
for (index_t ih = 0; ih < out_height; ++ih) {
for (index_t iw = 0; iw < tile_width && w + iw < out_width; ++iw) {
for (int i = 0; i < 15; ++i) {
for (int j = 0; j < 1; ++j) {
out_ptr[io * out_image_size + ih * out_width + w + iw]
+= in_ptr[(ih * stride + i) * in_width + ((w + iw) * stride + j)]
* filter_ptr[io * in_channels * 15 + i * 1 + j];
}
}
}
}
}
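The scalar fallback above is the direct form of the 15x1 convolution; with stride 1 (as this kernel is dispatched), each output element reduces to

$$ \text{out}[h, w] \mathrel{+}= \sum_{i=0}^{14} \text{in}[h + i,\, w] \cdot f[i] $$

and the NEON path below computes the same sum for four output rows at once.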
// Ho = 4, Wo = 1, Co = 1
void Conv2dNeonK15x1S1(const float *input,
const float *filter,
const index_t *in_shape,
const index_t *out_shape,
float *output) {
const index_t in_image_size = in_shape[2] * in_shape[3];
const index_t out_image_size = out_shape[2] * out_shape[3];
const index_t in_batch_size = in_shape[1] * in_image_size;
const index_t out_batch_size = out_shape[1] * out_image_size;
const index_t tile_width =
out_shape[1] < 4 ? RoundUpDiv4(out_shape[3]) : out_shape[3];
#pragma omp parallel for collapse(3)
for (index_t b = 0; b < out_shape[0]; ++b) {
for (index_t m = 0; m < out_shape[1]; ++m) {
for (index_t w = 0; w < out_shape[3]; w += tile_width) {
const index_t out_height = out_shape[2];
const index_t out_width = out_shape[3];
const index_t in_channels = in_shape[1];
const index_t in_width = in_shape[3];
float *out_ptr_base =
output + b * out_batch_size + m * out_image_size;
for (index_t c = 0; c < in_channels; ++c) {
const float *in_ptr_base =
input + b * in_batch_size + c * in_image_size;
const float *filter_ptr = filter + m * in_channels * 15 + c * 15;
#if defined(MACE_ENABLE_NEON) && !defined(__aarch64__)
/* load filter (1 outch x 15 height x 1 width, in four quad registers) */
float32x4_t vf0, vf1, vf2, vf3;
vf0 = vld1q_f32(filter_ptr);
vf1 = vld1q_f32(filter_ptr + 4);
vf2 = vld1q_f32(filter_ptr + 8);
vf3 = vld1q_f32(filter_ptr + 11);  // overlaps vf2: holds taps 11-14
for (index_t h = 0; h + 3 < out_height; h += 4) {
for (index_t wt = 0; wt < tile_width && w + wt < out_width; ++wt) {
// load output
index_t out_offset = h * out_width + w + wt;
// output (1 outch x 4 height x 1 width): vo
float32x4_t vo = {out_ptr_base[out_offset],
out_ptr_base[out_offset + out_width],
out_ptr_base[out_offset + 2 * out_width],
out_ptr_base[out_offset + 3 * out_width]};
// input offset
index_t in_offset = h * in_width + w + wt;
// input (vertical sliding windows over 18 rows of this column)
float32x4_t vi0 = {in_ptr_base[in_offset],
in_ptr_base[in_offset + in_width],
in_ptr_base[in_offset + 2 * in_width],
in_ptr_base[in_offset + 3 * in_width]};
float32x4_t vi4 = {in_ptr_base[in_offset + 4 * in_width],
in_ptr_base[in_offset + 5 * in_width],
in_ptr_base[in_offset + 6 * in_width],
in_ptr_base[in_offset + 7 * in_width]};
float32x4_t vi8 = {in_ptr_base[in_offset + 8 * in_width],
in_ptr_base[in_offset + 9 * in_width],
in_ptr_base[in_offset + 10 * in_width],
in_ptr_base[in_offset + 11 * in_width]};
float32x4_t vi12 = {in_ptr_base[in_offset + 12 * in_width],
in_ptr_base[in_offset + 13 * in_width],
in_ptr_base[in_offset + 14 * in_width],
in_ptr_base[in_offset + 15 * in_width]};
float32x4_t vi16 = {in_ptr_base[in_offset + 16 * in_width],
in_ptr_base[in_offset + 17 * in_width]};
float32x4_t vi1 = vextq_f32(vi0, vi4, 1);
float32x4_t vi2 = vextq_f32(vi0, vi4, 2);
float32x4_t vi3 = vextq_f32(vi0, vi4, 3);
float32x4_t vi5 = vextq_f32(vi4, vi8, 1);
float32x4_t vi6 = vextq_f32(vi4, vi8, 2);
float32x4_t vi7 = vextq_f32(vi4, vi8, 3);
float32x4_t vi9 = vextq_f32(vi8, vi12, 1);
float32x4_t vi10 = vextq_f32(vi8, vi12, 2);
float32x4_t vi11 = vextq_f32(vi8, vi12, 3);
float32x4_t vi13 = vextq_f32(vi12, vi16, 1);
float32x4_t vi14 = vextq_f32(vi12, vi16, 2);
vo = vmlaq_lane_f32(vo, vi0, vget_low_f32(vf0), 0);
vo = vmlaq_lane_f32(vo, vi1, vget_low_f32(vf0), 1);
vo = vmlaq_lane_f32(vo, vi2, vget_high_f32(vf0), 0);
vo = vmlaq_lane_f32(vo, vi3, vget_high_f32(vf0), 1);
vo = vmlaq_lane_f32(vo, vi4, vget_low_f32(vf1), 0);
vo = vmlaq_lane_f32(vo, vi5, vget_low_f32(vf1), 1);
vo = vmlaq_lane_f32(vo, vi6, vget_high_f32(vf1), 0);
vo = vmlaq_lane_f32(vo, vi7, vget_high_f32(vf1), 1);
vo = vmlaq_lane_f32(vo, vi8, vget_low_f32(vf2), 0);
vo = vmlaq_lane_f32(vo, vi9, vget_low_f32(vf2), 1);
vo = vmlaq_lane_f32(vo, vi10, vget_high_f32(vf2), 0);
vo = vmlaq_lane_f32(vo, vi11, vget_high_f32(vf2), 1);
vo = vmlaq_lane_f32(vo, vi12, vget_low_f32(vf3), 1);
vo = vmlaq_lane_f32(vo, vi13, vget_high_f32(vf3), 0);
vo = vmlaq_lane_f32(vo, vi14, vget_high_f32(vf3), 1);
out_ptr_base[out_offset] = vo[0];
out_ptr_base[out_offset + out_width] = vo[1];
out_ptr_base[out_offset + 2 * out_width] = vo[2];
out_ptr_base[out_offset + 3 * out_width] = vo[3];
} // wt
} // h
#else
Conv2dCPUK15x1Calc(in_ptr_base, filter_ptr, in_width, in_channels,
out_height, out_width, w, tile_width,
out_image_size, out_ptr_base, 0, 1);
#endif
} // c
} // w
} // m
} // b
}
} // namespace kernels
} // namespace mace
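The NEON path above synthesizes the 15 shifted input windows (vi1, vi2, vi3, vi5, ...) from five loads using vextq_f32 instead of issuing 15 overlapping loads. A minimal, self-contained illustration of that intrinsic (hypothetical snippet, not part of the commit; builds only for an ARM/NEON target):

#include <arm_neon.h>
#include <cstdio>

int main() {
  const float data[8] = {0.f, 1.f, 2.f, 3.f, 4.f, 5.f, 6.f, 7.f};
  float32x4_t lo = vld1q_f32(data);      // lanes {0, 1, 2, 3}
  float32x4_t hi = vld1q_f32(data + 4);  // lanes {4, 5, 6, 7}
  // vextq_f32(lo, hi, n) treats lo:hi as one 8-lane vector and extracts
  // 4 lanes starting at lane n, i.e. the input window shifted by n.
  float32x4_t w1 = vextq_f32(lo, hi, 1);  // {1, 2, 3, 4}
  float32x4_t w3 = vextq_f32(lo, hi, 3);  // {3, 4, 5, 6}
  float out1[4], out3[4];
  vst1q_f32(out1, w1);
  vst1q_f32(out3, w3);
  printf("w1 = %g %g %g %g\n", out1[0], out1[1], out1[2], out1[3]);
  printf("w3 = %g %g %g %g\n", out3[0], out3[1], out3[2], out3[3]);
  return 0;
}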
// Copyright 2018 Xiaomi, Inc. All rights reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#if defined(MACE_ENABLE_NEON)
#include <arm_neon.h>
#endif
#include "mace/kernels/arm/conv_2d_neon.h"
#include "mace/utils/utils.h"
#include "mace/utils/logging.h"
namespace mace {
namespace kernels {
inline void Conv2dCPUK1x15Calc(const float *in_ptr,
const float *filter_ptr,
const index_t in_width,
const index_t in_channels,
const index_t out_height,
const index_t h,
const index_t tile_height,
const index_t out_width,
const index_t out_image_size,
float *out_ptr,
const index_t io,
const int stride) {
for (index_t ih = 0; ih < tile_height && h + ih < out_height; ++ih) {
for (index_t iw = 0; iw < out_width; ++iw) {
for (int i = 0; i < 1; ++i) {
for (int j = 0; j < 15; ++j) {
out_ptr[io * out_image_size + (h + ih) * out_width + iw]
+= in_ptr[((h + ih) * stride + i) * in_width + (iw * stride + j)]
* filter_ptr[io * in_channels * 15 + i * 15 + j];
}
}
}
}
}
// Ho = 1, Wo = 4, Co = 1
void Conv2dNeonK1x15S1(const float *input,
const float *filter,
const index_t *in_shape,
const index_t *out_shape,
float *output) {
const index_t in_image_size = in_shape[2] * in_shape[3];
const index_t out_image_size = out_shape[2] * out_shape[3];
const index_t in_batch_size = in_shape[1] * in_image_size;
const index_t out_batch_size = out_shape[1] * out_image_size;
const index_t tile_height =
out_shape[1] < 4 ? RoundUpDiv4(out_shape[2]) : out_shape[2];
#pragma omp parallel for collapse(3)
for (index_t b = 0; b < out_shape[0]; ++b) {
for (index_t m = 0; m < out_shape[1]; ++m) {
for (index_t h = 0; h < out_shape[2]; h += tile_height) {
const index_t out_height = out_shape[2];
const index_t out_width = out_shape[3];
const index_t in_channels = in_shape[1];
const index_t in_width = in_shape[3];
float *out_ptr_base =
output + b * out_batch_size + m * out_image_size;
for (index_t c = 0; c < in_channels; ++c) {
const float *in_ptr_base =
input + b * in_batch_size + c * in_image_size;
const float *filter_ptr = filter + m * in_channels * 15 + c * 15;
#if defined(MACE_ENABLE_NEON) && !defined(__aarch64__)
/* load filter (1 outch x 1 height x 15 width, in four quad registers) */
float32x4_t vf0, vf1, vf2, vf3;
vf0 = vld1q_f32(filter_ptr);
vf1 = vld1q_f32(filter_ptr + 4);
vf2 = vld1q_f32(filter_ptr + 8);
vf3 = vld1q_f32(filter_ptr + 11);  // overlaps vf2: holds taps 11-14
for (index_t ht = 0; ht < tile_height && h + ht < out_height; ++ht) {
for (index_t w = 0; w + 3 < out_width; w += 4) {
// output (1 outch x 1 height x 4 width): vo_outch_height
float32x4_t vo;
// load output
index_t out_offset = (h + ht) * out_width + w;
vo = vld1q_f32(out_ptr_base + out_offset);
// input (horizontal sliding windows over 18 columns of this row)
float32x4_t vi0, vi1, vi2, vi3, vi4, vi5, vi6, vi7, vi8, vi9,
vi10, vi11, vi12, vi13, vi14, vi16;
// input offset
index_t in_offset = (h + ht) * in_width + w;
// load input
vi0 = vld1q_f32(in_ptr_base + in_offset);
vi4 = vld1q_f32(in_ptr_base + in_offset + 4);
vi8 = vld1q_f32(in_ptr_base + in_offset + 8);
vi12 = vld1q_f32(in_ptr_base + in_offset + 12);
vi16 = vld1q_f32(in_ptr_base + in_offset + 16);
vi1 = vextq_f32(vi0, vi4, 1);
vi2 = vextq_f32(vi0, vi4, 2);
vi3 = vextq_f32(vi0, vi4, 3);
vi5 = vextq_f32(vi4, vi8, 1);
vi6 = vextq_f32(vi4, vi8, 2);
vi7 = vextq_f32(vi4, vi8, 3);
vi9 = vextq_f32(vi8, vi12, 1);
vi10 = vextq_f32(vi8, vi12, 2);
vi11 = vextq_f32(vi8, vi12, 3);
vi13 = vextq_f32(vi12, vi16, 1);
vi14 = vextq_f32(vi12, vi16, 2);
vo = vmlaq_lane_f32(vo, vi0, vget_low_f32(vf0), 0);
vo = vmlaq_lane_f32(vo, vi1, vget_low_f32(vf0), 1);
vo = vmlaq_lane_f32(vo, vi2, vget_high_f32(vf0), 0);
vo = vmlaq_lane_f32(vo, vi3, vget_high_f32(vf0), 1);
vo = vmlaq_lane_f32(vo, vi4, vget_low_f32(vf1), 0);
vo = vmlaq_lane_f32(vo, vi5, vget_low_f32(vf1), 1);
vo = vmlaq_lane_f32(vo, vi6, vget_high_f32(vf1), 0);
vo = vmlaq_lane_f32(vo, vi7, vget_high_f32(vf1), 1);
vo = vmlaq_lane_f32(vo, vi8, vget_low_f32(vf2), 0);
vo = vmlaq_lane_f32(vo, vi9, vget_low_f32(vf2), 1);
vo = vmlaq_lane_f32(vo, vi10, vget_high_f32(vf2), 0);
vo = vmlaq_lane_f32(vo, vi11, vget_high_f32(vf2), 1);
vo = vmlaq_lane_f32(vo, vi12, vget_low_f32(vf3), 1);
vo = vmlaq_lane_f32(vo, vi13, vget_high_f32(vf3), 0);
vo = vmlaq_lane_f32(vo, vi14, vget_high_f32(vf3), 1);
vst1q_f32(out_ptr_base + out_offset, vo);
} // w
} // ht
#else
Conv2dCPUK1x15Calc(in_ptr_base, filter_ptr, in_width, in_channels,
out_height, h, tile_height, out_width,
out_image_size, out_ptr_base, 0, 1);
#endif
} // c
} // h
} // m
} // b
}
} // namespace kernels
} // namespace mace
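A note on the five input loads (vi0, vi4, vi8, vi12, vi16) shared by both kernels above: producing four outputs along the filtered axis with a 15-tap, stride-1, dilation-1 filter touches

$$ W_{\text{in}} = (W_{\text{out}} - 1)\, s + (K - 1)\, d + 1 = (4 - 1) + (15 - 1) + 1 = 18 $$

consecutive input values, so five 4-lane registers (20 lanes) are loaded and the last two lanes of vi16 go unused (the 15x1 kernel accordingly initializes only two of them).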
This diff is collapsed.
@@ -363,6 +363,10 @@ struct Conv2dFunctor<DeviceType::CPU, float> : Conv2dFunctorBase {
      && stride_h == 2 && stride_w == 2 && dilation_h == 1 && dilation_w == 1;
  bool use_neon_7x7_s3 = filter_h == 7 && filter_w == 7
      && stride_h == 3 && stride_w == 3 && dilation_h == 1 && dilation_w == 1;
bool use_neon_1x15_s1 = filter_h == 1 && filter_w == 15
&& stride_h == 1 && stride_w == 1 && dilation_h == 1 && dilation_w == 1;
bool use_neon_15x1_s1 = filter_h == 15 && filter_w == 1
&& stride_h == 1 && stride_w == 1 && dilation_h == 1 && dilation_w == 1;
  std::vector<index_t> transformed_input_shape;
  std::vector<index_t> transformed_output_shape;
@@ -402,24 +406,26 @@ struct Conv2dFunctor<DeviceType::CPU, float> : Conv2dFunctorBase {
                                       tile_count});
    transformed_filter_shape.insert(transformed_filter_shape.end(),
                                    {in_tile_area, channels, input_channels});
-  } else if (use_neon_3x3_s1) {
-    extra_output_height = RoundUp<index_t>(height, 2);
-    extra_input_height =
-        std::max(padded_input_height, extra_output_height + 2);
-    extra_output_width = RoundUp<index_t>(width, 4);
-    extra_input_width = std::max(padded_input_width, extra_output_width + 2);
-    if (extra_input_height != padded_input_height) {
-      pad_bottom += (extra_input_height - padded_input_height);
-    }
-    if (extra_input_width != padded_input_width) {
-      pad_right += (extra_input_width - padded_input_width);
-    }
-  } else if (!use_neon_1x1_s1) {
-    extra_output_height = height;
+  } else {
+    index_t tile_h, tile_w;
+    if (use_neon_1x1_s1) {
+      tile_h = 1;
+      tile_w = 1;
+    } else if (use_neon_3x3_s1) {
+      tile_h = 2;
+      tile_w = 4;
+    } else if (use_neon_15x1_s1) {
+      tile_h = 4;
+      tile_w = 1;
+    } else {
+      tile_h = 1;
+      tile_w = 4;
+    }
+    extra_output_height = RoundUp<index_t>(height, tile_h);
    extra_input_height =
        std::max(padded_input_height, (extra_output_height - 1) * stride_h
            + (filter_h - 1) * dilation_h + 1);
-    extra_output_width = RoundUp<index_t>(width, 4);
+    extra_output_width = RoundUp<index_t>(width, tile_w);
    extra_input_width =
        std::max(padded_input_width, (extra_output_width - 1) * stride_w
            + (filter_w - 1) * dilation_w + 1);
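The generalized tiling above sizes the padded extents directly from the convolution arithmetic; per axis,

$$ I_{\text{extra}} = \max\bigl(I_{\text{padded}},\ (O_{\text{extra}} - 1)\, s + (K - 1)\, d + 1\bigr). $$

For instance (illustrative numbers), the 15x1 S1 kernel with tile_h = 4 and output height 63 rounds extra_output_height up to 64 and therefore needs at least (64 - 1) * 1 + (15 - 1) * 1 + 1 = 78 input rows.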
@@ -584,6 +590,22 @@ struct Conv2dFunctor<DeviceType::CPU, float> : Conv2dFunctorBase {
                      extra_output_shape,
                      pad_output);
      };
} else if (use_neon_1x15_s1) {
conv_func = [=](const float *pad_input, float *pad_output) {
Conv2dNeonK1x15S1(pad_input,
filter_data,
extra_input_shape,
extra_output_shape,
pad_output);
};
} else if (use_neon_15x1_s1) {
conv_func = [=](const float *pad_input, float *pad_output) {
Conv2dNeonK15x1S1(pad_input,
filter_data,
extra_input_shape,
extra_output_shape,
pad_output);
};
    } else {
      conv_func = [=](const float *pad_input, float *pad_output) {
        Conv2dGeneral(pad_input,
......
@@ -43,6 +43,7 @@ struct SoftmaxFunctor<DeviceType::CPU, float> {
    const index_t batch = input->dim(0);
    const index_t class_count = input->dim(1);
    const index_t class_size = input->dim(2) * input->dim(3);
const index_t batch_size = class_count * class_size;
    Tensor::MappingGuard input_guard(input);
    Tensor::MappingGuard output_guard(output);
@@ -50,46 +51,37 @@ struct SoftmaxFunctor<DeviceType::CPU, float> {
    float *output_data = output->mutable_data<float>();
    for (index_t b = 0; b < batch; ++b) {
-      std::vector<float>
-          max_val(class_size, std::numeric_limits<float>::lowest());
-      std::vector<float> sum_val(class_size, 0.f);
-
-      // calculate max for each class
-      for (index_t c = 0; c < class_count; ++c) {
-        const float
-            *input_ptr = input_data + (b * class_count + c) * class_size;
-        for (index_t k = 0; k < class_size; ++k) {
-          max_val[k] = std::max(max_val[k], input_ptr[k]);
-        }
-      }
-
-      // calculate data - max for each class
#pragma omp parallel for
-      for (index_t c = 0; c < class_count; ++c) {
-        const float
-            *input_ptr = input_data + (b * class_count + c) * class_size;
-        float *output_ptr = output_data + (b * class_count + c) * class_size;
-        for (index_t k = 0; k < class_size; ++k) {
-          output_ptr[k] = ::exp(input_ptr[k] - max_val[k]);
-        }
-      }
-
-      // calculate sum for each class
-      for (index_t c = 0; c < class_count; ++c) {
-        float *output_ptr = output_data + (b * class_count + c) * class_size;
-        for (index_t k = 0; k < class_size; ++k) {
-          sum_val[k] += output_ptr[k];
-        }
-      }
-
-      // calculate (data - max) / sum for each class
-      for (index_t c = 0; c < class_count; ++c) {
-        float *output_ptr = output_data + (b * class_count + c) * class_size;
-        for (index_t k = 0; k < class_size; ++k) {
-          output_ptr[k] /= sum_val[k];
-        }
-      }
-    }
+      for (index_t k = 0; k < class_size; ++k) {
+        const float *input_ptr = input_data + b * batch_size + k;
+        float *output_ptr = output_data + b * batch_size + k;
+
+        float max_val = std::numeric_limits<float>::lowest();
+        index_t channel_offset = 0;
+        for (index_t c = 0; c < class_count; ++c) {
+          float data = input_ptr[channel_offset];
+          if (data > max_val) {
+            max_val = data;
+          }
+          channel_offset += class_size;
+        }
+
+        channel_offset = 0;
+        float sum = 0;
+        for (index_t c = 0; c < class_count; ++c) {
+          float exp_value = ::exp(input_ptr[channel_offset] - max_val);
+          sum += exp_value;
+          output_ptr[channel_offset] = exp_value;
+          channel_offset += class_size;
+        }
+
+        channel_offset = 0;
+        for (index_t c = 0; c < class_count; ++c) {
+          output_ptr[channel_offset] /= sum;
+          channel_offset += class_size;
+        }
+      }  // k
+    }  // b
};
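The rewrite above swaps the loop nest so each spatial position k walks its own channel strip (entries class_size apart in NCHW) and the OpenMP loop parallelizes over positions. It computes the usual numerically stable form in three passes:

$$ m = \max_{c} x_c, \qquad y_c = \frac{e^{x_c - m}}{\sum_{j} e^{x_j - m}}. $$

Subtracting m does not change the result but keeps every exponent non-positive, avoiding overflow in exp.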
......
@@ -15,6 +15,10 @@
#ifndef MACE_KERNELS_TRANSPOSE_H_
#define MACE_KERNELS_TRANSPOSE_H_
#if defined(MACE_ENABLE_NEON)
#include <arm_neon.h>
#endif
#include <vector>

#include "mace/core/future.h"
@@ -25,6 +29,65 @@
namespace mace {
namespace kernels {
static void TransposeNHWCToNCHWC3(const float *input,
float *output,
const index_t height,
const index_t width) {
index_t image_size = height * width;
#pragma omp parallel for
for (index_t h = 0; h < height; ++h) {
index_t in_offset = h * width * 3;
index_t out_offset = h * width;
index_t w;
for (w = 0; w + 3 < width; w += 4) {
float32x4x3_t vi = vld3q_f32(input + in_offset);
vst1q_f32(output + out_offset, vi.val[0]);
vst1q_f32(output + out_offset + image_size, vi.val[1]);
vst1q_f32(output + out_offset + image_size * 2, vi.val[2]);
in_offset += 12;
out_offset += 4;
}
for (; w < width; ++w) {
for (index_t c = 0; c < 3; ++c) {
output[h * width + image_size * c + w] =
input[h * width * 3 + w * 3 + c];
}
}
}
}
static void TransposeNCHWToNHWCC2(const float *input,
float *output,
const index_t height,
const index_t width) {
index_t image_size = height * width;
#pragma omp parallel for
for (index_t h = 0; h < height; ++h) {
index_t in_offset = h * width;
index_t out_offset = h * width * 2;
index_t w;
for (w = 0; w + 3 < width; w += 4) {
float32x4_t vi0 = vld1q_f32(input + in_offset);
float32x4_t vi1 = vld1q_f32(input + in_offset + image_size);
float32x4x2_t vi = {vi0, vi1};
vst2q_f32(output + out_offset, vi);
in_offset += 4;
out_offset += 8;
}
for (; w < width; ++w) {
for (index_t c = 0; c < 2; ++c) {
output[h * width * 2 + w * 2 + c] =
input[h * width + image_size * c + w];
}
}
}
}
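As a sanity check on the index math in TransposeNHWCToNCHWC3, here is a self-contained scalar re-derivation (hypothetical harness, not part of the commit) on a 2x2 image with 3 channels:

#include <cstdio>

// Scalar reference of the NHWC -> NCHW transform above (C = 3).
void TransposeNHWCToNCHWC3Ref(const float *in, float *out,
                              int height, int width) {
  const int image_size = height * width;
  for (int h = 0; h < height; ++h)
    for (int w = 0; w < width; ++w)
      for (int c = 0; c < 3; ++c)
        out[c * image_size + h * width + w] = in[(h * width + w) * 3 + c];
}

int main() {
  // NHWC input, H = W = 2: pixel k holds {3k, 3k+1, 3k+2}.
  const float nhwc[12] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
  float nchw[12];
  TransposeNHWCToNCHWC3Ref(nhwc, nchw, 2, 2);
  // Expect plane 0: 0 3 6 9, plane 1: 1 4 7 10, plane 2: 2 5 8 11.
  for (int i = 0; i < 12; ++i) printf("%g ", nchw[i]);
  printf("\n");
  return 0;
}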
template<DeviceType D, typename T>
struct TransposeFunctor {
  explicit TransposeFunctor(const std::vector<int> &dims) : dims_(dims) {}
@@ -48,28 +111,48 @@ struct TransposeFunctor {
      }
    }
    } else if (input->dim_size() == 4) {
-      std::vector<index_t>
-          in_stride{input_shape[1] * input_shape[2] * input_shape[3],
-                    input_shape[2] * input_shape[3], input_shape[3], 1};
-      std::vector<index_t>
-          out_stride{output_shape[1] * output_shape[2] * output_shape[3],
-                     output_shape[2] * output_shape[3], output_shape[3], 1};
-
-      std::vector<index_t> idim(4, 0);
-      std::vector<index_t> odim(4, 0);
-      for (odim[0] = 0; odim[0] < output_shape[0]; ++odim[0]) {
-        for (odim[1] = 0; odim[1] < output_shape[1]; ++odim[1]) {
-          for (odim[2] = 0; odim[2] < output_shape[2]; ++odim[2]) {
-            for (odim[3] = 0; odim[3] < output_shape[3]; ++odim[3]) {
-              idim[dims_[0]] = odim[0];
-              idim[dims_[1]] = odim[1];
-              idim[dims_[2]] = odim[2];
-              idim[dims_[3]] = odim[3];
-
-              output_data[odim[0] * out_stride[0] + odim[1] * out_stride[1]
-                  + odim[2] * out_stride[2] + odim[3]] =
-                  input_data[idim[0] * in_stride[0] + idim[1] * in_stride[1]
-                      + idim[2] * in_stride[2] + idim[3]];
+      std::vector<int> transpose_order_from_NHWC_to_NCHW{0, 3, 1, 2};
+      std::vector<int> transpose_order_from_NCHW_to_NHWC{0, 2, 3, 1};
+      index_t batch_size = input->dim(1) * input->dim(2) * input->dim(3);
+      if (dims_ == transpose_order_from_NHWC_to_NCHW && input->dim(3) == 3) {
+        for (index_t b = 0; b < input->dim(0); ++b) {
+          TransposeNHWCToNCHWC3(input_data + b * batch_size,
+                                output_data + b * batch_size,
+                                input->dim(1),
+                                input->dim(2));
+        }
+      } else if (dims_ == transpose_order_from_NCHW_to_NHWC
+          && input->dim(1) == 2) {
+        for (index_t b = 0; b < input->dim(0); ++b) {
+          TransposeNCHWToNHWCC2(input_data + b * batch_size,
+                                output_data + b * batch_size,
+                                input->dim(2),
+                                input->dim(3));
+        }
+      } else {
+        std::vector<index_t>
+            in_stride{input_shape[1] * input_shape[2] * input_shape[3],
+                      input_shape[2] * input_shape[3], input_shape[3], 1};
+        std::vector<index_t>
+            out_stride{output_shape[1] * output_shape[2] * output_shape[3],
+                       output_shape[2] * output_shape[3], output_shape[3], 1};
+
+        std::vector<index_t> idim(4, 0);
+        std::vector<index_t> odim(4, 0);
+        for (odim[0] = 0; odim[0] < output_shape[0]; ++odim[0]) {
+          for (odim[1] = 0; odim[1] < output_shape[1]; ++odim[1]) {
+            for (odim[2] = 0; odim[2] < output_shape[2]; ++odim[2]) {
+              for (odim[3] = 0; odim[3] < output_shape[3]; ++odim[3]) {
+                idim[dims_[0]] = odim[0];
+                idim[dims_[1]] = odim[1];
+                idim[dims_[2]] = odim[2];
+                idim[dims_[3]] = odim[3];
+
+                output_data[odim[0] * out_stride[0] + odim[1] * out_stride[1]
+                    + odim[2] * out_stride[2] + odim[3]] =
+                    input_data[idim[0] * in_stride[0] + idim[1] * in_stride[1]
+                        + idim[2] * in_stride[2] + idim[3]];
+              }
            }
          }
        }
+      }
    }
......
@@ -165,6 +165,13 @@ BM_CONV_2D(1, 32, 256, 256, 3, 3, 1, 4, VALID, 32);
BM_CONV_2D(1, 128, 56, 56, 1, 1, 1, 1, SAME, 128);
BM_CONV_2D(1, 1024, 7, 7, 1, 1, 1, 1, SAME, 1024);
BM_CONV_2D(64, 32, 34, 34, 3, 3, 1, 1, VALID, 32);
BM_CONV_2D(1, 32, 34, 34, 3, 3, 1, 1, VALID, 32);
BM_CONV_2D(1, 32, 256, 256, 1, 15, 1, 1, SAME, 2);
BM_CONV_2D(1, 32, 256, 256, 15, 1, 1, 1, SAME, 2);
BM_CONV_2D(1, 64, 64, 64, 15, 1, 1, 1, SAME, 2);
} // namespace test
} // namespace ops
} // namespace mace
@@ -779,11 +779,17 @@ TEST_F(Conv2dOpTest, OPENCLHalfAlignedConv3x3S12) {

TEST_F(Conv2dOpTest, OPENCLHalfAlignedConv15x1S12) {
  TestHalfComplexConvNxNS12<DeviceType::GPU>({32, 32}, {15, 1, 256, 2},
                                             {1, 1});
TestHalfComplexConvNxNS12<DeviceType::GPU>({64, 64}, {15, 1, 64, 2},
{1, 1});
TestHalfComplexConvNxNS12<DeviceType::GPU>({256, 256}, {15, 1, 32, 2},
{1, 1});
}

TEST_F(Conv2dOpTest, OPENCLHalfAlignedConv1x15S12) {
  TestHalfComplexConvNxNS12<DeviceType::GPU>({32, 32}, {1, 15, 256, 2},
                                             {1, 1});
TestHalfComplexConvNxNS12<DeviceType::GPU>({256, 256}, {1, 15, 32, 2},
{1, 1});
}

TEST_F(Conv2dOpTest, OPENCLHalfAlignedConv7x75S12) {
......
@@ -83,6 +83,9 @@ void TransposeBenchmark(int iters,
#define BM_TRANSPOSE4D(N, C, H, W, D0, D1, D2, D3)                 \
  BM_TRANSPOSE4D_MACRO(N, C, H, W, D0, D1, D2, D3, float, CPU);
BM_TRANSPOSE4D(1, 512, 512, 3, 0, 3, 1, 2);
BM_TRANSPOSE4D(1, 2, 512, 512, 0, 2, 3, 1);
BM_TRANSPOSE4D(1, 64, 64, 512, 0, 3, 1, 2);
BM_TRANSPOSE4D(1, 512, 64, 64, 0, 2, 3, 1);
BM_TRANSPOSE2D(128, 128);
......
@@ -37,16 +37,51 @@ void TransposeNCHWTest(const std::vector<index_t> &input_shape) {
  // Run on cpu
  net.RunOp();

-  net.FillNHWCInputToNCHWInput<DeviceType::CPU, float>("InputNCHW", "Input");
+  net.TransformDataFormat<DeviceType::CPU, float>("Input",
+                                                  DataFormat::NHWC,
+                                                  "InputNCHW",
+                                                  DataFormat::NCHW);

  ExpectTensorNear<float>(*net.GetOutput("InputNCHW"),
                          *net.GetOutput("Output"));
}
void TransposeNHWCTest(const std::vector<index_t> &input_shape) {
// Construct graph
OpsTestNet net;
// Add input data
net.AddRandomInput<CPU, float>("Input", input_shape);
OpDefBuilder("Transpose", "TransposeNHWCTest")
.Input("Input")
.Output("Output")
.AddIntsArg("dims", {0, 2, 3, 1})
.Finalize(net.NewOperatorDef());
// Run on cpu
net.RunOp();
net.TransformDataFormat<DeviceType::CPU, float>("Input",
DataFormat::NCHW,
"InputNHWC",
DataFormat::NHWC);
ExpectTensorNear<float>(*net.GetOutput("InputNHWC"),
*net.GetOutput("Output"));
}
} // namespace

-TEST_F(TransposeOpTest, NCHW) {
+TEST_F(TransposeOpTest, NHWC_to_NCHW) {
  TransposeNCHWTest({3, 64, 64, 128});
  TransposeNCHWTest({1, 64, 48, 128});
TransposeNCHWTest({1, 512, 512, 3});
TransposeNCHWTest({2, 512, 512, 3});
}
TEST_F(TransposeOpTest, NCHW_to_NHWC) {
TransposeNHWCTest({1, 2, 512, 512});
TransposeNHWCTest({1, 3, 512, 512});
TransposeNHWCTest({2, 2, 512, 512});
} }
TEST_F(TransposeOpTest, Rank2) {
......
@@ -40,11 +40,6 @@ FLAGS = None
device_type_map = {'cpu': cvt.DeviceType.CPU.value,
                   'gpu': cvt.DeviceType.GPU.value,
                   'dsp': cvt.DeviceType.HEXAGON.value}
-device_data_type_map = {
-    cvt.DeviceType.CPU.value: mace_pb2.DT_FLOAT,
-    cvt.DeviceType.GPU.value: mace_pb2.DT_HALF,
-    cvt.DeviceType.HEXAGON.value: mace_pb2.DT_UINT8
-}


def file_checksum(fname):
@@ -129,6 +124,17 @@ def main(unused_args):
                                          FLAGS.weight_file)
        output_graph_def = converter.run()

+        if FLAGS.gpu_data_type == 'half':
+            gpu_data_type = mace_pb2.DT_HALF
+        else:
+            gpu_data_type = mace_pb2.DT_FLOAT
+        device_data_type_map = {
+            cvt.DeviceType.CPU.value: mace_pb2.DT_FLOAT,
+            cvt.DeviceType.GPU.value: gpu_data_type,
+            cvt.DeviceType.HEXAGON.value: mace_pb2.DT_UINT8
+        }
+
        print("Transform model to one that can better run on device")
        if not FLAGS.runtime:
            cpu_graph_def = copy.deepcopy(output_graph_def)
@@ -180,7 +186,7 @@ def main(unused_args):
        tensor_util.rename_tensor(output_graph_def)
        tensor_infos, model_data = tensor_util.get_tensor_info_and_model_data(
-            output_graph_def, FLAGS.runtime)
+            output_graph_def, FLAGS.runtime, FLAGS.gpu_data_type)

        source_converter_lib.convert_to_source(
            output_graph_def, model_checksum, weight_checksum, FLAGS.template,
@@ -194,7 +200,10 @@ def main(unused_args):
            f.write(bytearray(model_data))
    if FLAGS.model_load_type == 'pb':
-        tensor_util.del_tensor_data(output_graph_def, FLAGS.runtime)
+        tensor_util.del_tensor_data(
+            output_graph_def, FLAGS.runtime, FLAGS.gpu_data_type)
+        tensor_util.update_tensor_data_type(
+            output_graph_def, FLAGS.runtime, FLAGS.gpu_data_type)
        with open(FLAGS.pb_output, "wb") as f:
            f.write(output_graph_def.SerializeToString())
        # with open(FLAGS.pb_output + '_txt', "wb") as f:
@@ -253,8 +262,6 @@ def parse_args():
        help="e.g., input_node")
    parser.add_argument(
        "--output_node", type=str, default="softmax", help="e.g., softmax")
-    parser.add_argument(
-        "--output_type", type=str, default="pb", help="output type: source/pb")
    parser.add_argument(
        "--template", type=str, default="", help="template path")
    parser.add_argument(
@@ -293,6 +300,8 @@ def parse_args():
        default="source",
        help="[source|pb] Load models in generated `source` code" +
             "or `pb` file.")
+    parser.add_argument(
+        "--gpu_data_type", type=str, default="half", help="half/float")

    return parser.parse_known_args()
......
@@ -153,14 +153,15 @@ class TransformerRule(Enum):
    TRANSFORM_GPU_WINOGRAD = 8
    TRANSFORM_ADD_TO_BIASADD = 9
    FOLD_BIASADD = 10
-    FOLD_ACTIVATION = 11
-    TRANSPOSE_FILTERS = 12
-    RESHAPE_FC_WEIGHT = 13
-    TRANSPOSE_DATA_FORMAT = 14
-    TRANSFORM_GLOBAL_CONV_TO_FC = 15
-    TRANSFORM_BUFFER_IMAGE = 16
-    ADD_DEVICE_AND_DATA_TYPE = 17
-    SORT_BY_EXECUTION = 18
+    FLATTEN_ATROUS_CONV = 11
+    FOLD_ACTIVATION = 12
+    TRANSPOSE_FILTERS = 13
+    RESHAPE_FC_WEIGHT = 14
+    TRANSPOSE_DATA_FORMAT = 15
+    TRANSFORM_GLOBAL_CONV_TO_FC = 16
+    TRANSFORM_BUFFER_IMAGE = 17
+    ADD_DEVICE_AND_DATA_TYPE = 18
+    SORT_BY_EXECUTION = 19


class ConverterInterface(object):
@@ -218,6 +219,7 @@ class ConverterOption(object):
            TransformerRule.TRANSFORM_GPU_WINOGRAD,
            TransformerRule.TRANSFORM_ADD_TO_BIASADD,
            TransformerRule.FOLD_BIASADD,
+            TransformerRule.FLATTEN_ATROUS_CONV,
            TransformerRule.FOLD_ACTIVATION,
            TransformerRule.TRANSPOSE_FILTERS,
            TransformerRule.TRANSPOSE_DATA_FORMAT,
......
@@ -16,6 +16,7 @@
import math
import numpy as np
import tensorflow as tf
+from enum import Enum

from mace.proto import mace_pb2
from mace.python.tools.converter_tool import base_converter
@@ -41,6 +42,50 @@ tf_epsilon_str = 'epsilon'
tf_align_corners = 'align_corners'
tf_block_size = 'block_size'
TFSupportedOps = [
'Conv2D',
'DepthwiseConv2dNative',
'Conv2DBackpropInput',
'BiasAdd',
'Add',
'Sub',
'Mul',
'Div',
'Min',
'Max',
'Neg',
'Abs',
'RealDiv',
'SquaredDifference',
'Pow',
'Relu',
'Relu6',
'Tanh',
'Sigmoid',
'FusedBatchNorm',
'AvgPool',
'MaxPool',
'Squeeze',
'MatMul',
'Identity',
'Reshape',
'Shape',
'Transpose',
'Softmax',
'ResizeBilinear',
'Placeholder',
'SpaceToBatchND',
'BatchToSpaceND',
'DepthToSpace',
'SpaceToDepth',
'Pad',
'ConcatV2',
'Mean',
'Const',
]
TFOpType = Enum('TFOpType', [(op, op) for op in TFSupportedOps], type=str)
class TensorflowConverter(base_converter.ConverterInterface):
    """A class for converting a tensorflow frozen model to a mace model.
@@ -53,71 +98,70 @@ class TensorflowConverter(base_converter.ConverterInterface):
        'FULL': PaddingMode.FULL
    }
    pooling_type_mode = {
        TFOpType.AvgPool.name: PoolingType.AVG,
        TFOpType.MaxPool.name: PoolingType.MAX
    }

    eltwise_type = {
        TFOpType.Add.name: EltwiseType.SUM,
        TFOpType.Sub.name: EltwiseType.SUB,
        TFOpType.Mul.name: EltwiseType.PROD,
        TFOpType.Div.name: EltwiseType.DIV,
        TFOpType.Min.name: EltwiseType.MIN,
        TFOpType.Max.name: EltwiseType.MAX,
        TFOpType.Neg.name: EltwiseType.NEG,
        TFOpType.Abs.name: EltwiseType.ABS,
        TFOpType.RealDiv.name: EltwiseType.DIV,
        TFOpType.SquaredDifference.name: EltwiseType.SQR_DIFF,
        TFOpType.Pow.name: EltwiseType.POW
    }

    activation_type = {
        TFOpType.Relu.name: ActivationType.RELU,
        TFOpType.Relu6.name: ActivationType.RELUX,
        TFOpType.Tanh.name: ActivationType.TANH,
        TFOpType.Sigmoid.name: ActivationType.SIGMOID
    }

    def __init__(self, option, src_model_file):
        self._op_converters = {
            TFOpType.Conv2D.name: self.convert_conv2d,
            TFOpType.DepthwiseConv2dNative.name: self.convert_conv2d,
            TFOpType.Conv2DBackpropInput.name: self.convert_conv2d,
            TFOpType.BiasAdd.name: self.convert_biasadd,
            TFOpType.Add.name: self.convert_add,
            TFOpType.Sub.name: self.convert_elementwise,
            TFOpType.Mul.name: self.convert_elementwise,
            TFOpType.Div.name: self.convert_elementwise,
            TFOpType.Min.name: self.convert_elementwise,
            TFOpType.Max.name: self.convert_elementwise,
            TFOpType.Neg.name: self.convert_elementwise,
            TFOpType.Abs.name: self.convert_elementwise,
            TFOpType.RealDiv.name: self.convert_elementwise,
            TFOpType.SquaredDifference.name: self.convert_elementwise,
            TFOpType.Pow.name: self.convert_elementwise,
            TFOpType.Relu.name: self.convert_activation,
            TFOpType.Relu6.name: self.convert_activation,
            TFOpType.Tanh.name: self.convert_activation,
            TFOpType.Sigmoid.name: self.convert_activation,
            TFOpType.FusedBatchNorm.name: self.convert_fused_batchnorm,
            TFOpType.AvgPool.name: self.convert_pooling,
            TFOpType.MaxPool.name: self.convert_pooling,
            TFOpType.Squeeze.name: self.convert_identity,
            TFOpType.MatMul.name: self.convert_matmul,
            TFOpType.Identity.name: self.convert_identity,
            TFOpType.Reshape.name: self.convert_reshape,
            TFOpType.Shape.name: self.convert_nop,
            TFOpType.Transpose.name: self.convert_transpose,
            TFOpType.Softmax.name: self.convert_softmax,
            TFOpType.ResizeBilinear.name: self.convert_resize_bilinear,
            TFOpType.Placeholder.name: self.convert_nop,
            TFOpType.SpaceToBatchND.name: self.convert_space_batch,
            TFOpType.BatchToSpaceND.name: self.convert_space_batch,
            TFOpType.DepthToSpace.name: self.convert_space_depth,
            TFOpType.SpaceToDepth.name: self.convert_space_depth,
            TFOpType.Pad.name: self.convert_pad,
            TFOpType.ConcatV2.name: self.convert_concat,
            TFOpType.Mean.name: self.convert_mean,
            TFOpType.Const.name: self.convert_nop,
        }
        self._option = option
        self._mace_net_def = mace_pb2.NetDef()
@@ -180,24 +224,29 @@ class TensorflowConverter(base_converter.ConverterInterface):
                       "Mace does not support tensorflow op type %s yet"
                       % tf_op.type)
            self._op_converters[tf_op.type](tf_op)
+        self.convert_tensors()

-    def convert_tensor(self, tf_op):
-        output_name = tf_op.outputs[0].name
-        if output_name not in self._skip_tensor:
-            tensor = self._mace_net_def.tensors.add()
-            tensor.name = tf_op.outputs[0].name
-            tf_tensor = tf_op.outputs[0].eval()
-            tensor.dims.extend(list(tf_tensor.shape))
-
-            tf_dt = tf_op.get_attr('dtype')
-            if tf_dt == tf.float32:
-                tensor.data_type = mace_pb2.DT_FLOAT
-                tensor.float_data.extend(tf_tensor.astype(np.float32).flat)
-            elif tf_dt == tf.int32:
-                tensor.data_type = mace_pb2.DT_INT32
-                tensor.int32_data.extend(tf_tensor.astype(np.int32).flat)
-            else:
-                mace_check(False, "Not supported tensor type: %s" % tf_dt.name)
+    def convert_tensors(self):
+        for tf_op in self._tf_graph.get_operations():
+            if tf_op.type != TFOpType.Const.name:
+                continue
+            output_name = tf_op.outputs[0].name
+            if output_name not in self._skip_tensor:
+                tensor = self._mace_net_def.tensors.add()
+                tensor.name = tf_op.outputs[0].name
+                tf_tensor = tf_op.outputs[0].eval()
+                tensor.dims.extend(list(tf_tensor.shape))
+
+                tf_dt = tf_op.get_attr('dtype')
+                if tf_dt == tf.float32:
+                    tensor.data_type = mace_pb2.DT_FLOAT
+                    tensor.float_data.extend(tf_tensor.astype(np.float32).flat)
+                elif tf_dt == tf.int32:
+                    tensor.data_type = mace_pb2.DT_INT32
+                    tensor.int32_data.extend(tf_tensor.astype(np.int32).flat)
+                else:
+                    mace_check(False,
+                               "Not supported tensor type: %s" % tf_dt.name)
    def add_tensor(self, name, shape, data_type, value):
        tensor = self._mace_net_def.tensors.add()
@@ -229,9 +278,9 @@ class TensorflowConverter(base_converter.ConverterInterface):

    def convert_conv2d(self, tf_op):
        op = self.convert_general_op(tf_op)
-        if tf_op.type == 'DepthwiseConv2dNative':
+        if tf_op.type == TFOpType.DepthwiseConv2dNative.name:
            op.type = MaceOp.DepthwiseConv2d.name
-        elif tf_op.type == 'Conv2DBackpropInput':
+        elif tf_op.type == TFOpType.Conv2DBackpropInput.name:
            op.type = MaceOp.Deconv2D.name
        else:
            op.type = MaceOp.Conv2D.name
@@ -274,7 +323,7 @@ class TensorflowConverter(base_converter.ConverterInterface):
        type_arg.name = MaceKeyword.mace_activation_type_str
        type_arg.s = self.activation_type[tf_op.type].name

-        if tf_op.type == 'Relu6':
+        if tf_op.type == TFOpType.Relu6.name:
            limit_arg = op.arg.add()
            limit_arg.name = MaceKeyword.mace_activation_max_limit_str
            limit_arg.f = 6.0
@@ -335,7 +384,7 @@ class TensorflowConverter(base_converter.ConverterInterface):
        size_arg.name = MaceKeyword.mace_resize_size_str
        size_value = tf_op.inputs[1].eval().astype(np.int32)
        size_arg.ints.extend(size_value)
-        self._skip_tensor.update(tf_op.inputs[1].name)
+        self._skip_tensor.add(tf_op.inputs[1].name)
        align_corners_arg = op.arg.add()
        align_corners_arg.name = MaceKeyword.mace_align_corners_str
        align_corners_arg.i = tf_op.get_attr(tf_align_corners)
@@ -357,7 +406,7 @@ class TensorflowConverter(base_converter.ConverterInterface):
        size_arg.ints.extend(size_value)

        crops_or_paddings_arg = op.arg.add()
-        if op.type == 'BatchToSpaceND':
+        if op.type == TFOpType.BatchToSpaceND.name:
            op.type = MaceOp.BatchToSpaceND.name
            crops_or_paddings_arg.name = \
                MaceKeyword.mace_batch_to_space_crops_str
@@ -367,12 +416,12 @@ class TensorflowConverter(base_converter.ConverterInterface):
        crops_or_paddings_value = tf_op.inputs[2].eval().astype(np.int32).flat
        crops_or_paddings_arg.ints.extend(crops_or_paddings_value)

-        self._skip_tensor.update(tf_op.inputs[1].name)
-        self._skip_tensor.update(tf_op.inputs[2].name)
+        self._skip_tensor.add(tf_op.inputs[1].name)
+        self._skip_tensor.add(tf_op.inputs[2].name)

    def convert_space_depth(self, tf_op):
        op = self.convert_general_op(tf_op)
-        if op.type == 'SpaceToDepth':
+        if op.type == TFOpType.SpaceToDepth.name:
            op.type = MaceOp.SpaceToDepth.name
        else:
            op.type = MaceOp.DepthToSpace.name
@@ -390,14 +439,14 @@ class TensorflowConverter(base_converter.ConverterInterface):
        paddings_arg.name = MaceKeyword.mace_paddings_str
        paddings_value = tf_op.inputs[1].eval().astype(np.int32).flat
        paddings_arg.ints.extend(paddings_value)
-        self._skip_tensor.update(tf_op.inputs[1].name)
+        self._skip_tensor.add(tf_op.inputs[1].name)

        if len(tf_op.inputs) == 3:
            constant_value_arg = op.arg.add()
            constant_value_arg.name = MaceKeyword.mace_constant_value_str
            constant_value = tf_op.inputs[2].eval().astype(np.int32).flat[0]
            constant_value_arg.i = constant_value
-            self._skip_tensor.update(tf_op.inputs[2].name)
+            self._skip_tensor.add(tf_op.inputs[2].name)

    def convert_concat(self, tf_op):
        op = self.convert_general_op(tf_op)
@@ -412,7 +461,7 @@ class TensorflowConverter(base_converter.ConverterInterface):

        mace_check(axis == 3, "only support concat at channel dimension")

-        self._skip_tensor.update(tf_op.inputs[-1].name)
+        self._skip_tensor.add(tf_op.inputs[-1].name)

    def convert_matmul(self, tf_op):
        op = self.convert_general_op(tf_op)
@@ -426,13 +475,13 @@ class TensorflowConverter(base_converter.ConverterInterface):
        shape_arg = op.arg.add()
        shape_arg.name = MaceKeyword.mace_shape_str
        shape_value = []
-        if tf_op.inputs[1].op.type == 'Const':
+        if tf_op.inputs[1].op.type == TFOpType.Const.name:
            shape_value = list(tf_op.inputs[1].eval().astype(np.int32))
            for i in xrange(len(shape_value)):
                if shape_value[i] == -1:
                    shape_value[i] = 1
-            self._skip_tensor.update(tf_op.inputs[-1].name)
-        elif tf_op.inputs[1].op.type == 'Shape':
+            self._skip_tensor.add(tf_op.inputs[-1].name)
+        elif tf_op.inputs[1].op.type == TFOpType.Shape.name:
            shape_value = list(tf_op.inputs[1].op.inputs[0].shape.as_list())
        shape_arg.ints.extend(shape_value)
......
@@ -66,6 +66,8 @@ class Transformer(base_converter.ConverterInterface):
            TransformerRule.TRANSFORM_ADD_TO_BIASADD,
            TransformerRule.FOLD_BIASADD,
            TransformerRule.FOLD_ACTIVATION,
+            TransformerRule.FLATTEN_ATROUS_CONV,
+            TransformerRule.FOLD_ACTIVATION,
            TransformerRule.TRANSPOSE_FILTERS,
            TransformerRule.TRANSPOSE_DATA_FORMAT,
            TransformerRule.TRANSFORM_GLOBAL_CONV_TO_FC,
@@ -93,6 +95,7 @@ class Transformer(base_converter.ConverterInterface):
            TransformerRule.TRANSFORM_ADD_TO_BIASADD:
                self.transform_add_to_biasadd,
            TransformerRule.FOLD_BIASADD: self.fold_biasadd,
+            TransformerRule.FLATTEN_ATROUS_CONV: self.flatten_atrous_conv,
            TransformerRule.FOLD_ACTIVATION: self.fold_activation,
            TransformerRule.TRANSPOSE_FILTERS: self.transpose_filters,
            TransformerRule.TRANSPOSE_DATA_FORMAT: self.transpose_data_format,
@@ -616,6 +619,65 @@ class Transformer(base_converter.ConverterInterface):

        return False
def flatten_atrous_conv(self):
if self._option.device != DeviceType.GPU.value:
return
net = self._model
for op in net.op:
if (op.type == MaceOp.SpaceToBatchND.name
and len(self._consumers.get(op.output[0], [])) == 1):
conv_op = self._consumers.get(op.output[0])[0]
if (conv_op.type == MaceOp.Conv2D.name
or conv_op.type == MaceOp.DepthwiseConv2d.name) \
and len(self._consumers.get(conv_op.output[0], [])) == 1: # noqa
b2s_op = self._consumers.get(conv_op.output[0])[0]
if b2s_op.type == MaceOp.BatchToSpaceND.name:
print "Flatten atrous convolution"
# Add args.
padding_arg_values = ConverterUtil.get_arg(
op,
MaceKeyword.mace_paddings_str).ints
blocks_arg_values = ConverterUtil.get_arg(
b2s_op,
MaceKeyword.mace_space_batch_block_shape_str).ints
dilation_arg = ConverterUtil.get_arg(
conv_op,
MaceKeyword.mace_dilations_str)
if dilation_arg is None:
dilation_arg = conv_op.arg.add()
dilation_arg.name = MaceKeyword.mace_dilations_str
dilation_arg.ints[:] = blocks_arg_values
padding_arg = ConverterUtil.get_arg(
conv_op,
MaceKeyword.mace_padding_str)
if padding_arg is None:
padding_arg = conv_op.arg.add()
padding_arg.name = MaceKeyword.mace_padding_str
if len(padding_arg_values) > 0 \
and padding_arg_values[0] > 0:
padding_arg.i = PaddingMode.SAME.value
else:
padding_arg.i = PaddingMode.VALID.value
strides_arg = ConverterUtil.get_arg(
conv_op,
MaceKeyword.mace_strides_str)
if strides_arg is None:
strides_arg = conv_op.arg.add()
strides_arg.name = MaceKeyword.mace_strides_str
strides_arg.ints[:] = [1, 1]
# update output shape
conv_op.output_shape[0].dims[:] = \
b2s_op.output_shape[0].dims[:]
self.safe_remove_node(op, None)
self.safe_remove_node(b2s_op, conv_op)
return True
return False
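For context on why this flattening is valid (the identity TensorFlow itself uses to implement atrous convolution): SpaceToBatchND with block shape r, a stride-1 convolution, and a closing BatchToSpaceND jointly compute a dilated convolution,

$$ (F *_{r} X)[i] = \sum_{k} F[k] \cdot X[i + r k], $$

so the three ops can be folded into the single Conv2D with the dilations arg set to the block shape and padding derived from the SpaceToBatchND paddings, as the code above does.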
    def fold_activation(self):
        net = self._model
        for op in net.op:
......
@@ -27,7 +27,6 @@ def convert_to_source(net_def, model_checksum, weight_checksum, template_dir,
                      obfuscate, model_tag, output, runtime, embed_model_data,
                      winograd_conv, model_load_type, tensor_infos,
                      model_data):
-
    # Capture our current directory
    print template_dir
......
@@ -105,11 +105,11 @@ def rename_tensor(net_def):

class TensorInfo:
-    def __init__(self, id, t, runtime):
+    def __init__(self, id, t, runtime, gpu_data_type):
        self.id = id
        self.data_type = mace_pb2.DataType.Name(t.data_type)
        if t.data_type == mace_pb2.DT_FLOAT:
-            if runtime == 'gpu':
+            if runtime == 'gpu' and gpu_data_type == 'half':
                self.data_type = mace_pb2.DT_HALF
                self.data = bytearray(
                    np.array(t.float_data).astype(np.float16).tobytes())
@@ -127,13 +127,13 @@ class TensorInfo:
        raise Exception('Tensor data type %s not supported' % t.data_type)


-def get_tensor_info_and_model_data(net_def, runtime):
+def get_tensor_info_and_model_data(net_def, runtime, gpu_data_type):
    model_data = []
    offset = 0
    counter = 0
    tensor_infos = []
    for t in net_def.tensors:
-        tensor_info = TensorInfo(counter, t, runtime)
+        tensor_info = TensorInfo(counter, t, runtime, gpu_data_type)
        tensor_infos.append(tensor_info)
        # align
        if tensor_info.data_type != 'DT_UINT8' and offset % 4 != 0:
@@ -156,15 +156,17 @@ def get_tensor_info_and_model_data(net_def, runtime):
    return tensor_infos, model_data


-def del_tensor_data(net_def, runtime):
+def del_tensor_data(net_def, runtime, gpu_data_type):
    for t in net_def.tensors:
        if t.data_type == mace_pb2.DT_FLOAT:
            del t.float_data[:]
-            if runtime == 'gpu':
-                t.data_type = mace_pb2.DT_HALF
-            else:
-                t.data_type = mace_pb2.DT_FLOAT
        elif t.data_type == mace_pb2.DT_INT32:
            del t.int32_data[:]
        elif t.data_type == mace_pb2.DT_UINT8:
            del t.int32_data[:]
def update_tensor_data_type(net_def, runtime, gpu_data_type):
for t in net_def.tensors:
if t.data_type == mace_pb2.DT_FLOAT and runtime == 'gpu' \
and gpu_data_type == 'half':
t.data_type = mace_pb2.DT_HALF
@@ -538,6 +538,11 @@ def parse_args():
        default="source",
        help="[source|pb] Load models in generated `source` code" +
             "or `pb` file.")
parser.add_argument(
"--gpu_data_type",
type=str,
default="half",
help="[half | float].")
    return parser.parse_known_args()

@@ -809,7 +814,8 @@ def main(unused_args):
                model_config["fast_conv"],
                model_config["obfuscate"],
                model_output_base_dir,
-                FLAGS.model_load_type)
+                FLAGS.model_load_type,
+                FLAGS.gpu_data_type)
        for target_abi in configs["target_abis"]:
            for target_soc in target_socs:
......
@@ -470,7 +470,8 @@ def gen_model_code(model_codegen_dir,
                   fast_conv,
                   obfuscate,
                   model_output_dir,
-                   model_load_type):
+                   model_load_type,
+                   gpu_data_type):
    print("* Generate model code")
    bazel_build_common("//mace/python/tools:converter")
@@ -499,6 +500,7 @@ def gen_model_code(model_codegen_dir,
            "--codegen_output=%s/model.cc" % model_codegen_dir,
            "--pb_output=%s/%s.pb" % (model_output_dir, model_tag),
            "--model_load_type=%s" % model_load_type,
"--gpu_data_type=%s" % gpu_data_type,
            _out=process_output,
            _bg=True,
            _err_to_out=True)
......