Commit feacc76e authored by: Y yejianwu

Merge branch 'master' of v9.git.n.xiaomi.com:deep-computing/mace into refactor_target_deps

......@@ -4,9 +4,9 @@
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
[![Build Status](https://travis-ci.org/travis-ci/travis-web.svg?branch=master)](https://travis-ci.org/travis-ci/travis-web)
[![pipeline status](https://gitlab.com/llhe/mace/badges/master/pipeline.svg)](https://gitlab.com/llhe/mace/pipelines)
[![doc build status](https://readthedocs.org/projects/mace/badge/?version=latest)](https://readthedocs.org/projects/mace/badge/?version=latest)
[![Build Status](https://travis-ci.org/travis-ci/travis-web.svg?branch=master)](https://travis-ci.org/travis-ci/travis-web)
[Documentation](https://mace.readthedocs.io) |
[FAQ](https://mace.readthedocs.io/en/latest/faq.html) |
......@@ -44,9 +44,10 @@ targets:
architectures with limited performance.
## Getting Started
* [Introduction](https://mace.readthedocs.io/en/latest/getting_started/introduction.html)
* [Create a model deployment file](https://mace.readthedocs.io/en/latest/getting_started/create_a_model_deployment.html)
* [How to build](https://mace.readthedocs.io/en/latest/getting_started/how_to_build.html)
* [Introduction](https://mace.readthedocs.io/en/latest/introduction.html)
* [Installation](https://mace.readthedocs.io/en/latest/installation/env_requirement.html)
* [Basic Usage](https://mace.readthedocs.io/en/latest/user_guide/basic_usage.html)
* [Advanced Usage](https://mace.readthedocs.io/en/latest/user_guide/advanced_usage.html)
## Performance
[MACE Model Zoo](https://github.com/XiaoMi/mace-models) contains
......
......@@ -35,9 +35,10 @@
It also supports running on the CPU of systems with a POSIX interface.
## Getting Started
* [Introduction](https://mace.readthedocs.io/en/latest/getting_started/introduction.html)
* [Create a model deployment file](https://mace.readthedocs.io/en/latest/getting_started/create_a_model_deployment.html)
* [How to build](https://mace.readthedocs.io/en/latest/getting_started/how_to_build.html)
* [Introduction](https://mace.readthedocs.io/en/latest/introduction.html)
* [Installation](https://mace.readthedocs.io/en/latest/installation/env_requirement.html)
* [Basic Usage](https://mace.readthedocs.io/en/latest/user_guide/basic_usage.html)
* [Advanced Usage](https://mace.readthedocs.io/en/latest/user_guide/advanced_usage.html)
## Performance
[MACE Model Zoo](https://github.com/XiaoMi/mace-models)
......
......@@ -6,7 +6,7 @@
import recommonmark.parser
import sphinx_rtd_theme
project = u'Mobile AI Compute Engine (MACE)'
project = u'MACE'
author = u'%s Developers' % project
copyright = u'2018, %s' % author
......
......@@ -96,6 +96,10 @@ Add test and benchmark
It's strongly recommended to add unit tests and micro benchmarks for your
new Op. If you wish to contribute back, it's required.
Add Op in model converter
-------------------------
You need to add the new Op to the model converter.
Document the new Op
---------------------
Finally, add an entry to the operator table in the documentation.
How to run tests
=================
To run tests, you need to first cross compile the code, push the binary
to the device and then execute the binary. To automate this process,
MACE provides the `tools/bazel_adb_run.py` tool.
Make sure your device is connected to your dev PC before running tests.
Run unit tests
---------------
MACE uses [gtest](https://github.com/google/googletest) for unit tests.
* Run all unit tests defined in a Bazel target, for example, run `ops_test`:
```sh
python tools/bazel_adb_run.py --target="//mace/ops:ops_test" \
--run_target=True
```
* Run unit tests with [gtest](https://github.com/google/googletest) filter,
for example, run `Conv2dOpTest` unit tests:
```sh
python tools/bazel_adb_run.py --target="//mace/ops:ops_test" \
--run_target=True \
--args="--gtest_filter=Conv2dOpTest*"
```
Run micro benchmarks
--------------------
MACE provides a micro benchmark framework for performance tuning.
* Run all micro benchmarks defined in a Bazel target, for example, run all
`ops_benchmark` micro benchmarks:
```sh
python tools/bazel_adb_run.py --target="//mace/ops:ops_benchmark" \
--run_target=True
```
* Run micro benchmarks with regex filter, for example, run all `CONV_2D` GPU
micro benchmarks:
```sh
python tools/bazel_adb_run.py --target="//mace/ops:ops_benchmark" \
--run_target=True \
--args="--filter=MACE_BM_CONV_2D_.*_GPU"
```
Memory layout
===========================
==============
CPU runtime memory layout
-------------------------
--------------------------
The CPU tensor buffer is organized in the following order:
.. list-table::
:widths: auto
:header-rows: 1
:align: left
* - Tensor type
- Buffer
......@@ -22,7 +20,7 @@ The CPU tensor buffer is organized in the following order:
- W
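As a concrete illustration of a linear CPU buffer layout, the sketch below computes the flat offset of one element in an NCHW-ordered buffer. The NCHW order is used here only as an example assumption; the table above remains the authoritative reference for the layout of each tensor type.
.. code:: cpp
#include <cstdint>
#include <iostream>
// Flat offset of element (n, c, h, w) in a buffer stored in NCHW order
// (used here purely as an illustration of a linear CPU buffer layout).
int64_t OffsetNCHW(int64_t n, int64_t c, int64_t h, int64_t w,
                   int64_t C, int64_t H, int64_t W) {
  return ((n * C + c) * H + h) * W + w;
}
int main() {
  // Example: a 1x3x224x224 tensor; element (0, 2, 10, 5).
  std::cout << OffsetNCHW(0, 2, 10, 5, 3, 224, 224) << std::endl;  // 102597
  return 0;
}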
GPU runtime memory layout
-----------------------------
--------------------------
The GPU runtime implementation is based on OpenCL, which uses 2D images with the CL_RGBA
channel order as tensor storage. This requires OpenCL 1.2 or above.
......@@ -34,14 +32,12 @@ The following tables describe the mapping from different type of tensors to
2D RGBA Image.
Input/Output Tensor
~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~
The Input/Output Tensor is stored in NHWC format:
.. list-table::
:widths: auto
:header-rows: 1
:align: left
* - Tensor type
- Buffer
......@@ -64,9 +60,7 @@ Each pixel of **Image** contains 4 elements. The table below lists the
coordinate relationship between **Image** and **Buffer**.
.. list-table::
:widths: auto
:header-rows: 1
:align: left
* - Tensor type
- Pixel coordinate relationship
......@@ -82,12 +76,10 @@ coordination relation between **Image** and **Buffer**.
- k=[0, 4)
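To make the packing concrete, the sketch below computes the 2D image size for an NHWC input tensor, assuming the convention of image width ``ceil(C/4) * W`` and image height ``N * H`` with each RGBA pixel holding 4 channel values; this formula is an assumption for illustration only, and the tables in this section remain the authoritative mapping.
.. code:: cpp
#include <cstdint>
#include <iostream>
int main() {
  // NHWC input tensor, e.g. one 224x224 RGB image.
  const int64_t N = 1, H = 224, W = 224, C = 3;
  // Channels are packed 4 per RGBA pixel, so they are grouped into
  // ceil(C / 4) blocks (assumed convention, for illustration only).
  const int64_t channel_blocks = (C + 3) / 4;
  const int64_t image_width = channel_blocks * W;  // assumed: ceil(C/4) * W
  const int64_t image_height = N * H;              // assumed: N * H
  std::cout << "image: " << image_width << " x " << image_height
            << " RGBA pixels" << std::endl;  // 224 x 224
  return 0;
}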
Filter Tensor
~~~~~~~~~~~~~
~~~~~~~~~~~~~~
.. list-table::
:widths: auto
:header-rows: 1
:align: left
* - Tensor
- Buffer
......@@ -106,9 +98,7 @@ Each pixel of **Image** contains 4 elements. The table below lists the
coordinate relationship between **Image** and **Buffer**.
.. list-table::
:widths: auto
:header-rows: 1
:align: left
* - Tensor type
- Pixel coordinate relationship
......@@ -121,12 +111,10 @@ coordination relation between **Image** and **Buffer**.
- only support multiplier == 1, k=[0, 4)
1-D Argument Tensor
~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~
.. list-table::
:widths: auto
:header-rows: 1
:align: left
* - Tensor type
- Buffer
......@@ -141,9 +129,7 @@ Each pixel of **Image** contains 4 elements. The table below lists the
coordinate relationship between **Image** and **Buffer**.
.. list-table::
:widths: auto
:header-rows: 1
:align: left
* - Tensor type
- Pixel coordinate relationship
......
Create a model deployment file
==============================
The first step to deploy your models is to create a YAML model deployment
file.
One deployment file describes one model deployment case;
each file will generate one static library (if more than one ABI is specified,
there will be one static library for each). The deployment file can contain
one or more models; for example, a smart camera application may contain face
recognition, object recognition, and voice recognition models, which can all be
defined in one deployment file.
Example
----------
Here is an example deployment file used by an Android demo application.
TODO: change this example file to the demo deployment file
(reuse the same file) and rename it to a reasonable name.
.. literalinclude:: models/demo_app_models.yaml
:language: yaml
Configurations
--------------------
.. list-table::
:widths: auto
:header-rows: 1
:align: left
* - library_name
- library name.
* - target_abis
- The target ABI to build, can be one or more of 'host', 'armeabi-v7a' or 'arm64-v8a'.
* - target_socs
- [optional] Build for the specified SoCs if you only want to use the model on those SoCs.
* - embed_model_data
- Whether to embed the model weights in the code, defaults to 0.
* - build_type
- Model build type, one of ['proto', 'code']: 'proto' converts the model to a ProtoBuf file and 'code' converts the model to C++ code.
* - linkshared
- [optional] Use dynamic linking for the libmace library when set to 1, or static linking when set to 0, defaults to 0.
* - model_name
- Model name, should be unique if there are multiple models.
**LIMIT: if build_type is code, model_name will be used in C++ code, so model_name must comply with C++ naming rules.**
* - platform
- The source framework, one of [tensorflow, caffe].
* - model_file_path
- The path of the model file, can be local or remote.
* - model_sha256_checksum
- The SHA256 checksum of the model file.
* - weight_file_path
- [optional] The path of the model weights file, used by Caffe model.
* - weight_sha256_checksum
- [optional] The SHA256 checksum of the weight file, used by Caffe model.
* - subgraphs
- subgraphs key. **DO NOT EDIT**
* - input_tensors
- The input tensor names (TensorFlow) or the top names of the inputs' layers (Caffe); one or more strings.
* - output_tensors
- The output tensor names (TensorFlow) or the top names of the outputs' layers (Caffe); one or more strings.
* - input_shapes
- The shapes of the input tensors, in NHWC order.
* - output_shapes
- The shapes of the output tensors, in NHWC order.
* - input_ranges
- The numerical range of the input tensors, default [-1, 1]. It is only used for testing.
* - validation_inputs_data
- [optional] Specify Numpy validation inputs. When not provided, [-1, 1] random values will be used.
* - runtime
- The running device, one of [cpu, gpu, dsp, cpu_gpu]. cpu_gpu contains both CPU and GPU model definitions so you can run the model on both CPU and GPU.
* - data_type
- [optional] The data type used for specified runtime. [fp16_fp32, fp32_fp32] for GPU, default is fp16_fp32. [fp32] for CPU. [uint8] for DSP.
* - limit_opencl_kernel_time
- [optional] Whether to limit each OpenCL kernel to roughly 1 ms (by splitting kernels) to keep the UI responsive, defaults to 0.
* - nnlib_graph_mode
- [optional] Controls DSP precision and performance; the default of 0 usually works for most cases.
* - obfuscate
- [optional] Whether to obfuscate the model operator names, defaults to 0.
* - winograd
- [optional] Whether to enable Winograd convolution, **which will increase memory consumption**.
How to build
============
Supported Platforms
-------------------
.. list-table::
:widths: auto
:header-rows: 1
:align: left
* - Platform
- Explanation
* - TensorFlow
- >= 1.6.0.
* - Caffe
- >= 1.0.
Environment Requirement
-------------------------
MACE requires the following dependencies:
.. list-table::
:widths: auto
:header-rows: 1
:align: left
* - software
- version
- install command
* - bazel
- >= 0.13.0
- `bazel installation guide <https://docs.bazel.build/versions/master/install.html>`__
* - android-ndk
- r15c/r16b
- `NDK installation guide <https://developer.android.com/ndk/guides/setup#install>`__ or refer to the Dockerfile
* - adb
- >= 1.0.32
- apt-get install android-tools-adb
* - tensorflow
- >= 1.6.0
- pip install -I tensorflow==1.6.0 (if you use a TensorFlow model)
* - numpy
- >= 1.14.0
- pip install -I numpy==1.14.0
* - scipy
- >= 1.0.0
- pip install -I scipy==1.0.0
* - jinja2
- >= 2.10
- pip install -I jinja2==2.10
* - PyYaml
- >= 3.12.0
- pip install -I pyyaml==3.12
* - sh
- >= 1.12.14
- pip install -I sh==1.12.14
* - filelock
- >= 3.0.0
- pip install -I filelock==3.0.0
* - docker (for caffe)
- >= 17.09.0-ce
- `docker installation guide <https://docs.docker.com/install/linux/docker-ce/ubuntu/#set-up-the-repository>`__
.. note::
Use ``export ANDROID_NDK_HOME=/path/to/ndk`` to specify ANDROID_NDK_HOME.
MACE provides a Dockerfile with these dependencies installed;
you can build the image from it,
.. code:: sh
docker build -t registry.cn-hangzhou.aliyuncs.com/xiaomimace/mace-dev-lite ./docker/mace-dev-lite
or pull the pre-built image from Docker Hub,
.. code:: sh
docker pull registry.cn-hangzhou.aliyuncs.com/xiaomimace/mace-dev-lite
and then run the container with the following command.
.. code:: sh
# Create container
# Set 'host' network to use ADB
docker run -it --privileged -v /dev/bus/usb:/dev/bus/usb --net=host \
-v /local/path:/container/path \
registry.cn-hangzhou.aliyuncs.com/xiaomimace/mace-dev-lite \
/bin/bash
Usage
--------
=======================================
1. Pull MACE source code
=======================================
.. code:: sh
git clone https://github.com/XiaoMi/mace.git
git fetch --all --tags --prune
# Checkout the latest tag (i.e. release version)
tag_name=`git describe --abbrev=0 --tags`
git checkout tags/${tag_name}
.. note::
It's highly recommended to use a release version instead of the master branch.
============================
2. Model Preprocessing
============================
- TensorFlow
TensorFlow provides
`Graph Transform Tool <https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/graph_transforms/README.md>`__
to improve inference efficiency by making various optimizations like op
folding, redundant node removal, etc. It's strongly recommended to make these
optimizations before the graph conversion step.
The following commands show the suggested graph transformations and
optimizations for different runtimes,
.. code:: sh
# CPU/GPU:
./transform_graph \
--in_graph=tf_model.pb \
--out_graph=tf_model_opt.pb \
--inputs='input' \
--outputs='output' \
--transforms='strip_unused_nodes(type=float, shape="1,64,64,3")
strip_unused_nodes(type=float, shape="1,64,64,3")
remove_nodes(op=Identity, op=CheckNumerics)
fold_constants(ignore_errors=true)
flatten_atrous_conv
fold_batch_norms
fold_old_batch_norms
strip_unused_nodes
sort_by_execution_order'
.. code:: sh
# DSP:
./transform_graph \
--in_graph=tf_model.pb \
--out_graph=tf_model_opt.pb \
--inputs='input' \
--outputs='output' \
--transforms='strip_unused_nodes(type=float, shape="1,64,64,3")
strip_unused_nodes(type=float, shape="1,64,64,3")
remove_nodes(op=Identity, op=CheckNumerics)
fold_constants(ignore_errors=true)
fold_batch_norms
fold_old_batch_norms
backport_concatv2
quantize_weights(minimum_size=2)
quantize_nodes
strip_unused_nodes
sort_by_execution_order'
- Caffe
The MACE converter only supports Caffe 1.0+; upgrade
your models with Caffe's built-in tools when necessary,
.. code:: bash
# Upgrade prototxt
$CAFFE_ROOT/build/tools/upgrade_net_proto_text MODEL.prototxt MODEL.new.prototxt
# Upgrade caffemodel
$CAFFE_ROOT/build/tools/upgrade_net_proto_binary MODEL.caffemodel MODEL.new.caffemodel
==============================
3. Build static/shared library
==============================
-----------------
3.1 Overview
-----------------
MACE can build either a static or a shared library (specified by
``linkshared`` in the YAML model deployment file).
The following are two use cases.
* **Build well tuned library for specific SoCs**
When ``target_socs`` is specified in YAML model deployment file, the build
tool will enable automatic tuning for GPU kernels. This usually takes some
time to finish depending on the complexity of your model.
.. note::
You should plug in device(s) with the corresponding SoC(s).
* **Build generic library for all SoCs**
When ``target_socs`` is not specified, the generated library is compatible
with general devices.
.. note::
There will be around a 1~10% performance drop for the GPU
runtime compared to the well-tuned library.
MACE provides a command line tool (``tools/converter.py``) for
model conversion, compilation, test runs, benchmarking and correctness validation.
.. note::
1. ``tools/converter.py`` should be run from the root directory of this project.
2. When ``linkshared`` is set to ``1``, ``build_type`` should be ``proto``.
Currently only Android devices are supported.
------------------------------------------
3.2 \ ``tools/converter.py``\ usage
------------------------------------------
**Commands**
* **build**
Build the library and test tools.
.. code:: sh
# Build library
python tools/converter.py build --config=models/config.yaml
* **run**
Run the model(s).
.. code:: sh
# Test model run time
python tools/converter.py run --config=models/config.yaml --round=100
# Validate the correctness by comparing the results against the
# original model and framework, measured with cosine distance for similarity.
python tools/converter.py run --config=models/config.yaml --validate
# Check the memory usage of the model (**keep only one model in the configuration file**)
python tools/converter.py run --config=models/config.yaml --round=10000 &
sleep 5
adb shell dumpsys meminfo | grep mace_run
kill %1
.. warning::
``run`` relies on the ``build`` command; you should ``run`` after ``build``.
* **benchmark**
Benchmark and profile the model.
.. code:: sh
# Benchmark model, get detailed statistics of each Op.
python tools/converter.py benchmark --config=models/config.yaml
.. warning::
``benchmark`` relies on the ``build`` command; you should ``benchmark`` after ``build``.
**Common arguments**
.. list-table::
:widths: auto
:header-rows: 1
:align: left
* - option
- type
- default
- commands
- explanation
* - --omp_num_threads
- int
- -1
- ``run``/``benchmark``
- number of threads
* - --cpu_affinity_policy
- int
- 1
- ``run``/``benchmark``
- 0:AFFINITY_NONE/1:AFFINITY_BIG_ONLY/2:AFFINITY_LITTLE_ONLY
* - --gpu_perf_hint
- int
- 3
- ``run``/``benchmark``
- 0:DEFAULT/1:LOW/2:NORMAL/3:HIGH
* - --gpu_priority_hint
- int
- 3
- ``run``/``benchmark``
- 0:DEFAULT/1:LOW/2:NORMAL/3:HIGH
Use ``-h`` to get detailed help.
.. code:: sh
python tools/converter.py -h
python tools/converter.py build -h
python tools/converter.py run -h
python tools/converter.py benchmark -h
=============
4. Deployment
=============
The ``build`` command will generate the static/shared library, model files and
header files, and package them as
``build/${library_name}/libmace_${library_name}.tar.gz``.
- The generated ``static`` libraries are organized as follows,
.. code::
build/
└── mobilenet-v2-gpu
├── include
│   └── mace
│   └── public
│   ├── mace.h
│   └── mace_runtime.h
├── libmace_mobilenet-v2-gpu.tar.gz
├── lib
│   ├── arm64-v8a
│   │   └── libmace_mobilenet-v2-gpu.MI6.msm8998.a
│   └── armeabi-v7a
│   └── libmace_mobilenet-v2-gpu.MI6.msm8998.a
├── model
│   ├── mobilenet_v2.data
│   └── mobilenet_v2.pb
└── opencl
├── arm64-v8a
│   └── mobilenet-v2-gpu_compiled_opencl_kernel.MI6.msm8998.bin
└── armeabi-v7a
└── mobilenet-v2-gpu_compiled_opencl_kernel.MI6.msm8998.bin
- The generated ``shared`` libraries are organized as follows,
.. code::
build
└── mobilenet-v2-gpu
├── include
│   └── mace
│   └── public
│   ├── mace.h
│   └── mace_runtime.h
├── lib
│   ├── arm64-v8a
│   │   ├── libgnustl_shared.so
│   │   └── libmace.so
│   └── armeabi-v7a
│   ├── libgnustl_shared.so
│   └── libmace.so
├── model
│   ├── mobilenet_v2.data
│   └── mobilenet_v2.pb
└── opencl
├── arm64-v8a
│   └── mobilenet-v2-gpu_compiled_opencl_kernel.MI6.msm8998.bin
└── armeabi-v7a
└── mobilenet-v2-gpu_compiled_opencl_kernel.MI6.msm8998.bin
.. note::
1. DSP runtime depends on ``libhexagon_controller.so``.
2. ``${MODEL_TAG}.pb`` file will be generated only when ``build_type`` is ``proto``.
3. ``${library_name}_compiled_opencl_kernel.${device_name}.${soc}.bin`` will
be generated only when ``target_socs`` and ``gpu`` runtime are specified.
4. Generated shared library depends on ``libgnustl_shared.so``.
.. warning::
``${library_name}_compiled_opencl_kernel.${device_name}.${soc}.bin`` depends
on the OpenCL version of the device; you should maintain compatibility or
configure the compiled-kernel cache store with ``ConfigKVStorageFactory``.
=========================================
5. How to use the library in your project
=========================================
Please refer to \ ``mace/examples/example.cc``\ for full usage. The following lists the key steps.
.. code:: cpp
// Include the headers
#include "mace/public/mace.h"
#include "mace/public/mace_runtime.h"
// If the build_type is code
#include "mace/public/mace_engine_factory.h"
// 0. Set pre-compiled OpenCL binary program file paths when available
if (device_type == DeviceType::GPU) {
mace::SetOpenCLBinaryPaths(opencl_binary_paths);
}
// 1. Set compiled OpenCL kernel cache, this is used to reduce the
// initialization time since compiling is slow. It's suggested
// to set this even when a pre-compiled OpenCL program file is provided,
// because an OpenCL version upgrade may also lead to kernel
// recompilation.
const std::string file_path = "path/to/opencl_cache_file";
std::shared_ptr<KVStorageFactory> storage_factory(
new FileStorageFactory(file_path));
ConfigKVStorageFactory(storage_factory);
// 2. Declare the device type (must be the same as ``runtime`` in the configuration file)
DeviceType device_type = DeviceType::GPU;
// 3. Define the input and output tensor names.
std::vector<std::string> input_names = {...};
std::vector<std::string> output_names = {...};
// 4. Create MaceEngine instance
std::shared_ptr<mace::MaceEngine> engine;
MaceStatus create_engine_status;
// Create Engine from compiled code
create_engine_status =
CreateMaceEngineFromCode(model_name.c_str(),
nullptr,
input_names,
output_names,
device_type,
&engine);
// Create Engine from model file
create_engine_status =
CreateMaceEngineFromProto(model_pb_data,
model_data_file.c_str(),
input_names,
output_names,
device_type,
&engine);
if (create_engine_status != MaceStatus::MACE_SUCCESS) {
// Report error
}
// 5. Create Input and Output tensor buffers
std::map<std::string, mace::MaceTensor> inputs;
std::map<std::string, mace::MaceTensor> outputs;
for (size_t i = 0; i < input_count; ++i) {
// Allocate input and output
int64_t input_size =
std::accumulate(input_shapes[i].begin(), input_shapes[i].end(), 1,
std::multiplies<int64_t>());
auto buffer_in = std::shared_ptr<float>(new float[input_size],
std::default_delete<float[]>());
// Load input here
// ...
inputs[input_names[i]] = mace::MaceTensor(input_shapes[i], buffer_in);
}
for (size_t i = 0; i < output_count; ++i) {
int64_t output_size =
std::accumulate(output_shapes[i].begin(), output_shapes[i].end(), 1,
std::multiplies<int64_t>());
auto buffer_out = std::shared_ptr<float>(new float[output_size],
std::default_delete<float[]>());
outputs[output_names[i]] = mace::MaceTensor(output_shapes[i], buffer_out);
}
// 6. Run the model
MaceStatus status = engine->Run(inputs, &outputs);
Introduction
============
Mobile AI Compute Engine (MACE) is a deep learning inference framework optimized for
mobile heterogeneous computing platforms. The following figure shows the
overall architecture.
.. image:: mace-arch.png
:scale: 40 %
:align: center
Model format
------------
MACE defines a customized model format which is similar to
Caffe2. The MACE model can be converted from exported models by TensorFlow
and Caffe. A YAML file is used to describe the model deployment details. In the
next chapter, there is a detailed guide showing how to create this YAML file.
Model conversion
----------------
Currently, we provide model converters for TensorFlow and Caffe; more
frameworks will be supported in the future.
Model loading
-------------
The MACE model format contains two parts: the model graph definition and
the model parameter tensors. The graph part utilizes Protocol Buffers
for serialization. All the model parameter tensors are concatenated
together into a contiguous byte array, and we call this array tensor data in
the following paragraphs. In the model graph, the tensor data offsets
and lengths are recorded.
The models can be loaded in 3 ways:
1. Both model graph and tensor data are dynamically loaded externally
(by default, from the file system, but users are free to choose their own
implementations, for example, with compression or encryption). This
approach provides the most flexibility but the weakest model protection.
2. Both model graph and tensor data are converted into C++ code and loaded
by executing the compiled code. This approach provides the strongest
model protection and simplest deployment.
3. The model graph is converted into C++ code and constructed as in the second
approach, and the tensor data is loaded externally as in the first approach.
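As an illustration of the first loading approach, the sketch below reads the serialized graph from the file system and creates the engine with ``CreateMaceEngineFromProto``, the same call shown in the build documentation. The file paths, the ``ReadBinaryFile`` helper and the CPU device choice are placeholders for illustration, and error handling is kept minimal.
.. code:: cpp
#include <fstream>
#include <iterator>
#include <memory>
#include <string>
#include <vector>
#include "mace/public/mace.h"
#include "mace/public/mace_runtime.h"
// Placeholder helper: read a whole file into memory (illustration only).
std::vector<unsigned char> ReadBinaryFile(const std::string &path) {
  std::ifstream ifs(path, std::ios::binary);
  return std::vector<unsigned char>((std::istreambuf_iterator<char>(ifs)),
                                    std::istreambuf_iterator<char>());
}
// First loading approach: both the graph (.pb) and the tensor data (.data)
// live on the file system and are loaded at runtime.
std::shared_ptr<mace::MaceEngine> LoadFromFiles(
    const std::vector<std::string> &input_names,
    const std::vector<std::string> &output_names) {
  std::vector<unsigned char> model_pb_data =
      ReadBinaryFile("/path/to/model.pb");          // graph definition
  std::shared_ptr<mace::MaceEngine> engine;
  mace::MaceStatus status = mace::CreateMaceEngineFromProto(
      model_pb_data, "/path/to/model.data",         // tensor data file
      input_names, output_names, mace::DeviceType::CPU, &engine);
  // Check status against MaceStatus::MACE_SUCCESS before using the engine.
  (void)status;
  return engine;
}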
......@@ -6,21 +6,37 @@ The main documentation is organized into the following sections:
.. toctree::
:maxdepth: 1
:caption: Getting started
:name: sec-start
:caption: Introduction
:name: sec-intro
getting_started/introduction
getting_started/create_a_model_deployment
getting_started/how_to_build
getting_started/op_lists
introduction
.. toctree::
:maxdepth: 1
:caption: Development
:caption: Installation
:name: sec-install
installation/env_requirement
installation/using_docker
installation/manual_setup
.. toctree::
:maxdepth: 1
:caption: User guide
:name: sec-user
user_guide/basic_usage
user_guide/advanced_usage
user_guide/op_lists
.. toctree::
:maxdepth: 1
:caption: Developer guide
:name: sec-devel
development/contributing
development/adding_a_new_op
development/how_to_run_tests
development/memory_layout
.. toctree::
......
Environment requirement
========================
MACE requires the following dependencies:
Required dependencies
---------------------
.. list-table::
:header-rows: 1
* - Software
- Installation command
- Tested version
* - Python
-
- 2.7
* - Bazel
- `bazel installation guide <https://docs.bazel.build/versions/master/install.html>`__
- 0.13.0
* - CMake
- apt-get install cmake
- >= 3.11.3
* - Jinja2
- pip install -I jinja2==2.10
- 2.10
* - PyYaml
- pip install -I pyyaml==3.12
- 3.12.0
* - sh
- pip install -I sh==1.12.14
- 1.12.14
Optional dependencies
---------------------
.. list-table::
:header-rows: 1
* - Software
- Installation command
- Remark
* - Android NDK
- `NDK installation guide <https://developer.android.com/ndk/guides/setup#install>`__
- Required by Android build, r15b, r15c, r16b
* - ADB
- apt-get install android-tools-adb
- Required by Android run, >= 1.0.32
* - TensorFlow
- pip install -I tensorflow==1.6.0
- Required by TensorFlow model
* - Docker
- `docker installation guide <https://docs.docker.com/install/linux/docker-ce/ubuntu/#set-up-the-repository>`__
- Required by docker mode for Caffe model
* - Numpy
- pip install -I numpy==1.14.0
- Required by model validation
* - Scipy
- pip install -I scipy==1.0.0
- Required by model validation
* - FileLock
- pip install -I filelock==3.0.0
- Required by Android run
.. note::
For Android builds, `ANDROID_NDK_HOME` must be configured using ``export ANDROID_NDK_HOME=/path/to/ndk``
Manual setup
=============
The setup steps are based on ``Ubuntu``; you can adapt the commands
accordingly for other systems.
For the detailed installation dependencies, please refer to :doc:`env_requirement`.
Install Bazel
-------------
Bazel ``0.13.0`` or newer is recommended (refer to the `Bazel documentation <https://docs.bazel.build/versions/master/install.html>`__).
.. code:: sh
export BAZEL_VERSION=0.13.1
mkdir /bazel && \
cd /bazel && \
wget https://github.com/bazelbuild/bazel/releases/download/$BAZEL_VERSION/bazel-$BAZEL_VERSION-installer-linux-x86_64.sh && \
chmod +x bazel-*.sh && \
./bazel-$BAZEL_VERSION-installer-linux-x86_64.sh && \
cd / && \
rm -f /bazel/bazel-$BAZEL_VERSION-installer-linux-x86_64.sh
Install Android NDK
--------------------
The recommended Android NDK versions include r15b, r15c and r16b (refer to the
`NDK installation guide <https://developer.android.com/ndk/guides/setup#install>`__).
.. code:: sh
# Download NDK r15c
cd /opt/ && \
wget -q https://dl.google.com/android/repository/android-ndk-r15c-linux-x86_64.zip && \
unzip -q android-ndk-r15c-linux-x86_64.zip && \
rm -f android-ndk-r15c-linux-x86_64.zip
export ANDROID_NDK_VERSION=r15c
export ANDROID_NDK=/opt/android-ndk-${ANDROID_NDK_VERSION}
export ANDROID_NDK_HOME=${ANDROID_NDK}
# add to PATH
export PATH=${PATH}:${ANDROID_NDK_HOME}
Install extra tools
--------------------
.. code:: sh
apt-get install -y --no-install-recommends \
cmake \
android-tools-adb
pip install -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com setuptools
pip install -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com \
"numpy>=1.14.0" \
scipy \
jinja2 \
pyyaml \
sh==1.12.14 \
pycodestyle==2.4.0 \
filelock
Install TensorFlow (Optional)
------------------------------
.. code:: sh
pip install -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com tensorflow==1.6.0
Install Caffe (Optional)
-------------------------
Please follow the installation instructions of `Caffe <http://caffe.berkeleyvision.org/installation.html>`__.
Using docker
=============
Pull or build docker image
---------------------------
MACE provides Docker images with the dependencies installed, as well as Dockerfiles for building the images;
you can pull the existing images directly or build them from the Dockerfiles.
In most cases, the ``lite edition`` image satisfies a developer's basic needs.
.. note::
It's highly recommended to pull the pre-built images.
- ``lite edition`` docker image.
.. code:: sh
# Pull lite edition docker image
docker pull registry.cn-hangzhou.aliyuncs.com/xiaomimace/mace-dev-lite
# Build lite edition docker image
docker build -t registry.cn-hangzhou.aliyuncs.com/xiaomimace/mace-dev-lite ./docker/mace-dev-lite
- ``full edition`` docker image (which contains multiple NDK versions and other dev tools).
.. code:: sh
# Pull full edition docker image
docker pull registry.cn-hangzhou.aliyuncs.com/xiaomimace/mace-dev
# Build full edition docker image
docker build -t registry.cn-hangzhou.aliyuncs.com/xiaomimace/mace-dev ./docker/mace-dev
.. note::
The following steps use the lite edition image.
Using the image
-----------------
Create a container with the following command:
.. code:: sh
# Create a container named `mace-dev`
docker run -it --privileged -d --name mace-dev \
-v /dev/bus/usb:/dev/bus/usb --net=host \
-v /local/path:/container/path \
registry.cn-hangzhou.aliyuncs.com/xiaomimace/mace-dev-lite
# Execute an interactive bash shell on the container
docker exec -it mace-dev /bin/bash
Introduction
============
MACE (Mobile AI Compute Engine) is a deep learning inference framework optimized for
mobile heterogeneous computing platforms.
MACE provides tools and documentation to help users deploy deep learning models
to mobile phones, tablets, personal computers and IoT devices.
Architecture
-------------
The following figure shows the overall architecture.
.. image:: mace-arch.png
:scale: 40 %
:align: center
MACE Model
~~~~~~~~~~
MACE defines a customized model format which is similar to
Caffe2. The MACE model can be converted from exported models by TensorFlow
and Caffe.
MACE Interpreter
~~~~~~~~~~~~~~~~~
The MACE Interpreter mainly parses the NN graph and manages the tensors in the graph.
Runtime
~~~~~~~
The CPU/GPU/DSP runtimes contain the Op implementations for the corresponding devices.
Workflow
--------
The following figure shows the basic workflow of MACE.
.. image:: mace-work-flow.png
:scale: 60 %
:align: center
1. Configure model deployment file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The model deployment configuration file (.yml) describes the information of the model and the library;
MACE will build the library based on this file.
2. Build libraries
~~~~~~~~~~~~~~~~~~
Build MACE dynamic or static libraries.
3. Convert model
~~~~~~~~~~~~~~~~~~
Convert TensorFlow or Caffe model to MACE model.
4.1. Deploy
~~~~~~~~~~~~~~~~~~
Integrate the MACE library into your application and run with MACE API.
4.2. Run (CLI)
~~~~~~~~~~~~~~~~~~
MACE provides the `mace_run` command line tool, which can be used to run models
and validate model correctness against the original TensorFlow or Caffe results.
4.3. Benchmark
~~~~~~~~~~~~~~~~~~
MACE provides a benchmark tool to obtain Op-level profiling results for the model.
Introduction
------------
Mobile AI Compute Engine (MACE) is a deep learning inference framework optimized for mobile heterogeneous computing devices.
MACE covers common mobile computing devices (CPU, GPU and DSP) and provides a complete toolchain and documentation,
so users can easily deploy deep learning models on mobile devices. MACE is widely used inside Xiaomi and has been thoroughly validated to deliver industry-leading performance and stability.
Architecture
------------
The following figure shows the basic architecture of MACE.
.. image:: mace-arch.png
:scale: 40 %
:align: center
MACE Model
~~~~~~~~~~~~~~~~~~
MACE defines its own model format (similar to Caffe2). Caffe and TensorFlow models can be converted into MACE models
with the tools provided by MACE.
MACE Interpreter
~~~~~~~~~~~~~~~~~~
The MACE Interpreter is mainly responsible for parsing and running the neural network graph (DAG) and managing the tensors in the network.
Runtime
~~~~~~~~~~~~~~~~~~
The CPU/GPU/DSP runtimes provide the operator implementations for each computing device.
Workflow
------------
The following figure shows the basic workflow of using MACE.
.. image:: mace-work-flow-zh.png
:scale: 60 %
:align: center
1. Configure the model deployment file (.yml)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The model deployment file describes in detail the models to be deployed and the library to generate; MACE generates the corresponding library files based on this file.
2. Build the MACE library
~~~~~~~~~~~~~~~~~~~~~~~~~~
Build the MACE static or shared library.
3. Convert the model
~~~~~~~~~~~~~~~~~~~~~
Convert a TensorFlow or Caffe model to a MACE model.
4.1. Deploy
~~~~~~~~~~~~~~~~~~
Integrate the library files generated in the build stage according to your use case, then call the corresponding MACE APIs to run the model.
4.2. Run from the command line
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
MACE provides a command line tool that can run models from the command line and can be used to test a model's run time, memory usage and correctness.
4.3. Benchmark
~~~~~~~~~~~~~~~~~~
MACE provides a command line benchmark tool that reports the run time of every operator in the model at a fine granularity.
Advanced usage
===============
This part contains the full usage of MACE.
Overview
---------
As mentioned in the previous part, a model deployment file defines a case of model deployment.
The building process includes parsing the model deployment file, converting the models,
building the MACE core library and packing the generated model libraries.
Deployment file
---------------
A deployment file normally generates one library, but if more than one ABI is specified,
one library will be generated for each ABI.
A deployment file can also contain multiple models. For example, an AI camera application may
contain face recognition, object recognition, and voice recognition models, all of which can be defined
in one deployment file.
* **Example**
Here is an example deployment file with two models.
.. literalinclude:: models/demo_models.yml
:language: yaml
* **Configurations**
.. list-table::
:header-rows: 1
* - Options
- Usage
* - library_name
- Library name.
* - target_abis
- The target ABI(s) to build, can be 'host', 'armeabi-v7a' or 'arm64-v8a'.
If more than one ABI is needed, separate them with commas.
* - target_socs
- [optional] Build for specific SoCs.
* - model_graph_format
- Model graph format, either 'file' or 'code': 'file' converts the model graph to a ProtoBuf file (.pb) and 'code' converts the model graph to C++ code.
* - model_data_format
- Model data format, either 'file' or 'code': 'file' converts the model weights to a data file (.data) and 'code' converts the model weights to C++ code.
* - model_name
- Model name, should be unique if there is more than one model.
**LIMIT: if model_graph_format is code, model_name will be used in C++ code, so model_name must comply with C++ naming rules.**
* - platform
- The source framework, tensorflow or caffe.
* - model_file_path
- The path of your model file, which can be a local path or a remote URL.
* - model_sha256_checksum
- The SHA256 checksum of the model file.
* - weight_file_path
- [optional] The path of Caffe model weights file.
* - weight_sha256_checksum
- [optional] The SHA256 checksum of Caffe model weights file.
* - subgraphs
- subgraphs key. **DO NOT EDIT**
* - input_tensors
- The input tensor name(s) (TensorFlow) or the top name(s) of the inputs' layer (Caffe).
If there is more than one tensor, use one line per tensor.
* - output_tensors
- The output tensor name(s) (TensorFlow) or the top name(s) of the outputs' layer (Caffe).
If there is more than one tensor, use one line per tensor.
* - input_shapes
- The shapes of the input tensors, in NHWC order.
* - output_shapes
- The shapes of the output tensors, in NHWC order.
* - input_ranges
- The numerical range of the input tensors' data, default [-1, 1]. It is only used for testing.
* - validation_inputs_data
- [optional] Specify Numpy validation inputs. When not provided, [-1, 1] random values will be used.
* - runtime
- The running device, one of [cpu, gpu, dsp, cpu_gpu]. cpu_gpu contains both CPU and GPU model definitions so you can run the model on both CPU and GPU.
* - data_type
- [optional] The data type used for specified runtime. [fp16_fp32, fp32_fp32] for GPU, default is fp16_fp32, [fp32] for CPU and [uint8] for DSP.
* - limit_opencl_kernel_time
- [optional] Whether to limit each OpenCL kernel to roughly 1 ms (by splitting kernels) to keep the UI responsive, default is 0.
* - obfuscate
- [optional] Whether to obfuscate the model operator names, defaults to 0.
* - winograd
- [optional] Which Winograd type to use, one of [0, 2, 4]: 0 disables Winograd; 2 and 4 enable it, and 4 may be faster than 2 but may use more memory.
.. note::
Some useful commands:
.. code:: bash
# Get device's soc info.
adb shell getprop | grep platform
# command for generating sha256_sum
sha256sum /path/to/your/file
Advanced usage
--------------
There are two common advanced use cases:
- converting model to C++ code.
- tuning GPU kernels for a specific SoC.
* **Convert model(s) to C++ code**
.. warning::
For this use case, only the static MACE library can be used.
* **1. Change the model deployment file(.yml)**
If you want to protect your model, you can convert it to C++ code. There are two options:
* Convert the model graph to code and the model weights to a file with the model configuration below.
.. code:: sh
model_graph_format: code
model_data_format: file
* Convert both the model graph and the model weights to code with the model configuration below.
.. code:: sh
model_graph_format: code
model_data_format: code
.. note::
Another model protection method is to enable ``obfuscate``, which obfuscates the names of the model's operators.
* **2. Convert model(s) to code**
.. code:: sh
python tools/converter.py convert --config=/path/to/model_deployment_file.yml
The command will generate **${library_name}.a** in the **builds/${library_name}/model** directory and
the header files in **builds/${library_name}/include**, like the following directory tree.
.. code::
# model_graph_format: code
# model_data_format: file
builds
├── include
│   └── mace
│   └── public
│   ├── mace_engine_factory.h
│   └── mobilenet_v1.h
└── model
   ├── mobilenet-v1.a
   └── mobilenet_v1.data
* **3. Deployment**
* Link `libmace.a` and `${library_name}.a` to your target.
* Refer to \ ``mace/examples/example.cc``\ for full usage. The following lists the key steps.
.. code:: cpp
// Include the headers
#include "mace/public/mace.h"
#include "mace/public/mace_runtime.h"
// If the model_graph_format is code
#include "mace/public/${model_name}.h"
#include "mace/public/mace_engine_factory.h"
// ... Same with the code in basic usage
// 4. Create MaceEngine instance
std::shared_ptr<mace::MaceEngine> engine;
MaceStatus create_engine_status;
// Create Engine from compiled code
create_engine_status =
CreateMaceEngineFromCode(model_name.c_str(),
nullptr,
input_names,
output_names,
device_type,
&engine);
if (create_engine_status != MaceStatus::MACE_SUCCESS) {
// Report error
}
// ... Same with the code in basic usage
* **Tuning for specific SoC's GPU**
If you want to use the GPU of a specific device, you can specify ``target_socs`` in your YAML file and
then tune the MACE library (OpenCL kernels) for it, which may bring a 1~10% performance improvement.
* **1. Change the model deployment file(.yml)**
Specify ``target_socs`` in your model deployment file(.yml):
.. code:: sh
target_socs: [sdm845]
.. note::
Get device's soc info: `adb shell getprop | grep platform`
* **2. Convert model(s)**
.. code:: sh
python tools/converter.py convert --config=/path/to/model_deployment_file.yml
* **3. Tuning**
``tools/converter.py`` will enable automatic tuning for GPU kernels. This usually takes some
time to finish depending on the complexity of your model.
.. note::
You should plug in device(s) with the specific SoC(s).
.. code:: sh
python tools/converter.py run --config=/path/to/model_deployment_file.yml --validate
The command will generate two files in `builds/${library_name}/opencl`, like the following directory tree.
.. code::
builds
└── mobilenet-v2
├── model
│   ├── mobilenet_v2.data
│   └── mobilenet_v2.pb
└── opencl
└── arm64-v8a
   ├── moblinet-v2_compiled_opencl_kernel.MiNote3.sdm660.bin
   └── moblinet-v2_tuned_opencl_parameter.MiNote3.sdm660.bin
* **mobilenet-v2-gpu_compiled_opencl_kernel.MI6.msm8998.bin** stands for the OpenCL binaries
used for your models, which can accelerate the initialization stage.
For details, please refer to the `OpenCL Specification <https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clCreateProgramWithBinary.html>`__.
* **mobilenet-v2-tuned_opencl_parameter.MI6.msm8998.bin** stands for the tuned OpenCL parameters
for the SoC.
* **4. Deployment**
* Rename the files generated above to avoid name collisions and push them to **your own device's directory**.
* Use them as in the previous procedure; the key differences are listed below.
.. code:: cpp
// Include the headers
#include "mace/public/mace.h"
#include "mace/public/mace_runtime.h"
// 0. Set pre-compiled OpenCL binary program file paths and OpenCL parameters file path when available
if (device_type == DeviceType::GPU) {
mace::SetOpenCLBinaryPaths(path/to/opencl_binary_paths);
mace::SetOpenCLParameterPath(path/to/opencl_parameter_file);
}
// ... Same with the code in basic usage.
Useful Commands
---------------
* **run the model**
.. code:: sh
# Test model run time
python tools/converter.py run --config=/path/to/model_deployment_file.yml --round=100
# Validate the correctness by comparing the results against the
# original model and framework, measured with cosine distance for similarity.
python tools/converter.py run --config=/path/to/model_deployment_file.yml --validate
# Check the memory usage of the model (**keep only one model in the deployment file**)
python tools/converter.py run --config=/path/to/model_deployment_file.yml --round=10000 &
sleep 5
adb shell dumpsys meminfo | grep mace_run
kill %1
.. warning::
``run`` relies on the ``convert`` command; you should ``convert`` before ``run``.
* **benchmark and profile model**
.. code:: sh
# Benchmark model, get detailed statistics of each Op.
python tools/converter.py benchmark --config=/path/to/model_deployment_file.yml
.. warning::
``benchmark`` relies on the ``convert`` command; you should ``benchmark`` after ``convert``.
**Common arguments**
.. list-table::
:header-rows: 1
* - option
- type
- default
- commands
- explanation
* - --omp_num_threads
- int
- -1
- ``run``/``benchmark``
- number of threads
* - --cpu_affinity_policy
- int
- 1
- ``run``/``benchmark``
- 0:AFFINITY_NONE/1:AFFINITY_BIG_ONLY/2:AFFINITY_LITTLE_ONLY
* - --gpu_perf_hint
- int
- 3
- ``run``/``benchmark``
- 0:DEFAULT/1:LOW/2:NORMAL/3:HIGH
* - --gpu_priority_hint
- int
- 3
- ``run``/``benchmark``
- 0:DEFAULT/1:LOW/2:NORMAL/3:HIGH
Use ``-h`` to get detailed help.
.. code:: sh
python tools/converter.py -h
python tools/converter.py build -h
python tools/converter.py run -h
python tools/converter.py benchmark -h
Basic usage
============
Build and run an example model
-------------------------------
First, make sure the environment has been set up correctly (refer to :doc:`../installation/env_requirement`).
The following instructions show how to quickly build and run a model provided in
`MACE Model Zoo <https://github.com/XiaoMi/mace-models>`__.
Here we use the mobilenet-v2 model as an example.
**Commands**
1. Pull `MACE <https://github.com/XiaoMi/mace>`__ project.
.. code:: sh
git clone https://github.com/XiaoMi/mace.git
git fetch --all --tags --prune
# Checkout the latest tag (i.e. release version)
tag_name=`git describe --abbrev=0 --tags`
git checkout tags/${tag_name}
.. note::
It's highly recommended to use a release version instead of the master branch.
2. Pull `MACE Model Zoo <https://github.com/XiaoMi/mace-models>`__ project.
.. code:: sh
git clone https://github.com/XiaoMi/mace-models.git
3. Build a generic MACE library.
.. code:: sh
cd path/to/mace
# Build library
# output lib path: builds/lib
bash tools/build-standalone-lib.sh
4. Convert the pre-trained mobilenet-v2 model to MACE format model.
.. code:: sh
cd path/to/mace
# Convert model
python tools/converter.py convert --config=/path/to/mace-models/mobilenet-v2/mobilenet-v2.yml
5. Run the model.
.. note::
If you want to run on device/phone, please plug in at least one device/phone.
.. code:: sh
# Run example
python tools/converter.py run --config=/path/to/mace-models/mobilenet-v2/mobilenet-v2.yml --example
# Test model run time
python tools/converter.py run --config=/path/to/mace-models/mobilenet-v2/mobilenet-v2.yml --round=100
# Validate the correctness by comparing the results against the
# original model and framework, measured with cosine distance for similarity.
python tools/converter.py run --config=/path/to/mace-models/mobilenet-v2/mobilenet-v2.yml --validate
Build your own model
---------------------
This part will show you how to use your own pre-trained model in MACE.
======================
1. Prepare your model
======================
MACE now supports models from TensorFlow and Caffe (more frameworks will be supported).
- TensorFlow
Prepare your pre-trained TensorFlow model.pb file.
Use `Graph Transform Tool <https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/graph_transforms/README.md>`__
to optimize your model for inference.
This tool will improve the efficiency of inference by making several optimizations like operator
folding, redundant node removal, etc. We strongly recommend MACE users use it before building.
Usage for CPU/GPU,
.. code:: bash
# CPU/GPU:
./transform_graph \
--in_graph=/path/to/your/tf_model.pb \
--out_graph=/path/to/your/output/tf_model_opt.pb \
--inputs='input node name' \
--outputs='output node name' \
--transforms='strip_unused_nodes(type=float, shape="1,64,64,3")
strip_unused_nodes(type=float, shape="1,64,64,3")
remove_nodes(op=Identity, op=CheckNumerics)
fold_constants(ignore_errors=true)
flatten_atrous_conv
fold_batch_norms
fold_old_batch_norms
strip_unused_nodes
sort_by_execution_order'
- Caffe
Caffe 1.0+ models are supported by the MACE converter tool.
If your model is from a lower Caffe version, you need to upgrade it using Caffe's built-in tools before converting.
.. code:: bash
# Upgrade prototxt
$CAFFE_ROOT/build/tools/upgrade_net_proto_text MODEL.prototxt MODEL.new.prototxt
# Upgrade caffemodel
$CAFFE_ROOT/build/tools/upgrade_net_proto_binary MODEL.caffemodel MODEL.new.caffemodel
===========================================
2. Create a deployment file for your model
===========================================
When converting a model or building a library, MACE needs to read a YAML file, which is called the model deployment file here.
A model deployment file contains all the information about your model(s) and the build options. There are several example
deployment files in the *MACE Model Zoo* project.
The following shows two basic deployment files, for TensorFlow and Caffe models.
Modify one of them and use it for your own case.
- TensorFlow
.. literalinclude:: models/demo_models_tf.yml
:language: yaml
- Caffe
.. literalinclude:: models/demo_models_caffe.yml
:language: yaml
More details about model deployment file are in :doc:`advanced_usage`.
======================
3. Convert your model
======================
When the deployment file is ready, you can use MACE converter tool to convert your model(s).
.. code:: bash
python tools/converter.py convert --config=/path/to/your/model_deployment_file.yml
This command will download or load your pre-trained model and convert it to a MACE model proto file and a weights data file.
The generated model files will be stored in the ``build/${library_name}/model`` folder.
.. warning::
Please set ``model_graph_format: file`` and ``model_data_format: file`` in your deployment file before converting.
The usage of ``model_graph_format: code`` will be demonstrated in :doc:`advanced_usage`.
=============================
4. Build MACE into a library
=============================
Use bazel to build MACE source code into a library.
.. code:: sh
cd path/to/mace
# Build library
# output lib path: builds/lib
bash tools/build-standalone-lib.sh
The above command will generate dynamic library ``builds/lib/${ABI}/libmace.so`` and static library ``builds/lib/${ABI}/libmace.a``.
.. warning::
Please verify that the ``target_abis`` param in the above command and in your deployment file are the same.
==================
5. Run your model
==================
With the converted model, the static or shared library and header files, you can use the following commands
to run and validate your model.
.. warning::
If you want to run on device/phone, please plug in at least one device/phone.
* **run**
run the model.
.. code:: sh
# Test model run time
python tools/converter.py run --config=/path/to/your/model_deployment_file.yml --round=100
# Validate the correctness by comparing the results against the
# original model and framework, measured with cosine distance for similarity.
python tools/converter.py run --config=/path/to/your/model_deployment_file.yml --validate
* **benchmark**
benchmark and profile the model.
.. code:: sh
# Benchmark model, get detailed statistics of each Op.
python tools/converter.py benchmark --config=/path/to/your/model_deployment_file.yml
=======================================
6. Deploy your model into applications
=======================================
In the converting and building steps, you've got the static/shared library, model files and
header files.
``${library_name}`` is the name you defined in the first line of your deployment YAML file.
- The generated ``static`` library files are organized as follows,
.. code::
builds
├── include
│   └── mace
│   └── public
│   ├── mace.h
│   └── mace_runtime.h
├── lib
│   ├── arm64-v8a
│   │   ├── libmace.a
│   │   └── libmace.so
│   ├── armeabi-v7a
│   │   ├── libhexagon_controller.so
│   │   ├── libmace.a
│   │   └── libmace.so
│   └── linux-x86-64
│   ├── libmace.a
│   └── libmace.so
└── mobilenet-v1
├── model
│   ├── mobilenet_v1.data
│   └── mobilenet_v1.pb
└── _tmp
└── arm64-v8a
└── mace_run_static
Please refer to \ ``mace/examples/example.cc``\ for full usage. The following lists the key steps.
.. code:: cpp
// Include the headers
#include "mace/public/mace.h"
#include "mace/public/mace_runtime.h"
// 0. Set pre-compiled OpenCL binary program file paths when available
if (device_type == DeviceType::GPU) {
mace::SetOpenCLBinaryPaths(opencl_binary_paths);
}
// 1. Set compiled OpenCL kernel cache, this is used to reduce the
// initialization time since compiling is slow. It's suggested
// to set this even when a pre-compiled OpenCL program file is provided,
// because an OpenCL version upgrade may also lead to kernel
// recompilation.
const std::string file_path = "path/to/opencl_cache_file";
std::shared_ptr<KVStorageFactory> storage_factory(
new FileStorageFactory(file_path));
ConfigKVStorageFactory(storage_factory);
// 2. Declare the device type (must be the same as ``runtime`` in the configuration file)
DeviceType device_type = DeviceType::GPU;
// 3. Define the input and output tensor names.
std::vector<std::string> input_names = {...};
std::vector<std::string> output_names = {...};
// 4. Create MaceEngine instance
std::shared_ptr<mace::MaceEngine> engine;
MaceStatus create_engine_status;
// Create Engine from model file
create_engine_status =
CreateMaceEngineFromProto(model_pb_data,
model_data_file.c_str(),
input_names,
output_names,
device_type,
&engine);
if (create_engine_status != MaceStatus::MACE_SUCCESS) {
// Report error
}
// 5. Create Input and Output tensor buffers
std::map<std::string, mace::MaceTensor> inputs;
std::map<std::string, mace::MaceTensor> outputs;
for (size_t i = 0; i < input_count; ++i) {
// Allocate input and output
int64_t input_size =
std::accumulate(input_shapes[i].begin(), input_shapes[i].end(), 1,
std::multiplies<int64_t>());
auto buffer_in = std::shared_ptr<float>(new float[input_size],
std::default_delete<float[]>());
// Load input here
// ...
inputs[input_names[i]] = mace::MaceTensor(input_shapes[i], buffer_in);
}
for (size_t i = 0; i < output_count; ++i) {
int64_t output_size =
std::accumulate(output_shapes[i].begin(), output_shapes[i].end(), 1,
std::multiplies<int64_t>());
auto buffer_out = std::shared_ptr<float>(new float[output_size],
std::default_delete<float[]>());
outputs[output_names[i]] = mace::MaceTensor(output_shapes[i], buffer_out);
}
// 6. Run the model
MaceStatus status = engine->Run(inputs, &outputs);
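Once ``Run`` succeeds, the output values can be read back through the buffers registered in ``outputs``. A brief sketch, assuming ``MaceTensor`` exposes its underlying float buffer via ``data()`` as used in the allocation code above:
.. code:: cpp
// 7. Read the results (illustrative sketch; reuses the names defined above)
if (status == MaceStatus::MACE_SUCCESS) {
  for (size_t i = 0; i < output_count; ++i) {
    const float *output_data = outputs[output_names[i]].data().get();
    // Post-process output_data here, e.g. pick the arg-max class for a
    // classification model.
    (void)output_data;
  }
}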
More details are in :doc:`advanced_usage`.
# The name of library
library_name: mobile_squeeze
# host, armeabi-v7a or arm64-v8a
target_abis: [arm64-v8a]
# The build mode for model(s).
# 'code' for transferring model(s) into cpp code, 'file' for keeping model(s) in protobuf file(s) (.pb).
model_graph_format: code
# 'code' for transferring model data(s) into cpp code, 'file' for keeping model data(s) in file(s) (.data).
model_data_format: code
# One yaml config file can contain multiple models' deployment info.
models:
mobilenet_v1:
platform: tensorflow
model_file_path: https://cnbj1.fds.api.xiaomi.com/mace/miai-models/mobilenet-v1/mobilenet-v1-1.0.pb
model_sha256_checksum: 71b10f540ece33c49a7b51f5d4095fc9bd78ce46ebf0300487b2ee23d71294e6
subgraphs:
- input_tensors:
- input
input_shapes:
- 1,224,224,3
output_tensors:
- MobilenetV1/Predictions/Reshape_1
output_shapes:
- 1,1001
validation_inputs_data:
- https://cnbj1.fds.api.xiaomi.com/mace/inputs/dog.npy
runtime: cpu+gpu
limit_opencl_kernel_time: 0
obfuscate: 0
winograd: 0
squeezenet_v11:
platform: caffe
model_file_path: http://cnbj1-inner-fds.api.xiaomi.net/mace/mace-models/squeezenet/SqueezeNet_v1.1/model.prototxt
weight_file_path: http://cnbj1-inner-fds.api.xiaomi.net/mace/mace-models/squeezenet/SqueezeNet_v1.1/weight.caffemodel
model_sha256_checksum: 625c952063da1569e22d2f499dc454952244d42cd8feca61f05502566e70ae1c
weight_sha256_checksum: 72b912ace512e8621f8ff168a7d72af55910d3c7c9445af8dfbff4c2ee960142
subgraphs:
- input_tensors:
- data
input_shapes:
- 1,227,227,3
output_tensors:
- prob
output_shapes:
- 1,1,1,1000
runtime: cpu+gpu
limit_opencl_kernel_time: 0
obfuscate: 0
winograd: 0
# The name of library
library_name: squeezenet-v10
target_abis: [arm64-v8a]
model_graph_format: file
model_data_format: file
models:
squeezenet-v10: # model tag, which will be used in model loading and must be specific.
platform: caffe
# support local path, http:// and https://
model_file_path: https://cnbj1.fds.api.xiaomi.com/mace/miai-models/squeezenet/squeezenet-v1.0.prototxt
weight_file_path: https://cnbj1.fds.api.xiaomi.com/mace/miai-models/squeezenet/squeezenet-v1.0.caffemodel
# sha256_checksum of your model's graph and data files.
# get the sha256_checksum: sha256sum path/to/your/file
model_sha256_checksum: db680cf18bb0387ded9c8e9401b1bbcf5dc09bf704ef1e3d3dbd1937e772cae0
weight_sha256_checksum: 9ff8035aada1f9ffa880b35252680d971434b141ec9fbacbe88309f0f9a675ce
# define your model's interface
# if there are multiple inputs or outputs, write them like below:
# subgraphs:
# - input_tensors:
# - input0
# - input1
# input_shapes:
# - 1,224,224,3
# - 1,224,224,3
# output_tensors:
# - output0
# - output1
# output_shapes:
# - 1,1001
# - 1,1001
subgraphs:
- input_tensors:
- data
input_shapes:
- 1,227,227,3
output_tensors:
- prob
output_shapes:
- 1,1,1,1000
runtime: cpu+gpu
winograd: 0
# The name of library
library_name: mobilenet
target_abis: [arm64-v8a]
model_graph_format: file
model_data_format: file
models:
mobilenet_v1: # model tag, which will be used in model loading and must be specific.
platform: tensorflow
# path to your tensorflow model's pb file. Support local path, http:// and https://
model_file_path: https://cnbj1.fds.api.xiaomi.com/mace/miai-models/mobilenet-v1/mobilenet-v1-1.0.pb
# sha256_checksum of your model's pb file.
# use this command to get the sha256_checksum: sha256sum path/to/your/pb/file
model_sha256_checksum: 71b10f540ece33c49a7b51f5d4095fc9bd78ce46ebf0300487b2ee23d71294e6
# define your model's interface
# if there are multiple inputs or outputs, write them like below:
# subgraphs:
# - input_tensors:
# - input0
# - input1
# input_shapes:
# - 1,224,224,3
# - 1,224,224,3
# output_tensors:
# - output0
# - output1
# output_shapes:
# - 1,1001
# - 1,1001
subgraphs:
- input_tensors:
- input
input_shapes:
- 1,224,224,3
output_tensors:
- MobilenetV1/Predictions/Reshape_1
output_shapes:
- 1,1001
# cpu, gpu or cpu+gpu
runtime: cpu+gpu
winograd: 0
\ No newline at end of file
......@@ -3,7 +3,6 @@ Operator lists
.. Please keep in chronological order when editing
.. csv-table::
:widths: auto
:header: "Operator","Supported","Remark"
"AVERAGE_POOL_2D","Y",""
......@@ -27,17 +26,17 @@ Operator lists
"LOCAL_RESPONSE_NORMALIZATION","Y",""
"LOGISTIC","Y",""
"LSTM","",""
"MATMUL","Y",""
"MATMUL","Y","Only CPU is supported"
"MAX_POOL_2D","Y",""
"PAD","Y",""
"PSROI_ALIGN","Y",""
"PRELU","Y","Only caffe model is supported"
"REDUCE_MEAN","Y","Only tensorflow model is supported"
"REDUCE_MEAN","Y","Only tensorflow model is supported. For GPU only H + W axis reduce is supported"
"RELU","Y",""
"RELU1","Y",""
"RELU6","Y",""
"RELUX","Y",""
"RESHAPE","Y","Limited support: only internal use of reshape in composed operations is supported"
"RESHAPE","Y","Limited support: GPU is full supported, for CPU only supports softmax-like usage"
"RESIZE_BILINEAR","Y",""
"RNN","",""
"RPN_PROPOSAL_LAYER","Y",""
......
......@@ -3,10 +3,13 @@
set -e -u -o pipefail
pushd ../../../
python tools/converter.py build --config=docs/getting_started/models/demo_app_models.yaml
cp -r builds/mobilenet/include mace/examples/android/macelibrary/src/main/cpp/
cp -r builds/mobilenet/lib mace/examples/android/macelibrary/src/main/cpp/
python tools/converter.py convert --config=mace/examples/android/mobilenet.yml
cp -rf builds/mobilenet/include mace/examples/android/macelibrary/src/main/cpp/
cp -rf builds/mobilenet/model mace/examples/android/macelibrary/src/main/cpp/
bash tools/build-standalone-lib.sh
cp -rf builds/lib mace/examples/android/macelibrary/src/main/cpp/
popd
......
......@@ -14,18 +14,12 @@ cmake_minimum_required(VERSION 3.4.1)
include_directories(${CMAKE_SOURCE_DIR}/)
include_directories(${CMAKE_SOURCE_DIR}/src/main/cpp/include)
file(GLOB static_file ${CMAKE_SOURCE_DIR}/src/main/cpp/lib/arm64-v8a/*.a)
MESSAGE(STATUS "FILE URL = ${CMAKE_SOURCE_DIR}")
MESSAGE(STATUS "FILE URL = ${static_file}")
foreach(fileStr ${static_file})
set(tmpstr ${fileStr})
MESSAGE(STATUS "FILE URL = ${tmpstr}")
endforeach()
add_library (mace_mobile_lib STATIC IMPORTED)
set_target_properties(mace_mobile_lib PROPERTIES IMPORTED_LOCATION ${tmpstr})
set(mace_file ${CMAKE_SOURCE_DIR}/src/main/cpp/lib/arm64-v8a/libmace.a)
set(mobilenet_file ${CMAKE_SOURCE_DIR}/src/main/cpp/model/mobilenet.a)
add_library (mace_lib STATIC IMPORTED)
set_target_properties(mace_lib PROPERTIES IMPORTED_LOCATION ${mace_file})
add_library (mobilenet_lib STATIC IMPORTED)
set_target_properties(mobilenet_lib PROPERTIES IMPORTED_LOCATION ${mobilenet_file})
add_library( # Sets the name of the library.
mace_mobile_jni
......@@ -55,7 +49,8 @@ find_library( # Sets the name of the path variable.
target_link_libraries( # Specifies the target library.
mace_mobile_jni
mace_mobile_lib
mace_lib
mobilenet_lib
# Links the target library to the log library
# included in the NDK.
${log-lib} )
\ No newline at end of file
# The name of the library
library_name: mobilenet
target_abis: [arm64-v8a]
embed_model_data: 1
# The build mode for model(s).
# 'code' stand for transfer model(s) into cpp code, 'proto' for model(s) in protobuf file(s).
build_type: code
linkshared: 0
# One yaml config file can contain configs for multiple models.
model_graph_format: code
model_data_format: code
models:
mobilenet_v1: # model tag, which will be used in model loading and must be specific.
mobilenet_v1:
platform: tensorflow
# supports a local path as well as http:// and https:// URLs
model_file_path: https://cnbj1.fds.api.xiaomi.com/mace/miai-models/mobilenet-v1/mobilenet-v1-1.0.pb
model_sha256_checksum: 71b10f540ece33c49a7b51f5d4095fc9bd78ce46ebf0300487b2ee23d71294e6
subgraphs:
- input_tensors: input
input_shapes: 1,224,224,3
output_tensors: MobilenetV1/Predictions/Reshape_1
output_shapes: 1,1001
- input_tensors:
- input
input_shapes:
- 1,224,224,3
output_tensors:
- MobilenetV1/Predictions/Reshape_1
output_shapes:
- 1,1001
runtime: cpu+gpu
limit_opencl_kernel_time: 0
nnlib_graph_mode: 0
......@@ -28,10 +26,14 @@ models:
model_file_path: https://cnbj1.fds.api.xiaomi.com/mace/miai-models/mobilenet-v2/mobilenet-v2-1.0.pb
model_sha256_checksum: 369f9a5f38f3c15b4311c1c84c032ce868da9f371b5f78c13d3ea3c537389bb4
subgraphs:
- input_tensors: input
input_shapes: 1,224,224,3
output_tensors: MobilenetV2/Predictions/Reshape_1
output_shapes: 1,1001
- input_tensors:
- input
input_shapes:
- 1,224,224,3
output_tensors:
- MobilenetV2/Predictions/Reshape_1
output_shapes:
- 1,1001
runtime: cpu+gpu
limit_opencl_kernel_time: 0
nnlib_graph_mode: 0
......
......@@ -59,6 +59,10 @@ MaceStatus CreateMaceEngineFromCode(
return MaceStatus::MACE_INVALID_ARGS;
}
std::shared_ptr<NetDef> net_def;
{% if embed_model_data %}
(void)model_data_file;
const unsigned char * model_data;
{% endif %}
MaceStatus status = MaceStatus::MACE_SUCCESS;
switch (model_name_map[model_name]) {
{% for i in range(model_tags |length) %}
......@@ -66,12 +70,12 @@ MaceStatus CreateMaceEngineFromCode(
net_def = mace::{{model_tags[i]}}::CreateNet();
engine->reset(new mace::MaceEngine(device_type));
{% if embed_model_data %}
(void)model_data_file;
const unsigned char * model_data =
mace::{{model_tags[i]}}::LoadModelData();
status = (*engine)->Init(net_def.get(), input_nodes, output_nodes, model_data);
model_data = mace::{{model_tags[i]}}::LoadModelData();
status = (*engine)->Init(net_def.get(), input_nodes, output_nodes,
model_data);
{% else %}
status = (*engine)->Init(net_def.get(), input_nodes, output_nodes, model_data_file);
status = (*engine)->Init(net_def.get(), input_nodes, output_nodes,
model_data_file);
{% endif %}
break;
{% endfor %}
......
......@@ -8,16 +8,6 @@ INCLUDE_DIR=builds/include/mace/public
mkdir -p $LIB_DIR
mkdir -p $INCLUDE_DIR
# generate version code
rm -rf mace/codegen/version
mkdir -p mace/codegen/version
bash mace/tools/git/gen_version_source.sh mace/codegen/version/version.cc
# generate tuning code
rm -rf mace/codegen/tuning
mkdir -p mace/codegen/tuning
python mace/python/tools/binary_codegen.py --output_path=mace/codegen/tuning/tuning_params.cc
# copy include headers
cp mace/public/*.h $INCLUDE_DIR/
......@@ -57,7 +47,7 @@ bazel build --config android --config optimization mace:libmace_static --define
cp bazel-genfiles/mace/libmace.a $LIB_DIR/arm64-v8a/
echo "build static lib for linux-x86-64"
bazel build mace:libmace --config optimization --define openmp=true
bazel build mace:libmace_static --config optimization --define openmp=true
cp bazel-genfiles/mace/libmace.a $LIB_DIR/linux-x86-64/
echo "LIB PATH: $LIB_DIR"
......
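A rough usage sketch for the standalone-library script above, assuming the builds/lib and builds/include output directories that the script itself sets up:

```sh
# Build the static MACE libraries for the supported targets, then check
# where the script leaves the headers and the per-target libmace.a files.
bash tools/build-standalone-lib.sh
ls builds/include/mace/public    # public headers
ls builds/lib/arm64-v8a          # Android arm64-v8a static library
ls builds/lib/linux-x86-64       # linux-x86-64 static library
```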
......@@ -543,23 +543,24 @@ def clear_build_dirs(library_name):
def check_model_converted(library_name, model_name,
model_graph_format, model_data_format):
model_graph_format, model_data_format,
abi):
model_output_dir = \
'%s/%s/%s' % (BUILD_OUTPUT_DIR, library_name, MODEL_OUTPUT_DIR_NAME)
if model_graph_format == ModelFormat.file:
mace_check(os.path.exists("%s/%s.pb" % (model_output_dir, model_name)),
ModuleName.RUN,
"You shuold convert model first.")
"You should convert model first.")
else:
mace_check(os.path.exists("%s/%s.a" %
(model_output_dir, library_name)),
model_lib_path = get_model_lib_output_path(library_name, abi)
mace_check(os.path.exists(model_lib_path),
ModuleName.RUN,
"You shuold convert model first.")
"You should convert model first.")
if model_data_format == ModelFormat.file:
mace_check(os.path.exists("%s/%s.data" %
(model_output_dir, model_name)),
ModuleName.RUN,
"You shuold convert model first.")
"You should convert model first.")
################################
......@@ -716,10 +717,10 @@ def convert_model(configs):
StringFormatter.block("Model %s converted" % model_name))
def get_model_lib_output_path(library_name):
library_out_dir = os.path.join(BUILD_OUTPUT_DIR, library_name,
MODEL_OUTPUT_DIR_NAME)
lib_output_path = "%s/%s.a" % (library_out_dir, library_name)
def get_model_lib_output_path(library_name, abi):
lib_output_path = os.path.join(BUILD_OUTPUT_DIR, library_name,
MODEL_OUTPUT_DIR_NAME, abi,
"%s.a" % library_name)
return lib_output_path
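With the abi parameter added above, the generated model archive now lands in a per-ABI directory. A sketch of the resulting layout, assuming BUILD_OUTPUT_DIR is builds and MODEL_OUTPUT_DIR_NAME is model, as used by the example build script earlier in this change:

```sh
# For library_name=mobilenet and target_abi=arm64-v8a, the model library
# produced by build_model_lib is now expected at:
ls builds/mobilenet/model/arm64-v8a/mobilenet.a
```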
......@@ -728,13 +729,13 @@ def build_model_lib(configs, address_sanitizer):
# create model library dir
library_name = configs[YAMLKeyword.library_name]
model_lib_output_path = get_model_lib_output_path(library_name)
library_out_dir = os.path.dirname(model_lib_output_path)
if not os.path.exists(library_out_dir):
os.makedirs(library_out_dir)
for target_abi in configs[YAMLKeyword.target_abis]:
hexagon_mode = get_hexagon_mode(configs)
model_lib_output_path = get_model_lib_output_path(library_name,
target_abi)
library_out_dir = os.path.dirname(model_lib_output_path)
if not os.path.exists(library_out_dir):
os.makedirs(library_out_dir)
sh_commands.bazel_build(
MODEL_LIB_TARGET,
......@@ -841,7 +842,7 @@ def build_mace_run(configs, target_abi, enable_openmp, address_sanitizer,
if configs[YAMLKeyword.model_graph_format] == ModelFormat.code:
mace_check(os.path.exists(ENGINE_CODEGEN_DIR),
ModuleName.RUN,
"You shuold convert model first.")
"You should convert model first.")
build_arg = "--per_file_copt=mace/tools/validation/mace_run.cc@-DMODEL_GRAPH_FORMAT_CODE" # noqa
sh_commands.bazel_build(
......@@ -887,8 +888,9 @@ def build_example(configs, target_abi, enable_openmp, mace_lib_type):
if configs[YAMLKeyword.model_graph_format] == ModelFormat.code:
mace_check(os.path.exists(ENGINE_CODEGEN_DIR),
ModuleName.RUN,
"You shuold convert model first.")
model_lib_path = get_model_lib_output_path(library_name)
"You should convert model first.")
model_lib_path = get_model_lib_output_path(library_name,
target_abi)
sh.cp("-f", model_lib_path, LIB_CODEGEN_DIR)
build_arg = "--per_file_copt=mace/examples/cli/example.cc@-DMODEL_GRAPH_FORMAT_CODE" # noqa
......@@ -912,12 +914,6 @@ def tuning(library_name, model_name, model_config,
mace_lib_type):
print('* Tuning, it may take some time...')
# clear opencl output dir
opencl_output_dir = os.path.join(
BUILD_OUTPUT_DIR, library_name, OUTPUT_OPENCL_BINARY_DIR_NAME)
if os.path.exists(opencl_output_dir):
sh.rm('-rf', opencl_output_dir)
build_tmp_binary_dir = get_build_binary_dir(library_name, target_abi)
mace_run_name = MACE_RUN_STATIC_NAME
link_dynamic = False
......@@ -994,16 +990,7 @@ def run_specific_target(flags, configs, target_abi,
mace_lib_type = flags.mace_lib_type
embed_model_data = \
configs[YAMLKeyword.model_data_format] == ModelFormat.code
opencl_output_bin_path = ""
opencl_parameter_path = ""
build_tmp_binary_dir = get_build_binary_dir(library_name, target_abi)
if configs[YAMLKeyword.target_socs] and target_abi != ABIType.host:
opencl_output_bin_path = get_opencl_binary_output_path(
library_name, target_abi, target_soc, serial_num
)
opencl_parameter_path = get_opencl_parameter_output_path(
library_name, target_abi, target_soc, serial_num
)
# get target name for run
if flags.example:
......@@ -1023,7 +1010,8 @@ def run_specific_target(flags, configs, target_abi,
for model_name in configs[YAMLKeyword.models]:
check_model_converted(library_name, model_name,
configs[YAMLKeyword.model_graph_format],
configs[YAMLKeyword.model_data_format])
configs[YAMLKeyword.model_data_format],
target_abi)
if target_abi == ABIType.host:
device_name = ABIType.host
else:
......@@ -1049,10 +1037,14 @@ def run_specific_target(flags, configs, target_abi,
get_build_model_dirs(library_name, model_name, target_abi,
target_soc, serial_num,
model_config[YAMLKeyword.model_file_path])
# clear temp model output dir
if os.path.exists(model_output_dir):
sh.rm("-rf", model_output_dir)
os.makedirs(model_output_dir)
is_tuned = False
model_opencl_output_bin_path = ""
model_opencl_parameter_path = ""
# tuning for specified soc
if not flags.address_sanitizer \
and not flags.example \
......@@ -1067,6 +1059,23 @@ def run_specific_target(flags, configs, target_abi,
target_abi, target_soc, serial_num,
mace_lib_type)
model_output_dirs.append(model_output_dir)
model_opencl_output_bin_path =\
"%s/%s/%s" % (model_output_dir,
BUILD_TMP_OPENCL_BIN_DIR,
CL_COMPILED_BINARY_FILE_NAME)
model_opencl_parameter_path = \
"%s/%s/%s" % (model_output_dir,
BUILD_TMP_OPENCL_BIN_DIR,
CL_TUNED_PARAMETER_FILE_NAME)
sh_commands.clear_phone_data_dir(serial_num, PHONE_DATA_DIR)
is_tuned = True
elif target_abi != ABIType.host and target_soc:
model_opencl_output_bin_path = get_opencl_binary_output_path(
library_name, target_abi, target_soc, serial_num
)
model_opencl_parameter_path = get_opencl_parameter_output_path(
library_name, target_abi, target_soc, serial_num
)
# generate input data
sh_commands.gen_random_input(
......@@ -1114,8 +1123,8 @@ def run_specific_target(flags, configs, target_abi,
gpu_priority_hint=flags.gpu_priority_hint,
runtime_failure_ratio=flags.runtime_failure_ratio,
address_sanitizer=flags.address_sanitizer,
opencl_binary_file=opencl_output_bin_path,
opencl_parameter_file=opencl_parameter_path,
opencl_binary_file=model_opencl_output_bin_path,
opencl_parameter_file=model_opencl_parameter_path,
libmace_dynamic_library_path=LIBMACE_DYNAMIC_PATH,
link_dynamic=link_dynamic,
)
......@@ -1142,11 +1151,7 @@ def run_specific_target(flags, configs, target_abi,
phone_data_dir=PHONE_DATA_DIR,
caffe_env=flags.caffe_env)
if flags.report and flags.round > 0:
opencl_parameter_bin_path = get_opencl_parameter_output_path(
library_name, target_abi, target_soc, serial_num
)
tuned = device_type == DeviceType.GPU\
and os.path.exists(opencl_parameter_bin_path)
tuned = is_tuned and device_type == DeviceType.GPU
report_run_statistics(
run_output, target_abi, serial_num,
model_name, device_type, flags.report_dir,
......@@ -1159,6 +1164,12 @@ def run_specific_target(flags, configs, target_abi,
opencl_parameter_bin_path = get_opencl_parameter_output_path(
library_name, target_abi, target_soc, serial_num
)
# clear opencl output dir
if os.path.exists(opencl_output_bin_path):
sh.rm('-rf', opencl_output_bin_path)
if os.path.exists(opencl_parameter_bin_path):
sh.rm('-rf', opencl_parameter_bin_path)
# merge all models' OpenCL binaries together
sh_commands.merge_opencl_binaries(
model_output_dirs, CL_COMPILED_BINARY_FILE_NAME,
......@@ -1228,7 +1239,7 @@ def build_benchmark_model(configs, target_abi, enable_openmp, mace_lib_type):
if configs[YAMLKeyword.model_graph_format] == ModelFormat.code:
mace_check(os.path.exists(ENGINE_CODEGEN_DIR),
ModuleName.BENCHMARK,
"You shuold convert model first.")
"You should convert model first.")
build_arg = "--per_file_copt=mace/benchmark/benchmark_model.cc@-DMODEL_GRAPH_FORMAT_CODE" # noqa
sh_commands.bazel_build(benchmark_target,
......@@ -1271,7 +1282,8 @@ def bm_specific_target(flags, configs, target_abi, target_soc, serial_num):
for model_name in configs[YAMLKeyword.models]:
check_model_converted(library_name, model_name,
configs[YAMLKeyword.model_graph_format],
configs[YAMLKeyword.model_data_format])
configs[YAMLKeyword.model_data_format],
target_abi)
if target_abi == ABIType.host:
device_name = ABIType.host
else:
......
......@@ -780,7 +780,7 @@ def tuning_run(abi,
print("Running finished!\n")
return stdout
return stdout
def validate_model(abi,
......
......@@ -191,7 +191,7 @@ def parse_args():
"""Parses command line arguments."""
parser = argparse.ArgumentParser()
parser.add_argument(
"--platform", type=str, default="", help="Tensorflow or Caffe.")
"--platform", type=str, default="", help="TensorFlow or Caffe.")
parser.add_argument(
"--model_file",
type=str,
......