Commit 11733d33 authored by Liangliang He

Update documents

Parent 1a9b0045
# MiAI Compute Engine
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
[![build status](http://v9.git.n.xiaomi.com/deep-computing/mace/badges/master/build.svg)](http://v9.git.n.xiaomi.com/deep-computing/mace/pipelines)
## Introduction
**Accelerating neural network inference with heterogeneous computing devices on the phone.**
[Documentation](docs) |
[FAQ](docs/faq.md) |
[Release Notes](RELEASE.md) |
[MiAI Model Zoo](http://v9.git.n.xiaomi.com/deep-computing/mace-models) |
[Demo](mace/android)
Supported Devices: **CPU(NEON)/GPU/DSP**.
**MiAI Compute Engine** is a deep learning inference framework optimized for
mobile heterogeneous computing platforms. The design is focused on the following
targets:
* Performance
* The runtime is highly optimized with NEON, OpenCL and HVX. Besides
inference speed, the initialization speed is also intensively optimized.
* Power consumption
* Chip-dependent power options are included as advanced APIs.
* Memory usage and library footprint
* Graph-level memory allocation optimization and buffer reuse are supported.
* Model protection
* Model protection has been one of the highest-priority features from the
beginning of the design. Various techniques are introduced, like converting
models to C++ code and literal obfuscation.
* Platform coverage
* A good coverage of recent Qualcomm, MediaTek, Pinecone and other ARM-based
chips. The CPU runtime is also compatible with most POSIX systems and
architectures, with limited performance.
## Getting Started
wiki url: [http://v9.git.n.xiaomi.com/deep-computing/mace/wikis/User%20Guide/introduction](http://v9.git.n.xiaomi.com/deep-computing/mace/wikis/User%20Guide/introduction)
## Performance
[MiAI Model Zoo](http://v9.git.n.xiaomi.com/deep-computing/mace-models) contains
several common neural network models, which are built daily against several mobile
phones. The benchmark results can be found on the CI result page.
## Communication
* GitHub issues: bug reports, usage issues, feature requests
* Gitter or Slack:
* QQ group:
## Contributing
Any kind of contribution is welcome. For bug reports and feature requests,
please just open an issue without hesitation. For code contributions, it's
strongly suggested to open an issue for discussion first. For more details,
please refer to [this guide](docs).
## License
[Apache License 2.0](LICENSE).
## Acknowledgement
*MiAI Compute Engine* depends on several open source projects located in
the [third_party](mace/third_party) directory. Particularly, we learned a lot from
the following projects during the development:
* [nnlib](https://source.codeaurora.org/quic/hexagon_nn/nnlib): the DSP runtime
depends on this library.
* [TensorFlow](https://github.com/tensorflow/tensorflow),
[Caffe](https://github.com/BVLC/caffe),
[SNPE](https://developer.qualcomm.com/software/snapdragon-neural-processing-engine-ai),
[ARM ComputeLibrary](https://github.com/ARM-software/ComputeLibrary),
[ncnn](https://github.com/Tencent/ncnn) and many others: we learned many best
practices from these projects.
Finally, we also thank the Qualcomm, Pinecone and MediaTek engineering teams for
their help.
Release Notes
=====
v0.6.0 (2018-04-04)
------
1. Changed MACE header interfaces to include only the necessary methods.
MACE Documentation
---
**How to build the documents**
The documents are based on Sphinx; run the following commands to build them:
```
pip install sphinx sphinx-autobuild
pip install recommonmark
make html
```
After building, the generated documents are located in `_build`.
import recommonmark.parser
import sphinx_rtd_theme
project = u'MiAI Compute Engine'
author = u'%s Developers' % project
copyright = u'2018, %s' % author
Logging
=======
The rules of VLOG levels:
0. Ad hoc debug logging, should only be added in tests or for temporary ad hoc debugging
1. Important network-level debug/latency trace logs (an Op run should never generate level 1 vlogs)
2. Important op-level latency trace logs
3. Unimportant debug/latency trace logs
4. Verbose debug/latency trace logs
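As a rough illustration of how the levels are used (a standalone sketch with a simplified stand-in macro; use MACE's real `VLOG` in actual code):
```c++
#include <iostream>

static int vlog_level = 2;  // typically configured at startup
// Simplified stand-in for a VLOG-style macro, for illustration only.
#define VLOG(lvl) if ((lvl) <= vlog_level) std::cout << "[VLOG " << (lvl) << "] "

int main() {
  double latency_ms = 3.7;
  VLOG(1) << "net run latency: " << latency_ms << " ms\n";    // network level
  VLOG(2) << "conv_1 op latency: " << latency_ms << " ms\n";  // op level
  VLOG(3) << "buffer reuse details\n";                        // filtered out here
  return 0;
}
```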
OpenCL Image Storage Layout
===========================
Use **Image** objects to optimize memory access and parallel computing based on OpenCL 2.0.
Design the corresponding **Image** format to optimize memory access for different Op algorithms.
Each pixel of an **Image** object contains 4 elements (e.g. RGBA).
The following are the **Buffer** and **Image** formats for all **Tensors**.
Input/Output
---
**Mace** uses the NHWC format for Input/Output.
| Tensor| Buffer| Image Size [Width, Height]| Explanation|
| --------- | :---------:|:--------:|:----:|
|Channel-Major Input/Output | NHWC | [W * (C+3)/4, N * H] | Default Input/Output format|
|Height-Major Input/Output | NHWC | [W * C, N * (H+3)/4] | Winograd Convolution format|
|Width-Major Input/Output | NHWC | [(W+3)/4 * C, N * H] | Winograd Convolution format|
Each pixel of an **Image** contains 4 elements. The table below lists the coordinate
relationship between **Image** and **Buffer**.
| Tensor| Pixel Coordinate Relation| Explanation
| --------- | :---------:| :-----: |
|Channel-Major Input/Output | P[i, j] = {E[n, h, w, c] | (n=j/H, h=j%H, w=i%W, c=[i/W * 4 + k])}| k=[0, 4)|
|Height-Major Input/Output | P[i, j] = {E[n, h, w, c] | (n=j%N, h=[j/H*4 + k], w=i%W, c=i/W)}| k=[0, 4)|
|Width-Major Input/Output | P[i, j] = {E[n, h, w, c] | (n=j/H, h=j%H, w=[i%W*4 + k], c=i/W)}| k=[0, 4)|
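To make the mapping concrete, the following standalone sketch (not part of the MACE code base) computes the image pixel coordinate and channel slot of a Channel-Major buffer element, following the first row of the table:
```c++
#include <cstdio>

// Channel-major NHWC element E[n, h, w, c] maps to image pixel P[i, j],
// where each pixel holds 4 consecutive channels and k = c % 4 selects the slot.
struct PixelCoord { int i, j, k; };

PixelCoord ChannelMajorCoord(int n, int h, int w, int c, int H, int W) {
  return {(c / 4) * W + w, n * H + h, c % 4};
}

int main() {
  // Example: N=1, H=2, W=3, C=6 -> image size [W * (C+3)/4, N * H] = [6, 2]
  PixelCoord p = ChannelMajorCoord(0, 1, 2, 5, /*H=*/2, /*W=*/3);
  std::printf("pixel (%d, %d), slot %d\n", p.i, p.j, p.k);  // pixel (5, 1), slot 1
  return 0;
}
```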
Filter
---
| Tensor| Buffer| Image Size [Width, Height]| Explanation|
| --------- | :---------:|:--------:|:----:|
|Convolution Filter | HWOI | [RoundUp<4>(I), H * W * (O+3)/4]|Convolution filter format; no difference compared to [H*W*I, (O+3)/4]|
|Depthwise Convolution Filter | HWIM | [H * W * M, (I+3)/4]|Depthwise-Convolution filter format|
Each pixel of an **Image** contains 4 elements. The table below lists the coordinate
relationship between **Image** and **Buffer**.
| Tensor| Pixel Coordinate Relation| Explanation|
| --------- | :---------:| :-----:|
|Convolution Filter | P[m, n] = {E[h, w, o, i] &#124; (h=T/W, w=T%W, o=[n/HW*4+k], i=m)}| HW= H * W, T=n%HW, k=[0, 4)|
|Depthwise Convolution Filter | P[m, n] = {E[h, w, i, 0] &#124; (h=m/W, w=m%W, i=[n*4+k])}| only supports multiplier == 1, k=[0, 4)|
1-D Argument
---
| Tensor| Buffer| Image Size [Width, Height]| Explanation|
| --------- | :---------:|:--------:|:----:|
|1-D Argument | W | [(W+3)/4, 1] | 1D argument format, e.g. Bias|
Each pixel of an **Image** contains 4 elements. The table below lists the coordinate
relationship between **Image** and **Buffer**.
| Tensor| Pixel Coordinate Relation| Explanation|
| --------- | :---------:| :-----:|
|1-D Argument | P[i, 0] = {E[w] &#124; w=i*4+k}| k=[0, 4)|
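A corresponding standalone sketch for the 1-D argument mapping:
```c++
// A length-W 1-D argument (e.g. a bias) occupies (W + 3) / 4 image pixels.
// Element E[w] lives in pixel P[w / 4, 0], channel slot w % 4.
inline int ArgPixel(int w) { return w / 4; }
inline int ArgSlot(int w) { return w % 4; }
```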
Adding a new Op
===============
You can create a custom op if it is not supported yet.
To add a custom op, you need to finish the following steps:
Define the Op class
--------------------
Define the new Op class in `mace/ops/my_custom_op.h`.
```c++
#ifndef MACE_OPS_MY_CUSTOM_OP_H_
#define MACE_OPS_MY_CUSTOM_OP_H_
#include "mace/core/operator.h"
#include "mace/kernels/custom_op.h"
#include "mace/kernels/my_custom_op.h"
namespace mace {
namespace ops {
template <DeviceType D, typename T>
class MyCustomOp : public Operator<D, T> {
public:
MyCustomOp(const OperatorDef &op_def, Workspace *ws)
: Operator<D, T>(op_def, ws),
functor_() {}
// ... (Run() and other members elided in this excerpt) ...
OP_OUTPUT_TAGS(OUTPUT);
private:
kernels::MyCustomOpFunctor<D, T> functor_;
};
} // namespace ops
} // namespace mace
#endif // MACE_OPS_MY_CUSTOM_OP_H_
```
Register the new Op
--------------------
Define the Op's registering function in `mace/ops/my_custom_op.cc`.
```c++
#include "mace/ops/my_custom_op.h"
namespace mace {
namespace ops {
void Register_My_Custom_Op(OperatorRegistry *op_registry) {
REGISTER_OPERATOR(op_registry, OpKeyBuilder("my_custom_op")
.Device(DeviceType::CPU)
.TypeConstraint<float>("T")
.Build(),
MyCustomOp<DeviceType::CPU, float>);
REGISTER_OPERATOR(op_registry, OpKeyBuilder("my_custom_op")
.Device(DeviceType::OPENCL)
.TypeConstraint<float>("T")
.Build(),
MyCustomOp<DeviceType::OPENCL, float>);
REGISTER_OPERATOR(op_registry, OpKeyBuilder("my_custom_op")
.Device(DeviceType::OPENCL)
.TypeConstraint<half>("T")
.Build(),
MyCustomOp<DeviceType::OPENCL, half>);
}
} // namespace ops
} // namespace mace
```
And then register the new Op in `mace/core/operator.cc`.
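The exact contents of `mace/core/operator.cc` vary between versions; as a sketch, the operator registry simply calls each Op's registering function, so the new Op is added alongside the existing entries:
```c++
// In mace/core/operator.cc (sketch; follow the existing entries in the file).
namespace mace {
namespace ops {
// Declared in mace/ops/my_custom_op.cc.
extern void Register_My_Custom_Op(OperatorRegistry *op_registry);
}  // namespace ops

OperatorRegistry::OperatorRegistry() {
  // ... existing Register_* calls ...
  ops::Register_My_Custom_Op(this);
}
}  // namespace mace
```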
Implement the Op kernel code
----------------------------
You need to implement the CPU kernel in `mace/kernels/my_custom_op.h` and
optionally the OpenCL kernel in `mace/kernels/opencl/my_custom_op_opencl.cc` and
`mace/kernels/opencl/cl/my_custom_op.cl`. You can also optimize the CPU
kernel with NEON.
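As a starting point, here is a minimal CPU functor sketch; the functor interface below is an assumption for illustration, so mirror an existing functor in `mace/kernels` for the real signatures:
```c++
// mace/kernels/my_custom_op.h (sketch)
#ifndef MACE_KERNELS_MY_CUSTOM_OP_H_
#define MACE_KERNELS_MY_CUSTOM_OP_H_

#include "mace/core/tensor.h"

namespace mace {
namespace kernels {

template <DeviceType D, typename T>
struct MyCustomOpFunctor {
  void operator()(const Tensor *input, Tensor *output) {
    // Naive CPU implementation; optimize with NEON later if necessary.
    output->Resize(input->shape());
    const T *in = input->data<T>();
    T *out = output->mutable_data<T>();
    for (index_t i = 0; i < input->size(); ++i) {
      out[i] = in[i];  // placeholder computation
    }
  }
};

}  // namespace kernels
}  // namespace mace

#endif  // MACE_KERNELS_MY_CUSTOM_OP_H_
```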
Add test and benchmark
----------------------
It's strongly recommended to add unit tests and micro benchmarks for your
new Op. If you wish to contribute back, they are required.
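For instance, a test could look like the following gtest-style sketch (the `RunMyCustomOp` helper is a stand-in, not a MACE API; real tests should use the existing ops test utilities to build and run a small net on each device):
```c++
// mace/ops/my_custom_op_test.cc (sketch)
#include <vector>
#include "gtest/gtest.h"

// Stand-in for running the op through the test harness and collecting output.
static std::vector<float> RunMyCustomOp(const std::vector<float> &input) {
  return input;  // identity placeholder, matching the functor sketch above
}

TEST(MyCustomOpTest, SimpleCPU) {
  const std::vector<float> input = {1.0f, 2.0f, 3.0f};
  EXPECT_EQ(input, RunMyCustomOp(input));
}
```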
Contributing guide
==================
License
-------
Each source file should contain a license header. See the existing files
as an example.
Python coding style
-------------------
Changes to Python code should conform to [PEP8 Style Guide for Python
Code](https://www.python.org/dev/peps/pep-0008/).
You can use pycodestyle to check the style.
C++ coding style
----------------
Changes to C++ code should conform to [Google C++ Style
Guide](https://google.github.io/styleguide/cppguide.html).
You can use cpplint to check the style and use clang-format to format
the code:
```sh
clang-format -style="{BasedOnStyle: google, \
DerivePointerAlignment: false, \
PointerAlignment: Right, \
BinPackParameters: false}" $file
```
C++ logging guideline
---------------------
The rules of VLOG levels:
```
0. Ad hoc debug logging, should only be added in tests or for temporary ad hoc
   debugging
1. Important network-level debug/latency trace logs (an Op run should never
   generate level 1 vlogs)
2. Important op-level latency trace logs
3. Unimportant debug/latency trace logs
4. Verbose debug/latency trace logs
```
C++ macros
----------
C++ macros should start with `MACE_`, except for most common ones like `LOG`
and `VLOG`.
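For example (an illustrative, hypothetical definition; follow the existing macros in the code base):
```c++
// Project-specific macros carry the MACE_ prefix to avoid collisions.
#define MACE_DISABLE_COPY_AND_ASSIGN(ClassName) \
  ClassName(const ClassName &) = delete;        \
  ClassName &operator=(const ClassName &) = delete
```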
Memory layout
===========================
CPU runtime memory layout
-------------------------
The CPU tensor buffer is organized in the following order:
+-----------------------------+--------------+
| Tensor type | Buffer |
+=============================+==============+
| Intermediate input/output | NCHW |
+-----------------------------+--------------+
| Convolution Filter | OIHW |
+-----------------------------+--------------+
| Depthwise Convolution Filter| MIHW |
+-----------------------------+--------------+
| 1-D Argument, length = W | W |
+-----------------------------+--------------+
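For example, element (n, c, h, w) of an NCHW intermediate buffer of shape
[N, C, H, W] lives at the following flat offset (a standalone illustrative
sketch, not MACE code):

.. code:: cpp

    #include <cstddef>

    // Flat offset of element (n, c, h, w) in an NCHW buffer of shape [N, C, H, W].
    inline std::size_t OffsetNCHW(std::size_t n, std::size_t c,
                                  std::size_t h, std::size_t w,
                                  std::size_t C, std::size_t H, std::size_t W) {
      return ((n * C + c) * H + h) * W + w;
    }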
OpenCL runtime memory layout
-----------------------------
The OpenCL runtime uses 2D images with the CL_RGBA channel order as tensor
storage, which requires OpenCL 1.2 or above.
The way Tensor data is mapped to the OpenCL 2D image (RGBA) is critical for
kernel performance.
In CL_RGBA channel order, each 2D image pixel contains 4 data items.
The following tables describe the mapping from different types of tensors to
the 2D RGBA image.
Input/Output Tensor
~~~~~~~~~~~~~~~~~~~
The Input/Output Tensor is stored in NHWC format:
+---------------------------+--------+----------------------------+-----------------------------+
|Tensor type | Buffer | Image size [width, height] | Explanation |
+===========================+========+============================+=============================+
|Channel-Major Input/Output | NHWC | [W * (C+3)/4, N * H] | Default Input/Output format |
+---------------------------+--------+----------------------------+-----------------------------+
|Height-Major Input/Output | NHWC | [W * C, N * (H+3)/4] | Winograd Convolution format |
+---------------------------+--------+----------------------------+-----------------------------+
|Width-Major Input/Output | NHWC | [(W+3)/4 * C, N * H] | Winograd Convolution format |
+---------------------------+--------+----------------------------+-----------------------------+
Each pixel of the **Image** contains 4 elements. The table below lists the
coordinate relationship between **Image** and **Buffer**.
+---------------------------+-------------------------------------------------------------------------+-------------+
|Tensor type | Pixel coordinate relationship | Explanation |
+===========================+=========================================================================+=============+
|Channel-Major Input/Output | P[i, j] = {E[n, h, w, c] &#124; (n=j/H, h=j%H, w=i%W, c=[i/W * 4 + k])} | k=[0, 4) |
+---------------------------+-------------------------------------------------------------------------+-------------+
|Height-Major Input/Output | P[i, j] = {E[n, h, w, c] &#124; (n=j%N, h=[j/H*4 + k], w=i%W, c=i/W)} | k=[0, 4) |
+---------------------------+-------------------------------------------------------------------------+-------------+
|Width-Major Input/Output | P[i, j] = {E[n, h, w, c] &#124; (n=j/H, h=j%H, w=[i%W*4 + k], c=i/W)} | k=[0, 4) |
+---------------------------+-------------------------------------------------------------------------+-------------+
Filter Tensor
~~~~~~~~~~~~~
+----------------------------+------+---------------------------------+------------------------------------------------------------------------------+
| Tensor |Buffer| Image size [width, height] | Explanation |
+============================+======+=================================+==============================================================================+
|Convolution Filter          | HWOI | [RoundUp<4>(I), H * W * (O+3)/4]|Convolution filter format; no difference compared to [H*W*I, (O+3)/4]         |
+----------------------------+------+---------------------------------+------------------------------------------------------------------------------+
|Depthwise Convolution Filter| HWIM | [H * W * M, (I+3)/4]            |Depthwise-Convolution filter format                                           |
+----------------------------+------+---------------------------------+------------------------------------------------------------------------------+
Each pixel of the **Image** contains 4 elements. The table below lists the
coordinate relationship between **Image** and **Buffer**.
+----------------------------+-------------------------------------------------------------------+---------------------------------------+
|Tensor type | Pixel coordinate relationship | Explanation |
+============================+===================================================================+=======================================+
|Convolution Filter | P[m, n] = {E[h, w, o, i] &#124; (h=T/W, w=T%W, o=[n/HW*4+k], i=m)}| HW= H * W, T=n%HW, k=[0, 4) |
+----------------------------+-------------------------------------------------------------------+---------------------------------------+
|Depthwise Convolution Filter| P[m, n] = {E[h, w, i, 0] &#124; (h=m/W, w=m%W, i=[n*4+k])}        | only support multiplier == 1, k=[0, 4)|
+----------------------------+-------------------------------------------------------------------+---------------------------------------+
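As an illustration, a standalone sketch (not MACE code) of the convolution
filter mapping in the first row of the table:

.. code:: cpp

    // HWOI filter element E[h, w, o, i] maps to image pixel P[m, n], slot k:
    //   m = i,  n = (o / 4) * (H * W) + h * W + w,  k = o % 4
    struct FilterPixel { int m, n, k; };

    inline FilterPixel ConvFilterCoord(int h, int w, int o, int i, int H, int W) {
      return {i, (o / 4) * (H * W) + h * W + w, o % 4};
    }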
1-D Argument Tensor
~~~~~~~~~~~~~~~~~~~
+----------------+----------+------------------------------+---------------------------------+
| Tensor type | Buffer | Image size [width, height] | Explanation |
+================+==========+==============================+=================================+
| 1-D Argument | W | [(W+3)/4, 1] | 1D argument format, e.g. Bias |
+----------------+----------+------------------------------+---------------------------------+
Each pixel of the **Image** contains 4 elements. The table below lists the
coordinate relationship between **Image** and **Buffer**.
+--------------+---------------------------------+-------------+
| Tensor type | Pixel coordinate relationship | Explanation |
+==============+=================================+=============+
|1-D Argument | P[i, 0] = {E[w] &#124; w=i*4+k} | k=[0, 4) |
+--------------+---------------------------------+-------------+
Docker Images
==========================================
* Log in to [Xiaomi Docker Registry](http://docs.api.xiaomi.net/docker-registry/)
```
docker login cr.d.xiaomi.net
```
* Build with `Dockerfile`
```
docker build -t cr.d.xiaomi.net/mace/mace-dev .
```
* Pull image from docker registry
```
docker pull cr.d.xiaomi.net/mace/mace-dev
```
* Create container
```
# Set 'host' network to use ADB
docker run -it --rm -v /local/path:/container/path --net=host cr.d.xiaomi.net/mace/mace-dev /bin/bash
```
Frequently Asked Questions
==========================
Why is the generated static library file size so huge?
-------------------------------------------------------
The static library is simply an archive of a set of object files, which are
intermediate and contain much extra information. Please check whether the
final binary file size is as expected.
Why is the generated binary file (including shared library) size so huge?
-------------------------------------------------------------------------
When compiling the model into C++ code, the final binary may contain extra
debug symbols, which usually take a lot of space. Try to strip the shared
library or binary. The common file size overhead, including the compiled
model (excluding the model weights), should be less than 2MB after stripping.
If the model weights are embedded into the binary, the extra overhead should be
around {model weights size in float32}/2 (e.g. roughly 2MB for 4MB of float32
weights).
OpenCL allocator failed with CL_OUT_OF_RESOURCES
------------------------------------------------
The OpenCL runtime usually requires contiguous virtual memory for its image
buffers; the error occurs when the OpenCL driver can't find enough contiguous
address space due to high memory usage or fragmentation. Several solutions can
be tried:
* Change the model to reduce its memory usage
* Split the Op with the biggest single memory buffer
* Change from armeabi-v7a to arm64-v8a to expand the virtual address space
* Reduce the memory consumption of other modules in the same process
Why is the performance worse than the official result for the same model?
-------------------------------------------------------------------------
The power options may not be set properly; see `mace/public/mace_runtime.h` for
details.
Why does the UI become less responsive when running a model with the GPU runtime?
----------------------------------------------------------------------------------
Try setting `limit_opencl_kernel_time` to `1`. If the problem persists, try
modifying the source code to use even smaller time intervals, or switch to the
CPU or DSP runtime.
How to include more than one deployment file in one application (process)?
------------------------------------------------------------------------------
This case may happen when an application is developed by multiple teams as
submodules. If all the submodules are linked into a single shared library,
then using the same version of MiAI Compute Engine will resolve the issue.
Otherwise, when different deployment models are contained in different shared
libraries, it's not required to use the same MiAI version, but you should
control the symbols exported from each shared library. This is actually a
best practice for all shared libraries; read about the GNU loader
version script for more details.
Create a model deployment
=========================
Each YAML deployment script describes a case of deployment (for example,
a smart camera application may contain face recognition, object recognition,
and voice recognition models, which can be defined in one deployment file),
and will generate one static library (if more than one ABI is specified,
there will be one static library for each). Each YAML script can contain one
or more models.
Model deployment file example
-------------------------------
TODO: change to a link to a standalone file with comments.
.. code:: yaml
# The file name of this configuration will be used as the name of the generated library: libmace-${filename}.a
target_abis: [armeabi-v7a, arm64-v8a]
# The SoC id of the target device, which can be obtained with `adb shell getprop | grep ro.board.platform | cut -d [ -f3 | cut -d ] -f1`
target_socs: [msm8998]
embed_model_data: 1
vlog_level: 0
models: # One deployment file can contain configurations of multiple models; the generated library will contain all of them
first_net: # The tag of the model, which is used to refer to the model when invoking it
platform: tensorflow
model_file_path: path/to/model64.pb # also support http:// and https://
model_sha256_checksum: 7f7462333406e7dea87222737590ebb7d94490194d2f21a7d72bafa87e64e9f9
input_nodes: input_node
output_nodes: output_node
input_shapes: 1,64,64,3
output_shapes: 1,64,64,2
runtime: gpu
limit_opencl_kernel_time: 0
dsp_mode: 0
obfuscate: 1
fast_conv: 0
input_files:
- path/to/input_files # support http://
second_net:
platform: caffe
model_file_path: path/to/model.prototxt
weight_file_path: path/to/weight.caffemodel
model_sha256_checksum: 05d92625809dc9edd6484882335c48c043397aed450a168d75eb8b538e86881a
weight_sha256_checksum: 05d92625809dc9edd6484882335c48c043397aed450a168d75eb8b538e86881a
input_nodes:
- input_node0
- input_node1
output_nodes:
- output_node0
- output_node1
input_shapes:
- 1,256,256,3
- 1,128,128,3
output_shapes:
- 1,256,256,2
- 1,1,1,2
runtime: cpu
limit_opencl_kernel_time: 1
dsp_mode: 0
obfuscate: 1
fast_conv: 0
input_files:
- path/to/input_files # support http://
Configurations
--------------------
+--------------------------+----------------------------------------------------------------------------------------+
| Configuration key | Description |
+==========================+========================================================================================+
| target_abis | The target ABI to build, can be one or more of 'host', 'armeabi-v7a' or 'arm64-v8a' |
+--------------------------+----------------------------------------------------------------------------------------+
| embed_model_data         | Whether to embed the model weights in the code, defaults to 1                         |
+--------------------------+----------------------------------------------------------------------------------------+
| platform | The source framework, tensorflow or caffe |
+--------------------------+----------------------------------------------------------------------------------------+
| model_file_path | The path of the model file, can be local or remote |
+--------------------------+----------------------------------------------------------------------------------------+
| weight_file_path         | The path of the model weights file, used by Caffe models                              |
+--------------------------+----------------------------------------------------------------------------------------+
| model_sha256_checksum | The SHA256 checksum of the model file |
+--------------------------+----------------------------------------------------------------------------------------+
| weight_sha256_checksum   | The SHA256 checksum of the weight file, used by Caffe models                          |
+--------------------------+----------------------------------------------------------------------------------------+
| input_nodes | The input node names, one or more strings |
+--------------------------+----------------------------------------------------------------------------------------+
| output_nodes | The output node names, one or more strings |
+--------------------------+----------------------------------------------------------------------------------------+
| input_shapes | The shapes of the input nodes, in NHWC order |
+--------------------------+----------------------------------------------------------------------------------------+
| output_shapes | The shapes of the output nodes, in NHWC order |
+--------------------------+----------------------------------------------------------------------------------------+
| runtime | The running device, one of CPU, GPU or DSP |
+--------------------------+----------------------------------------------------------------------------------------+
| limit_opencl_kernel_time | Whether to split the OpenCL kernel within 1 ms to keep the UI responsive, defaults to 0|
+--------------------------+----------------------------------------------------------------------------------------+
| dsp_mode                 | Controls the DSP precision and performance; the default 0 works for most cases        |
+--------------------------+----------------------------------------------------------------------------------------+
| obfuscate                | Whether to obfuscate the model operator names, defaults to 0                          |
+--------------------------+----------------------------------------------------------------------------------------+
| fast_conv | Whether to enable Winograd convolution, **will increase memory consumption** |
+--------------------------+----------------------------------------------------------------------------------------+
| input_files | Specify Numpy validation inputs. When not provided, [-1, 1] random values will be used |
+--------------------------+----------------------------------------------------------------------------------------+
Docker Images
=============
- Log in to `Xiaomi Docker
  Registry <http://docs.api.xiaomi.net/docker-registry/>`__
``docker login cr.d.xiaomi.net``
- Build with ``Dockerfile``
``docker build -t cr.d.xiaomi.net/mace/mace-dev .``
- Pull image from docker registry
``docker pull cr.d.xiaomi.net/mace/mace-dev``
- Create a container (set 'host' network to use ADB)
  ``docker run -it --rm -v /local/path:/container/path --net=host cr.d.xiaomi.net/mace/mace-dev /bin/bash``
How to build
============
Supported model formats
-----------------------
+----------------+----------------------------------------------------------------------------------------+
| Framework      | Support status                                                                         |
+================+========================================================================================+
| TensorFlow     | Version >= 1.4 is recommended; older versions may not achieve the best performance    |
|                | (TensorFlow is preferred, considering the upcoming Android NN)                         |
+----------------+----------------------------------------------------------------------------------------+
| Caffe          | Version >= 1.0 is recommended; lower versions may not be supported. Consider          |
|                | using TensorFlow instead                                                               |
+----------------+----------------------------------------------------------------------------------------+
| MXNet          | Not supported yet                                                                      |
+----------------+----------------------------------------------------------------------------------------+
| ONNX           | Not supported yet                                                                      |
+----------------+----------------------------------------------------------------------------------------+
Environment requirements
------------------------
``mace`` provides a docker image with the complete development and runtime environment; the image files can be found under ``./docker/``. Launch command:
.. code:: sh
sudo docker pull cr.d.xiaomi.net/mace/mace-dev
sudo docker run -it --rm --privileged -v /dev/bus/usb:/dev/bus/usb --net=host -v /local/path:/container/path cr.d.xiaomi.net/mace/mace-dev /bin/bash
If you prefer to set up the environment on your own machine, the requirements are as follows:
+---------------------+-----------------+---------------------------------------------------------------------------------------------------+
| Software            | Version         | Install command                                                                                   |
+=====================+=================+===================================================================================================+
| bazel | >= 0.5.4 | - |
+---------------------+-----------------+---------------------------------------------------------------------------------------------------+
| android-ndk | r12c | - |
+---------------------+-----------------+---------------------------------------------------------------------------------------------------+
| adb | >= 1.0.32 | apt install -y android-tools-adb |
+---------------------+-----------------+---------------------------------------------------------------------------------------------------+
| tensorflow | 1.4.0 | pip install tensorflow==1.4.0 |
+---------------------+-----------------+---------------------------------------------------------------------------------------------------+
| scipy | >= 1.0.0 | pip install scipy |
+---------------------+-----------------+---------------------------------------------------------------------------------------------------+
| jinja2 | >= 2.10 | pip install jinja2 |
+---------------------+-----------------+---------------------------------------------------------------------------------------------------+
| PyYaml | >= 3.12 | pip install pyyaml |
+---------------------+-----------------+---------------------------------------------------------------------------------------------------+
| docker(for caffe) | >= 17.09.0-ce | `install doc <https://docs.docker.com/install/linux/docker-ce/ubuntu/#set-up-the-repository>`__ |
+---------------------+-----------------+---------------------------------------------------------------------------------------------------+
Usage
-----
1. Check out the code at the latest tag
**Use the code at the latest tag whenever possible, and do not use the HEAD of the master branch directly.**
.. code:: sh
git clone git@v9.git.n.xiaomi.com:deep-computing/mace.git
# update
git fetch --all --tags --prune
# get latest tag version
tag_name=`git describe --abbrev=0 --tags`
# checkout to latest tag branch
git checkout -b ${tag_name} tags/${tag_name}
2. Model optimization
- TensorFlow
Applying a series of transformations to a trained TensorFlow model can improve
on-device inference speed. TensorFlow provides the official
`TensorFlow Graph Transform
Tool <https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/graph_transforms/README.md>`__
for model optimization (the tool is already provided in the Docker image, and can also be `downloaded <http://cnbj1-inner-fds.api.xiaomi.net/mace/tool/transform_graph>`__ directly or built from the official sources). The optimization commands for GPU and DSP models are listed below:
.. code:: sh
# GPU model:
./transform_graph \
--in_graph=tf_model.pb \
--out_graph=tf_model_opt.pb \
--inputs='input' \
--outputs='output' \
--transforms='strip_unused_nodes(type=float, shape="1,64,64,3")
strip_unused_nodes(type=float, shape="1,64,64,3")
remove_nodes(op=Identity, op=CheckNumerics)
fold_constants(ignore_errors=true)
fold_batch_norms
fold_old_batch_norms
strip_unused_nodes
sort_by_execution_order'
# DSP model:
./transform_graph \
--in_graph=tf_model.pb \
--out_graph=tf_model_opt.pb \
--inputs='input' \
--outputs='output' \
--transforms='strip_unused_nodes(type=float, shape="1,64,64,3")
strip_unused_nodes(type=float, shape="1,64,64,3")
remove_nodes(op=Identity, op=CheckNumerics)
fold_constants(ignore_errors=true)
fold_batch_norms
fold_old_batch_norms
backport_concatv2
quantize_weights(minimum_size=2)
quantize_nodes
strip_unused_nodes
sort_by_execution_order'
- Caffe
Only the latest Caffe version is supported; please upgrade old models with Caffe's own tools.
.. code:: bash
# Upgrade prototxt
$CAFFE_ROOT/build/tools/upgrade_net_proto_text MODEL.prototxt MODEL.new.prototxt
# Upgrade caffemodel
$CAFFE_ROOT/build/tools/upgrade_net_proto_binary MODEL.caffemodel MODEL.new.caffemodel
3. Build the model static library
Building the model library requires a target device, **and the static library must be built with a device of the target SoC attached.**
We provide the ``mace_tools.py`` tool to convert model files into a static library. Steps for using ``tools/mace_tools.py``:
3.2 Run the ``tools/mace_tools.py`` script
.. code:: sh
# print help message
# python tools/mace_tools.py --help
# --config path to the deployment file
# --output_dir output directory of the build results, defaults to `./build`
# --round number of times `examples/mace_run` runs the model, defaults to `1`
# --tuning whether to tune the OpenCL parameters, usually only used by developers, defaults to `true`
# --mode running mode, one of build/run/validate/merge/all/benchmark, defaults to `all`
# Only convert the models and build the static library
python tools/mace_tools.py --config=models/config.yaml --mode=build
# Measure the running time of the models
python tools/mace_tools.py --config=models/config.yaml --mode=run --round=1000
# Compare the results of running the converted models on MACE against running the original
# models with TensorFlow or Caffe directly; the similarity is measured by `cosine distance`.
# On OpenCL devices, a similarity >= `0.995` passes by default; on DSP devices it must reach `0.930`.
python tools/mace_tools.py --config=models/config.yaml --mode=validate --round=1000
# Merge several already-built models into one static library
# e.g. if 8 models were built but only 2 are needed, there is no need to rebuild;
# just edit the global deployment file and merge
python tools/mace_tools.py --config=models/config.yaml --mode=merge
# Run all of the above (can be used for speed tests; round=20 is recommended)
python tools/mace_tools.py --config=models/config.yaml --mode=all --round=1000
# Model benchmark: show the running time of each Op
python tools/mace_tools.py --config=models/config.yaml --mode=benchmark
# Check the memory usage of a running model (if there are several models, you may need to
# comment out all but one model's configuration)
python tools/mace_tools.py --config=models/config.yaml --mode=run --round=10000 &
adb shell dumpsys meminfo | grep mace_run
sleep 10
kill %1
4. Release
The previous steps produce the library files that contain the models. In the application code, only the following 3 groups of files need to be included (``./build/`` is the default output directory):
Header files (including mace.h and one header per model):
* ``./build/${project_name}/${target_abi}/include/mace/public/*.h``
Static libraries (including the MACE engine, OpenCL and the model libraries):
* ``./build/${project_name}/${target_abi}/*.a``
Dynamic library (only needed when some model uses the DSP runtime):
* ``./build/${project_name}/${target_abi}/libhexagon_controller.so``
Model data file (generated only when EMBED_MODEL_DATA=0):
* ``./build/${project_name}/data/${MODEL_TAG}.data``
Intermediate build files:
* ``./build/${project_name}/build/``
Tarball of the library files:
* ``./build/${project_name}/${project_name}.tar.gz``
5. Usage
See ``mace/examples/mace_run.cc`` for the complete usage flow; the key steps are listed below.
.. code:: cpp
// Include headers
#include "mace/public/mace.h"
#include "mace/public/{MODEL_TAG}.h"
// 0. Set up the internal storage
const std::string file_path = "/path/to/store/internal/files";
std::shared_ptr<KVStorageFactory> storage_factory(
    new FileStorageFactory(file_path));
ConfigKVStorageFactory(storage_factory);
//1. Load the model data from a file or from code; custom loading is also
//   possible (e.g. with your own decompression or decryption).
//   Pass nullptr if the model data is embedded in the code.
unsigned char *model_data = mace::MACE_MODEL_TAG::LoadModelData(FLAGS_model_data_file.c_str());
//2. Create the net object
NetDef net_def = mace::MACE_MODEL_TAG::CreateNet(model_data);
//3. Declare the device type (must match the runtime specified at build time)
DeviceType device_type = DeviceType::OPENCL;
//4. Define the arrays of input and output names
std::vector<std::string> input_names = {...};
std::vector<std::string> output_names = {...};
//5. Create the input and output objects
std::map<std::string, mace::MaceTensor> inputs;
std::map<std::string, mace::MaceTensor> outputs;
for (size_t i = 0; i < input_count; ++i) {
// Allocate input and output
int64_t input_size =
std::accumulate(input_shapes[i].begin(), input_shapes[i].end(), 1,
std::multiplies<int64_t>());
auto buffer_in = std::shared_ptr<float>(new float[input_size],
std::default_delete<float[]>());
// load input
...
inputs[input_names[i]] = mace::MaceTensor(input_shapes[i], buffer_in);
}
for (size_t i = 0; i < output_count; ++i) {
int64_t output_size =
std::accumulate(output_shapes[i].begin(), output_shapes[i].end(), 1,
std::multiplies<int64_t>());
auto buffer_out = std::shared_ptr<float>(new float[output_size],
std::default_delete<float[]>());
outputs[output_names[i]] = mace::MaceTensor(output_shapes[i], buffer_out);
}
//6. Create the MaceEngine object
mace::MaceEngine engine(&net_def, device_type, input_names, output_names);
//7. If the device type is OPENCL or HEXAGON, model_data can be released here
if (device_type == DeviceType::OPENCL || device_type == DeviceType::HEXAGON) {
  mace::MACE_MODEL_TAG::UnloadModelData(model_data);
}
//8. Run the model and get the results
engine.Run(inputs, &outputs);
Introduction
============
TODO: describe the concepts and workflow with a diagram.
![alt text](workflow.jpg "MiAI workflow")
TODO: describe the runtime.
Operator lists
==============
+----------------------------------+--------------+--------+-------------------------------------------------------+
| Operator | Android NN | Status | Remark |
+==================================+==============+========+=======================================================+
| ADD | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| AVERAGE\_POOL\_2D | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| BATCH\_NORM | | Y | Fusion with activation is supported |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| BIAS\_ADD | | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| CHANNEL\_SHUFFLE | | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| CONCATENATION | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| CONV\_2D | Y | Y | Fusion with BN and activation layer is supported |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| DEPTHWISE\_CONV\_2D | Y | Y | Only multiplier = 1 is supported; Fusion is supported |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| DEPTH\_TO\_SPACE | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| DEQUANTIZE | Y | | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| EMBEDDING\_LOOKUP | Y | | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| FLOOR | Y | | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| FULLY\_CONNECTED | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| GROUP\_CONV\_2D | | | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| HASHTABLE\_LOOKUP | Y | | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| L2\_NORMALIZATION | Y | | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| L2\_POOL\_2D | Y | | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| LOCAL\_RESPONSE\_NORMALIZATION | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| LOGISTIC | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| LSH\_PROJECTION | Y | | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| LSTM | Y | | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| MATMUL | | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| MAX\_POOL\_2D | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| MUL | Y | | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| PSROI\_ALIGN | | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| PRELU | | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| RELU | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| RELU1 | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| RELU6 | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| RELUX | | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| RESHAPE | Y | Y | Limited support |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| RESIZE\_BILINEAR | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| RNN | Y | | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| RPN\_PROPOSAL\_LAYER | | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| SOFTMAX | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| SPACE\_TO\_DEPTH | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| SVDF | Y | | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| TANH | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
MiAI Compute Engine Documentation
=================================
Welcome to MiAI Compute Engine documentation.
The main documentation is organized into the following sections:
Contents
--------
.. toctree::
:maxdepth: 1
:caption: Getting started
:name: sec-start
getting_started/introduction
getting_started/create_a_model_deployment
getting_started/how_to_build
getting_started/docker
getting_started/op_lists
.. toctree::
:maxdepth: 1
:caption: Development
:name: sec-devel
development/contributing
development/adding_a_new_op
development/memory_layout
.. toctree::
:maxdepth: 1
:caption: FAQ
:name: sec-faq
faq
Introduction
============
**MACE** - *Mobile(Mi) Accelerated Compute Engine Library* is a neural network acceleration engine for mobile devices developed in-house by Xiaomi.
## Highlights
1. Fast
* Optimized specifically for Xiaomi phone SoCs (Qualcomm, MTK, Pinecone), with GPU acceleration support (the DSP runtime is based on nnlib); faster than Qualcomm's SNPE framework on mainstream Qualcomm platforms (note: speed depends on the model structure; for depthwise conv2d and 1x1 convolutions MACE is on par with SNPE, while for 3x3 and generic convolutions it is faster. Also, on Qualcomm platforms SNPE is currently clearly faster than open source frameworks such as TensorFlow Lite, Caffe/Caffe2, ARM Compute Library, Tencent ncnn and Baidu MDL).
* Supports automatic tuning for different SoCs
* Model data is loaded via mmap, so startup is fast
2. Low memory usage
* MACE supports memory optimization based on computation graph dependencies; buffer reuse reduces runtime memory usage, especially for models with simple dependency structures
3. Small footprint
* MACE itself has no external dependencies, and the core code is under 1MB (models excluded)
4. Built-in model protection
* MACE supports model obfuscation and protection: models are compiled directly into executable code instead of data files, and the obfuscation of sensitive code is strengthened, making reverse engineering harder
5. Easy deployment
* The user API is simple, requiring only one header file; MACE is linked into the application as source code or a static library and introduces no extra dynamic libraries or model data files (the DSP runtime needs one extra dynamic library)
## Supported model formats
| Framework | Support status |
| ---------- |:-------:|
| TensorFlow | Version >= 1.4 is recommended; older versions may not achieve the best performance (TensorFlow is preferred, considering the upcoming Android NN) |
| Caffe | Version >= 1.0 is recommended; lower versions may not be supported. Consider using TensorFlow instead |
| MXNet | Not supported yet |
| ONNX | Not supported yet |
## Environment requirements
`mace` provides a docker image with the complete development and runtime environment; the image files can be found under `./docker/`. Launch command:
```sh
sudo docker pull cr.d.xiaomi.net/mace/mace-dev
sudo docker run -it --rm --privileged -v /dev/bus/usb:/dev/bus/usb --net=host -v /local/path:/container/path cr.d.xiaomi.net/mace/mace-dev /bin/bash
```
If you prefer to set up the environment on your own machine, the requirements are as follows:
| Software | Version | Install command |
| -------- |:--------------:|:---------------------:|
| bazel | >= 0.5.4 | - |
| android-ndk | r12c | - |
| adb | >= 1.0.32 | apt install -y android-tools-adb |
| tensorflow | 1.4.0 | pip install tensorflow==1.4.0 |
| scipy | >= 1.0.0 | pip install scipy |
| jinja2 | >= 2.10 | pip install jinja2 |
| PyYaml | >= 3.12 | pip install pyyaml |
| docker(for caffe) | >= 17.09.0-ce | [install doc](https://docs.docker.com/install/linux/docker-ce/ubuntu/#set-up-the-repository) |
## Source layout
```
|-- tools --> scripts for building and running mace
| |-- mace_tools.py
| |-- ...
|
|-- mace
| |-- benchmark
| |
| |-- codegen --> generated C++ code for models, OpenCL binaries and tuning data
| | |-- models
| | |-- opencl
| | |-- opencl_bin
| | |-- tuning
| |
| |-- core
| |
| |-- examples
| | |-- mace_run.cc --> example that runs a mace model
| | |-- ...
| |
| |-- kernels
| |
| |-- ops
| |
| |-- public --> the mace API
|
|-- docker --> Dockerfile for the mace development environment
```
## Usage
1\. Check out the code at the latest tag
**Use the code at the latest tag whenever possible, and do not use the HEAD of the master branch directly.**
```sh
git clone git@v9.git.n.xiaomi.com:deep-computing/mace.git
# update
git fetch --all --tags --prune
# get latest tag version
tag_name=`git describe --abbrev=0 --tags`
# checkout to latest tag branch
git checkout -b ${tag_name} tags/${tag_name}
```
2\. Model optimization
- TensorFlow
Applying a series of transformations to a trained TensorFlow model can improve on-device inference speed. TensorFlow provides the official
[TensorFlow Graph Transform Tool](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/graph_transforms/README.md)
for model optimization (we made some customizations on top of the official tool; it is already provided in the Docker image and can also be [downloaded](http://cnbj1-inner-fds.api.xiaomi.net/mace/tool/transform_graph) directly; `building it from the official sources is not supported for now`). The optimization commands for GPU and DSP models are listed below:
```sh
# GPU model:
./transform_graph \
--in_graph=tf_model.pb \
--out_graph=tf_model_opt.pb \
--inputs='input' \
--outputs='output' \
--transforms='strip_unused_nodes(type=float, shape="1,64,64,3")
strip_unused_nodes(type=float, shape="1,64,64,3")
remove_nodes(op=Identity, op=CheckNumerics)
fold_batch_norms
fold_old_batch_norms
strip_unused_nodes
sort_by_execution_order'
# DSP model:
./transform_graph \
--in_graph=tf_model.pb \
--out_graph=tf_model_opt.pb \
--inputs='input' \
--outputs='output' \
--transforms='strip_unused_nodes(type=float, shape="1,64,64,3")
strip_unused_nodes(type=float, shape="1,64,64,3")
remove_nodes(op=Identity, op=CheckNumerics)
fold_constants(ignore_errors=true)
fold_batch_norms
fold_old_batch_norms
backport_concatv2
quantize_weights(minimum_size=2)
quantize_nodes
strip_unused_nodes
sort_by_execution_order'
```
- Caffe
Only the latest Caffe version is supported; please upgrade old models with Caffe's own tools.
```bash
# Upgrade prototxt
$CAFFE_ROOT/build/tools/upgrade_net_proto_text MODEL.prototxt MODEL.new.prototxt
# Upgrade caffemodel
$CAFFE_ROOT/build/tools/upgrade_net_proto_binary MODEL.caffemodel MODEL.new.caffemodel
```
3\. Build the model static library
Building the model library requires a target device, ***and the static library must be built with a device of the target SoC attached.***
We provide the `mace_tools.py` tool to convert model files into a static library. Steps for using `tools/mace_tools.py`:
3\.1 Deployment file
The deployment file uses the yml format; the configuration keys are as follows:
```yaml
# The file name of this configuration will be used as the name of the generated library: libmace-${filename}.a
target_abis: [armeabi-v7a, arm64-v8a]
# The SoC id of the target device, which can be obtained with `adb shell getprop | grep ro.board.platform | cut -d [ -f3 | cut -d ] -f1`
target_socs: [msm8998]
embed_model_data: 1
models: # One deployment file can contain configurations of multiple models; the generated library will contain all of them
first_net: # The tag of the model, which is used to refer to the model when invoking it
platform: tensorflow
model_file_path: path/to/model64.pb # also support http:// and https://
model_sha256_checksum: 7f7462333406e7dea87222737590ebb7d94490194d2f21a7d72bafa87e64e9f9
input_nodes: input_node
output_nodes: output_node
input_shapes: 1,64,64,3
output_shapes: 1,64,64,2
runtime: gpu
limit_opencl_kernel_time: 0
dsp_mode: 0
obfuscate: 1
fast_conv: 0
input_files:
- path/to/input_files # support http://
second_net:
platform: caffe
model_file_path: path/to/model.prototxt
weight_file_path: path/to/weight.caffemodel
model_sha256_checksum: 05d92625809dc9edd6484882335c48c043397aed450a168d75eb8b538e86881a
weight_sha256_checksum: 05d92625809dc9edd6484882335c48c043397aed450a168d75eb8b538e86881a
input_nodes:
- input_node0
- input_node1
output_nodes:
- output_node0
- output_node1
input_shapes:
- 1,256,256,3
- 1,128,128,3
output_shapes:
- 1,256,256,2
- 1,1,1,2
runtime: cpu
limit_opencl_kernel_time: 1
dsp_mode: 0
obfuscate: 1
fast_conv: 0
input_files:
- path/to/input_files # support http://
```
The meaning of each configuration key is listed below:
| Configuration key | Description |
| ---------- |:--------------:|
| target_abis | The ABIs to run on, including armeabi-v7a and arm64-v8a for Android devices, and 'host' for the developer's workstation. Multiple ABIs can be specified at once |
| embed_model_data | Whether to embed the model weights into the code, defaults to 1 |
| platform | The source framework of the model, tensorflow or caffe |
| model_file_path | The path of the model file; http or https download links are also supported |
| weight_file_path | The path of the weight file; http or https download links are also supported (Caffe model) |
| model_sha256_checksum | The SHA256 checksum of the model file |
| weight_sha256_checksum | The SHA256 checksum of the weight file (Caffe model) |
| input_nodes | The input nodes of the optimized model or of the original framework model; multiple nodes are supported |
| output_nodes | The output nodes of the optimized model or of the original framework model; multiple nodes are supported |
| input_shapes | The input shapes of the model in NHWC format; multiple shapes are supported |
| output_shapes | The output shapes of the model in NHWC format; multiple shapes are supported |
| runtime | The device to run on, one of cpu, gpu and dsp |
| limit_opencl_kernel_time | Limit each OpenCL kernel work group to run within 1 ms; may hurt performance, disabled by default |
| dsp_mode | Configures different DSP computation modes for different precision and performance; the default value 0 usually works |
| obfuscate | Whether to obfuscate the names of the operators inside the model |
| fast_conv | Use the fastest convolution algorithm, **which may increase memory usage** |
| input_files | (optional) Model input files for result validation, which must match input_nodes; if not specified, random values in [-1,1] will be used |
3\.2 Run the `tools/mace_tools.py` script
```sh
# print help message
# python tools/mace_tools.py --help
# --config path to the deployment file
# --output_dir output directory of the build results, defaults to `./build`
# --round number of times `examples/mace_run` runs the model, defaults to `1`
# --tuning whether to tune the OpenCL parameters, usually only used by developers, defaults to `true`
# --mode running mode, one of build/run/validate/merge/all/benchmark, defaults to `all`
# Only convert the models and build the static library
python tools/mace_tools.py --config=models/config.yml --mode=build
# Measure the running time of the models
python tools/mace_tools.py --config=models/config.yml --mode=run --round=1000
# Compare the results of running the converted models on MACE against running the original
# models with TensorFlow or Caffe directly; the similarity is measured by `cosine distance`.
# On OpenCL devices, a similarity >= `0.995` passes by default; on DSP devices it must reach `0.930`.
python tools/mace_tools.py --config=models/config.yml --mode=validate --round=1000
# Merge several already-built models into one static library
# e.g. if 8 models were built but only 2 are needed, there is no need to rebuild;
# just edit the global deployment file and merge
python tools/mace_tools.py --config=models/config.yml --mode=merge
# Run all of the above (can be used for speed tests; round=20 is recommended)
python tools/mace_tools.py --config=models/config.yml --mode=all --round=1000
# Model benchmark: show the running time of each Op
python tools/mace_tools.py --config=models/config.yml --mode=benchmark
# Check the memory usage of a running model (if there are several models, you may need to
# comment out all but one model's configuration)
python tools/mace_tools.py --config=models/config.yml --mode=run --round=10000 &
adb shell dumpsys meminfo | grep mace_run
sleep 10
kill %1
```
4\. Release
The previous steps produce the library files that contain the models. In the application code, only the following 3 groups of files need to be included (`./build/` is the default output directory):
Header files (including mace.h and one header per model):
* `./build/${project_name}/${target_abi}/include/mace/public/*.h`
Static libraries (including the MACE engine, OpenCL and the model libraries):
* `./build/${project_name}/${target_abi}/*.a`
Dynamic library (only needed when some model uses the DSP runtime):
* `./build/${project_name}/${target_abi}/libhexagon_controller.so`
Model data file (generated only when EMBED_MODEL_DATA=0):
* `./build/${project_name}/data/${MODEL_TAG}.data`
Intermediate build files:
* `./build/${project_name}/build/`
Tarball of the library files:
* `./build/${project_name}/${project_name}.tar.gz`
5\. Usage
See `mace/examples/mace_run.cc` for the complete usage flow; the key steps are listed below.
```c++
// Include headers
#include "mace/public/mace.h"
#include "mace/public/{MODEL_TAG}.h"
// 0. Set up the internal storage
const std::string file_path = "/path/to/store/internal/files";
std::shared_ptr<KVStorageFactory> storage_factory(
    new FileStorageFactory(file_path));
ConfigKVStorageFactory(storage_factory);
//1. Load the model data from a file or from code; custom loading is also
//   possible (e.g. with your own decompression or decryption).
//   Pass nullptr if the model data is embedded in the code.
unsigned char *model_data = mace::MACE_MODEL_TAG::LoadModelData(FLAGS_model_data_file.c_str());
//2. Create the net object
NetDef net_def = mace::MACE_MODEL_TAG::CreateNet(model_data);
//3. Declare the device type
DeviceType device_type = DeviceType::GPU;
//4. Define the arrays of input and output names
std::vector<std::string> input_names = {...};
std::vector<std::string> output_names = {...};
//5. Create the input and output objects
std::map<std::string, mace::MaceTensor> inputs;
std::map<std::string, mace::MaceTensor> outputs;
for (size_t i = 0; i < input_count; ++i) {
// Allocate input and output
int64_t input_size =
std::accumulate(input_shapes[i].begin(), input_shapes[i].end(), 1,
std::multiplies<int64_t>());
auto buffer_in = std::shared_ptr<float>(new float[input_size],
std::default_delete<float[]>());
// load input
...
inputs[input_names[i]] = mace::MaceTensor(input_shapes[i], buffer_in);
}
for (size_t i = 0; i < output_count; ++i) {
int64_t output_size =
std::accumulate(output_shapes[i].begin(), output_shapes[i].end(), 1,
std::multiplies<int64_t>());
auto buffer_out = std::shared_ptr<float>(new float[output_size],
std::default_delete<float[]>());
outputs[output_names[i]] = mace::MaceTensor(output_shapes[i], buffer_out);
}
//6. Create the MaceEngine object
mace::MaceEngine engine(&net_def, device_type, input_names, output_names);
//7. If the device type is GPU or HEXAGON, model_data can be released here
if (device_type == DeviceType::GPU || device_type == DeviceType::HEXAGON) {
  mace::MACE_MODEL_TAG::UnloadModelData(model_data);
}
//8. Run the model and get the results
engine.Run(inputs, &outputs);
```
## Operator list
Operators are being continuously added and improved; contact us if you need new features.
| Operator | Android NN | Status | Remark |
| ------------- |:-------------:|:---------:|:-------------:|
| ADD | Y | Y | |
| AVERAGE_POOL_2D | Y | Y | |
| BATCH_NORM | | Y | Fusion with the activation layer is supported |
| BIAS_ADD | | Y | |
| CHANNEL_SHUFFLE | | | |
| CONCATENATION | Y | Y | |
| CONV_2D | Y | Y | Supports stride and dilations; fusion with batch norm and the activation layer is supported |
| DEPTHWISE_CONV_2D | Y | Y | Currently supports multiplier = 1 and fusion with batch norm and the activation layer |
| DEPTH_TO_SPACE | Y | | |
| DEQUANTIZE | Y | | |
| EMBEDDING_LOOKUP | Y | | |
| FLOOR | Y | | |
| FULLY_CONNECTED | Y | Y | |
| GROUP_CONV_2D | | | |
| HASHTABLE_LOOKUP | Y | | |
| L2_NORMALIZATION | Y | | |
| L2_POOL_2D | Y | | |
| LOCAL_RESPONSE_NORMALIZATION | Y | | |
| LOGISTIC | Y | Y | |
| LSH_PROJECTION | Y | | |
| LSTM | Y | | |
| MATMUL | | | |
| MAX_POOL_2D | Y | Y | |
| MUL | Y | | |
| PSROI_ALIGN | | | |
| PRELU | | Y | |
| RELU | Y | Y | |
| RELU1 | Y | Y | |
| RELU6 | Y | Y | |
| RELUX | | Y | |
| RESHAPE | Y | | |
| RESIZE_BILINEAR | Y | Y | |
| RNN | Y | | |
| RPN_PROPOSAL_LAYER | | | |
| SOFTMAX | Y | Y | |
| SPACE_TO_DEPTH | Y | | |
| SVDF | Y | | |
| TANH | Y | Y | |
## Performance comparison
To be added.