Commit 11733d33 authored by Liangliang He

Update documents

Parent 1a9b0045
# MiAI Compute Engine
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
[![build status](http://v9.git.n.xiaomi.com/deep-computing/mace/badges/master/build.svg)](http://v9.git.n.xiaomi.com/deep-computing/mace/pipelines)
## Introduction
**Accelerating neural network inference with heterogeneous computing devices on the phone.**
[Documentation](docs) |
[FAQ](docs/faq.md) |
[Release Notes](RELEASE.md) |
[MiAI Model Zoo](http://v9.git.n.xiaomi.com/deep-computing/mace-models) |
[Demo](mace/android)
Supported Devices: **CPU(NEON)/GPU/DSP**.
**MiAI Compute Engine** is a deep learning inference framework optimized for
mobile heterogeneous computing platforms. The design is focused on the following
targets:
* Performance
* The runtime is highly optimized with NEON, OpenCL and HVX. Besides
inference speed, the initialization speed is also intensively optimized.
* Power consumption
* Chip-dependent power options are included as advanced APIs.
* Memory usage and library footprint
* Graph-level memory allocation optimization and buffer reuse are supported.
* Model protection
* Model protection has been one of the highest-priority features from the
beginning of the design. Various techniques are introduced, like converting
models to C++ code and literal obfuscation.
* Platform coverage
* A good coverage of recent Qualcomm, MediaTek, Pinecone and other ARM-based
chips. The CPU runtime is also compatible with most POSIX systems and
architectures, with limited performance.
## Getting Started
wiki url: [http://v9.git.n.xiaomi.com/deep-computing/mace/wikis/User%20Guide/introduction](http://v9.git.n.xiaomi.com/deep-computing/mace/wikis/User%20Guide/introduction)
## Performance
[MiAI Model Zoo](http://v9.git.n.xiaomi.com/deep-computing/mace-models) contains
several common neural network models, which are built daily against several mobile
phones. The benchmark results can be found on the CI result page.
## Communication
* GitHub issues: bug reports, usage issues, feature requests
* Gitter or Slack:
* QQ group:
## Contributing
Any kind of contribution is welcome. For bug reports and feature requests,
please just open an issue without hesitation. For code contributions, it's
strongly suggested to open an issue for discussion first. For more details,
please refer to [this guide](docs).
## License
[Apache License 2.0](LICENSE).
## Acknowledgement
*MiAI Compute Engine* depends on several open source projects located in
the [third_party](mace/third_party) directory. Particularly, we learned a lot from
the following projects during the development:
* [nnlib](https://source.codeaurora.org/quic/hexagon_nn/nnlib): the DSP runtime
depends on this library.
* [TensorFlow](https://github.com/tensorflow/tensorflow),
[Caffe](https://github.com/BVLC/caffe),
[SNPE](https://developer.qualcomm.com/software/snapdragon-neural-processing-engine-ai),
[ARM ComputeLibrary](https://github.com/ARM-software/ComputeLibrary),
[ncnn](https://github.com/Tencent/ncnn) and many others: we learned many best
practices from these projects.
Finally, we also thank the Qualcomm, Pinecone and MediaTek engineering teams for
their help.
Release Notes
=====
v0.6.0 (2018-04-04)
------
1. Changed MACE header interfaces to include only the necessary methods.
MACE Documentation
---
**How to build the documents**
The documents are based on Sphinx; run the following commands to build them:
```
pip install sphinx sphinx-autobuild
pip install recommonmark
make html
```
After building, the generated documents are located in `_build`.
import recommonmark.parser
import sphinx_rtd_theme
project = u'MiAI Compute Engine'
author = u'%s Developers' % project
copyright = u'2018, %s' % author
Logging
=======
The rules of VLOG levels:
0. Ad hoc debug logging, should only be added in tests or for temporary ad hoc debugging
1. Important network-level debug/latency trace logs (an Op run should never generate level 1 vlogs)
2. Important op-level latency trace logs
3. Unimportant debug/latency trace logs
4. Verbose debug/latency trace logs
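As a rough illustration of how the levels are used (a standalone sketch with a simplified stand-in macro; use MACE's real `VLOG` in actual code):
```c++
#include <iostream>

static int vlog_level = 2;  // typically configured at startup
// Simplified stand-in for a VLOG-style macro, for illustration only.
#define VLOG(lvl) if ((lvl) <= vlog_level) std::cout << "[VLOG " << (lvl) << "] "

int main() {
  double latency_ms = 3.7;
  VLOG(1) << "net run latency: " << latency_ms << " ms\n";    // network level
  VLOG(2) << "conv_1 op latency: " << latency_ms << " ms\n";  // op level
  VLOG(3) << "buffer reuse details\n";                        // filtered out here
  return 0;
}
```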
OpenCL Image Storage Layout
===========================
Use **Image** objects to optimize memory access and parallel computing based on OpenCL 2.0.
Design the corresponding **Image** format to optimize memory access for different Op algorithms.
Each pixel of an **Image** object contains 4 elements (e.g. RGBA).
The following are the **Buffer** and **Image** formats for all **Tensors**.
Input/Output
---
**Mace** uses the NHWC format for Input/Output.
| Tensor| Buffer| Image Size [Width, Height]| Explanation|
| --------- | :---------:|:--------:|:----:|
|Channel-Major Input/Output | NHWC | [W * (C+3)/4, N * H] | Default Input/Output format|
|Height-Major Input/Output | NHWC | [W * C, N * (H+3)/4] | Winograd Convolution format|
|Width-Major Input/Output | NHWC | [(W+3)/4 * C, N * H] | Winograd Convolution format|
Each pixel of an **Image** contains 4 elements. The table below lists the coordinate
relationship between **Image** and **Buffer**.
| Tensor| Pixel Coordinate Relation| Explanation
| --------- | :---------:| :-----: |
|Channel-Major Input/Output | P[i, j] = {E[n, h, w, c] | (n=j/H, h=j%H, w=i%W, c=[i/W * 4 + k])}| k=[0, 4)|
|Height-Major Input/Output | P[i, j] = {E[n, h, w, c] | (n=j%N, h=[j/H*4 + k], w=i%W, c=i/W)}| k=[0, 4)|
|Width-Major Input/Output | P[i, j] = {E[n, h, w, c] | (n=j/H, h=j%H, w=[i%W*4 + k], c=i/W)}| k=[0, 4)|
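To make the mapping concrete, the following standalone sketch (not part of the MACE code base) computes the image pixel coordinate and channel slot of a Channel-Major buffer element, following the first row of the table:
```c++
#include <cstdio>

// Channel-major NHWC element E[n, h, w, c] maps to image pixel P[i, j],
// where each pixel holds 4 consecutive channels and k = c % 4 selects the slot.
struct PixelCoord { int i, j, k; };

PixelCoord ChannelMajorCoord(int n, int h, int w, int c, int H, int W) {
  return {(c / 4) * W + w, n * H + h, c % 4};
}

int main() {
  // Example: N=1, H=2, W=3, C=6 -> image size [W * (C+3)/4, N * H] = [6, 2]
  PixelCoord p = ChannelMajorCoord(0, 1, 2, 5, /*H=*/2, /*W=*/3);
  std::printf("pixel (%d, %d), slot %d\n", p.i, p.j, p.k);  // pixel (5, 1), slot 1
  return 0;
}
```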
Filter
---
| Tensor| Buffer| Image Size [Width, Height]| Explanation|
| --------- | :---------:|:--------:|:----:|
|Convolution Filter | HWOI | [RoundUp<4>(I), H * W * (O+3)/4]|Convolution filter format; no difference compared to [H*W*I, (O+3)/4]|
|Depthwise Convolution Filter | HWIM | [H * W * M, (I+3)/4]|Depthwise-Convolution filter format|
Each pixel of an **Image** contains 4 elements. The table below lists the coordinate
relationship between **Image** and **Buffer**.
| Tensor| Pixel Coordinate Relation| Explanation|
| --------- | :---------:| :-----:|
|Convolution Filter | P[m, n] = {E[h, w, o, i] &#124; (h=T/W, w=T%W, o=[n/HW*4+k], i=m)}| HW= H * W, T=n%HW, k=[0, 4)|
|Depthwise Convolution Filter | P[m, n] = {E[h, w, i, 0] &#124; (h=m/W, w=m%W, i=[n*4+k])}| only supports multiplier == 1, k=[0, 4)|
1-D Argument
---
| Tensor| Buffer| Image Size [Width, Height]| Explanation|
| --------- | :---------:|:--------:|:----:|
|1-D Argument | W | [(W+3)/4, 1] | 1D argument format, e.g. Bias|
Each pixel of an **Image** contains 4 elements. The table below lists the coordinate
relationship between **Image** and **Buffer**.
| Tensor| Pixel Coordinate Relation| Explanation|
| --------- | :---------:| :-----:|
|1-D Argument | P[i, 0] = {E[w] &#124; w=i*4+k}| k=[0, 4)|
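A corresponding standalone sketch for the 1-D argument mapping:
```c++
// A length-W 1-D argument (e.g. a bias) occupies (W + 3) / 4 image pixels.
// Element E[w] lives in pixel P[w / 4, 0], channel slot w % 4.
inline int ArgPixel(int w) { return w / 4; }
inline int ArgSlot(int w) { return w % 4; }
```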
Adding a new Op
===============
You can create a custom op if it is not supported yet.
To add a custom op, you need to finish the following steps:
Define the Op class
--------------------
Define the new Op class in `mace/ops/my_custom_op.h`.
```c++
#ifndef MACE_OPS_MY_CUSTOM_OP_H_
#define MACE_OPS_MY_CUSTOM_OP_H_
#include "mace/core/operator.h"
#include "mace/kernels/custom_op.h"
#include "mace/kernels/my_custom_op.h"
namespace mace {
namespace ops {
template <DeviceType D, typename T>
class MyCustomOp : public Operator<D, T> {
public:
MyCustomOp(const OperatorDef &op_def, Workspace *ws)
: Operator<D, T>(op_def, ws),
functor_() {}
// ... (Run() and other members elided in this excerpt) ...
OP_OUTPUT_TAGS(OUTPUT);
private:
kernels::MyCustomOpFunctor<D, T> functor_;
};
} // namespace ops
} // namespace mace
#endif // MACE_OPS_MY_CUSTOM_OP_H_
```
Register the new Op
--------------------
Define the Op's registering function in `mace/ops/my_custom_op.cc`.
```c++
#include "mace/ops/my_custom_op.h"
namespace mace {
namespace ops {
void Register_My_Custom_Op(OperatorRegistry *op_registry) {
REGISTER_OPERATOR(op_registry, OpKeyBuilder("my_custom_op")
.Device(DeviceType::CPU)
.TypeConstraint<float>("T")
.Build(),
MyCustomOp<DeviceType::CPU, float>);
REGISTER_OPERATOR(op_registry, OpKeyBuilder("my_custom_op")
.Device(DeviceType::OPENCL)
.TypeConstraint<float>("T")
.Build(),
MyCustomOp<DeviceType::OPENCL, float>);
REGISTER_OPERATOR(op_registry, OpKeyBuilder("my_custom_op")
.Device(DeviceType::OPENCL)
.TypeConstraint<half>("T")
.Build(),
MyCustomOp<DeviceType::OPENCL, half>);
}
} // namespace ops
} // namespace mace
```
And then register the new Op in `mace/core/operator.cc`.
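The exact contents of `mace/core/operator.cc` vary between versions; as a sketch, the operator registry simply calls each Op's registering function, so the new Op is added alongside the existing entries:
```c++
// In mace/core/operator.cc (sketch; follow the existing entries in the file).
namespace mace {
namespace ops {
// Declared in mace/ops/my_custom_op.cc.
extern void Register_My_Custom_Op(OperatorRegistry *op_registry);
}  // namespace ops

OperatorRegistry::OperatorRegistry() {
  // ... existing Register_* calls ...
  ops::Register_My_Custom_Op(this);
}
}  // namespace mace
```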
Implement the Op kernel code
----------------------------
You need to implement the CPU kernel in `mace/kernels/my_custom_op.h` and
optionally the OpenCL kernel in `mace/kernels/opencl/my_custom_op_opencl.cc` and
`mace/kernels/opencl/cl/my_custom_op.cl`. You can also optimize the CPU
kernel with NEON.
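As a starting point, here is a minimal CPU functor sketch; the functor interface below is an assumption for illustration, so mirror an existing functor in `mace/kernels` for the real signatures:
```c++
// mace/kernels/my_custom_op.h (sketch)
#ifndef MACE_KERNELS_MY_CUSTOM_OP_H_
#define MACE_KERNELS_MY_CUSTOM_OP_H_

#include "mace/core/tensor.h"

namespace mace {
namespace kernels {

template <DeviceType D, typename T>
struct MyCustomOpFunctor {
  void operator()(const Tensor *input, Tensor *output) {
    // Naive CPU implementation; optimize with NEON later if necessary.
    output->Resize(input->shape());
    const T *in = input->data<T>();
    T *out = output->mutable_data<T>();
    for (index_t i = 0; i < input->size(); ++i) {
      out[i] = in[i];  // placeholder computation
    }
  }
};

}  // namespace kernels
}  // namespace mace

#endif  // MACE_KERNELS_MY_CUSTOM_OP_H_
```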
Add test and benchmark
----------------------
It's strongly recommended to add unit tests and micro benchmarks for your
new Op. If you wish to contribute back, they are required.
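For instance, a test could look like the following gtest-style sketch (the `RunMyCustomOp` helper is a stand-in, not a MACE API; real tests should use the existing ops test utilities to build and run a small net on each device):
```c++
// mace/ops/my_custom_op_test.cc (sketch)
#include <vector>
#include "gtest/gtest.h"

// Stand-in for running the op through the test harness and collecting output.
static std::vector<float> RunMyCustomOp(const std::vector<float> &input) {
  return input;  // identity placeholder, matching the functor sketch above
}

TEST(MyCustomOpTest, SimpleCPU) {
  const std::vector<float> input = {1.0f, 2.0f, 3.0f};
  EXPECT_EQ(input, RunMyCustomOp(input));
}
```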
Contributing guide
==================
License
-------
Each source file should contain a license header. See the existing files
as an example.
Python coding style
-------------------
Changes to Python code should conform to [PEP8 Style Guide for Python
Code](https://www.python.org/dev/peps/pep-0008/).
You can use pycodestyle to check the style.
C++ coding style
----------------
Changes to C++ code should conform to [Google C++ Style
Guide](https://google.github.io/styleguide/cppguide.html).
You can use cpplint to check the style and use clang-format to format
the code:
```sh
clang-format -style="{BasedOnStyle: google, \
DerivePointerAlignment: false, \
PointerAlignment: Right, \
BinPackParameters: false}" $file
```
C++ logging guideline
---------------------
The rules of VLOG levels:
```
0. Ad hoc debug logging, should only be added in tests or for temporary ad hoc
   debugging
1. Important network-level debug/latency trace logs (an Op run should never
   generate level 1 vlogs)
2. Important op-level latency trace logs
3. Unimportant debug/latency trace logs
4. Verbose debug/latency trace logs
```
C++ macros
----------
C++ macros should start with `MACE_`, except for most common ones like `LOG`
and `VLOG`.
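For example (an illustrative, hypothetical definition; follow the existing macros in the code base):
```c++
// Project-specific macros carry the MACE_ prefix to avoid collisions.
#define MACE_DISABLE_COPY_AND_ASSIGN(ClassName) \
  ClassName(const ClassName &) = delete;        \
  ClassName &operator=(const ClassName &) = delete
```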
Memory layout
===========================
CPU runtime memory layout
-------------------------
The CPU tensor buffer is organized in the following order:
+-----------------------------+--------------+
| Tensor type | Buffer |
+=============================+==============+
| Intermediate input/output | NCHW |
+-----------------------------+--------------+
| Convolution Filter | OIHW |
+-----------------------------+--------------+
| Depthwise Convolution Filter| MIHW |
+-----------------------------+--------------+
| 1-D Argument, length = W | W |
+-----------------------------+--------------+
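For example, element (n, c, h, w) of an NCHW intermediate buffer of shape
[N, C, H, W] lives at the following flat offset (a standalone illustrative
sketch, not MACE code):

.. code:: cpp

    #include <cstddef>

    // Flat offset of element (n, c, h, w) in an NCHW buffer of shape [N, C, H, W].
    inline std::size_t OffsetNCHW(std::size_t n, std::size_t c,
                                  std::size_t h, std::size_t w,
                                  std::size_t C, std::size_t H, std::size_t W) {
      return ((n * C + c) * H + h) * W + w;
    }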
OpenCL runtime memory layout
-----------------------------
The OpenCL runtime uses 2D images with the CL_RGBA channel order as tensor
storage, which requires OpenCL 1.2 or above.
The way Tensor data is mapped to the OpenCL 2D image (RGBA) is critical for
kernel performance.
In CL_RGBA channel order, each 2D image pixel contains 4 data items.
The following tables describe the mapping from different types of tensors to
the 2D RGBA image.
Input/Output Tensor
~~~~~~~~~~~~~~~~~~~
The Input/Output Tensor is stored in NHWC format:
+---------------------------+--------+----------------------------+-----------------------------+
|Tensor type | Buffer | Image size [width, height] | Explanation |
+===========================+========+============================+=============================+
|Channel-Major Input/Output | NHWC | [W * (C+3)/4, N * H] | Default Input/Output format |
+---------------------------+--------+----------------------------+-----------------------------+
|Height-Major Input/Output | NHWC | [W * C, N * (H+3)/4] | Winograd Convolution format |
+---------------------------+--------+----------------------------+-----------------------------+
|Width-Major Input/Output | NHWC | [(W+3)/4 * C, N * H] | Winograd Convolution format |
+---------------------------+--------+----------------------------+-----------------------------+
Each pixel of the **Image** contains 4 elements. The table below lists the
coordinate relationship between **Image** and **Buffer**.
+---------------------------+-------------------------------------------------------------------------+-------------+
|Tensor type | Pixel coordinate relationship | Explanation |
+===========================+=========================================================================+=============+
|Channel-Major Input/Output | P[i, j] = {E[n, h, w, c] &#124; (n=j/H, h=j%H, w=i%W, c=[i/W * 4 + k])} | k=[0, 4) |
+---------------------------+-------------------------------------------------------------------------+-------------+
|Height-Major Input/Output | P[i, j] = {E[n, h, w, c] &#124; (n=j%N, h=[j/H*4 + k], w=i%W, c=i/W)} | k=[0, 4) |
+---------------------------+-------------------------------------------------------------------------+-------------+
|Width-Major Input/Output | P[i, j] = {E[n, h, w, c] &#124; (n=j/H, h=j%H, w=[i%W*4 + k], c=i/W)} | k=[0, 4) |
+---------------------------+-------------------------------------------------------------------------+-------------+
Filter Tensor
~~~~~~~~~~~~~
+----------------------------+------+---------------------------------+------------------------------------------------------------------------------+
| Tensor |Buffer| Image size [width, height] | Explanation |
+============================+======+=================================+==============================================================================+
|Convolution Filter          | HWOI | [RoundUp<4>(I), H * W * (O+3)/4]|Convolution filter format; no difference compared to [H*W*I, (O+3)/4]         |
+----------------------------+------+---------------------------------+------------------------------------------------------------------------------+
|Depthwise Convolution Filter| HWIM | [H * W * M, (I+3)/4]            |Depthwise-Convolution filter format                                           |
+----------------------------+------+---------------------------------+------------------------------------------------------------------------------+
Each pixel of the **Image** contains 4 elements. The table below lists the
coordinate relationship between **Image** and **Buffer**.
+----------------------------+-------------------------------------------------------------------+---------------------------------------+
|Tensor type | Pixel coordinate relationship | Explanation |
+============================+===================================================================+=======================================+
|Convolution Filter | P[m, n] = {E[h, w, o, i] &#124; (h=T/W, w=T%W, o=[n/HW*4+k], i=m)}| HW= H * W, T=n%HW, k=[0, 4) |
+----------------------------+-------------------------------------------------------------------+---------------------------------------+
|Depthwise Convolution Filter| P[m, n] = {E[h, w, i, 0] &#124; (h=m/W, w=m%W, i=[n*4+k])}        | only support multiplier == 1, k=[0, 4)|
+----------------------------+-------------------------------------------------------------------+---------------------------------------+
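As an illustration, a standalone sketch (not MACE code) of the convolution
filter mapping in the first row of the table:

.. code:: cpp

    // HWOI filter element E[h, w, o, i] maps to image pixel P[m, n], slot k:
    //   m = i,  n = (o / 4) * (H * W) + h * W + w,  k = o % 4
    struct FilterPixel { int m, n, k; };

    inline FilterPixel ConvFilterCoord(int h, int w, int o, int i, int H, int W) {
      return {i, (o / 4) * (H * W) + h * W + w, o % 4};
    }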
1-D Argument Tensor
~~~~~~~~~~~~~~~~~~~
+----------------+----------+------------------------------+---------------------------------+
| Tensor type | Buffer | Image size [width, height] | Explanation |
+================+==========+==============================+=================================+
| 1-D Argument | W | [(W+3)/4, 1] | 1D argument format, e.g. Bias |
+----------------+----------+------------------------------+---------------------------------+
Each pixel of the **Image** contains 4 elements. The table below lists the
coordinate relationship between **Image** and **Buffer**.
+--------------+---------------------------------+-------------+
| Tensor type | Pixel coordinate relationship | Explanation |
+==============+=================================+=============+
|1-D Argument | P[i, 0] = {E[w] &#124; w=i*4+k} | k=[0, 4) |
+--------------+---------------------------------+-------------+
Docker Images
==========================================
* Log in to [Xiaomi Docker Registry](http://docs.api.xiaomi.net/docker-registry/)
```
docker login cr.d.xiaomi.net
```
* Build with `Dockerfile`
```
docker build -t cr.d.xiaomi.net/mace/mace-dev .
```
* Pull image from docker registry
```
docker pull cr.d.xiaomi.net/mace/mace-dev
```
* Create container
```
# Set 'host' network to use ADB
docker run -it --rm -v /local/path:/container/path --net=host cr.d.xiaomi.net/mace/mace-dev /bin/bash
```
Frequently Asked Questions
==========================
Why is the generated static library file size so huge?
-------------------------------------------------------
The static library is simply an archive of a set of object files, which are
intermediate and contain much extra information. Please check whether the
final binary file size is as expected.
Why is the generated binary file (including shared library) size so huge?
-------------------------------------------------------------------------
When compiling the model into C++ code, the final binary may contain extra
debug symbols, which usually take a lot of space. Try to strip the shared
library or binary. The common file size overhead, including the compiled
model (excluding the model weights), should be less than 2MB after stripping.
If the model weights are embedded into the binary, the extra overhead should be
around {model weights size in float32}/2 (e.g. roughly 2MB for 4MB of float32
weights).
OpenCL allocator failed with CL_OUT_OF_RESOURCES
------------------------------------------------
The OpenCL runtime usually requires contiguous virtual memory for its image
buffers; the error occurs when the OpenCL driver can't find enough contiguous
address space due to high memory usage or fragmentation. Several solutions can
be tried:
* Change the model to reduce its memory usage
* Split the Op with the biggest single memory buffer
* Change from armeabi-v7a to arm64-v8a to expand the virtual address space
* Reduce the memory consumption of other modules in the same process
Why is the performance worse than the official result for the same model?
-------------------------------------------------------------------------
The power options may not be set properly; see `mace/public/mace_runtime.h` for
details.
Why does the UI become less responsive when running a model with the GPU runtime?
----------------------------------------------------------------------------------
Try setting `limit_opencl_kernel_time` to `1`. If the problem persists, try
modifying the source code to use even smaller time intervals, or switch to the
CPU or DSP runtime.
How to include more than one deployment file in one application (process)?
------------------------------------------------------------------------------
This case may happen when an application is developed by multiple teams as
submodules. If all the submodules are linked into a single shared library,
then using the same version of MiAI Compute Engine will resolve the issue.
Otherwise, when different deployment models are contained in different shared
libraries, it's not required to use the same MiAI version, but you should
control the symbols exported from each shared library. This is actually a
best practice for all shared libraries; read about the GNU loader
version script for more details.
Create a model deployment
=========================
Each YAML deployment script describes a case of deployment (for example,
a smart camera application may contain face recognition, object recognition,
and voice recognition models, which can be defined in one deployment file),
and will generate one static library (if more than one ABI is specified,
there will be one static library for each). Each YAML script can contain one
or more models.
Model deployment file example
-------------------------------
TODO: change to a link to a standalone file with comments.
.. code:: yaml
# The file name of this configuration will be used as the name of the generated library: libmace-${filename}.a
target_abis: [armeabi-v7a, arm64-v8a]
# The SoC id of the target device, which can be obtained with `adb shell getprop | grep ro.board.platform | cut -d [ -f3 | cut -d ] -f1`
target_socs: [msm8998]
embed_model_data: 1
vlog_level: 0
models: # One deployment file can contain configurations of multiple models; the generated library will contain all of them
first_net: # The tag of the model, which is used to refer to the model when invoking it
platform: tensorflow
model_file_path: path/to/model64.pb # also support http:// and https://
model_sha256_checksum: 7f7462333406e7dea87222737590ebb7d94490194d2f21a7d72bafa87e64e9f9
input_nodes: input_node
output_nodes: output_node
input_shapes: 1,64,64,3
output_shapes: 1,64,64,2
runtime: gpu
limit_opencl_kernel_time: 0
dsp_mode: 0
obfuscate: 1
fast_conv: 0
input_files:
- path/to/input_files # support http://
second_net:
platform: caffe
model_file_path: path/to/model.prototxt
weight_file_path: path/to/weight.caffemodel
model_sha256_checksum: 05d92625809dc9edd6484882335c48c043397aed450a168d75eb8b538e86881a
weight_sha256_checksum: 05d92625809dc9edd6484882335c48c043397aed450a168d75eb8b538e86881a
input_nodes:
- input_node0
- input_node1
output_nodes:
- output_node0
- output_node1
input_shapes:
- 1,256,256,3
- 1,128,128,3
output_shapes:
- 1,256,256,2
- 1,1,1,2
runtime: cpu
limit_opencl_kernel_time: 1
dsp_mode: 0
obfuscate: 1
fast_conv: 0
input_files:
- path/to/input_files # support http://
Configurations
--------------------
+--------------------------+----------------------------------------------------------------------------------------+
| Configuration key | Description |
+==========================+========================================================================================+
| target_abis | The target ABI to build, can be one or more of 'host', 'armeabi-v7a' or 'arm64-v8a' |
+--------------------------+----------------------------------------------------------------------------------------+
| embed_model_data         | Whether to embed the model weights in the code, defaults to 1                         |
+--------------------------+----------------------------------------------------------------------------------------+
| platform | The source framework, tensorflow or caffe |
+--------------------------+----------------------------------------------------------------------------------------+
| model_file_path | The path of the model file, can be local or remote |
+--------------------------+----------------------------------------------------------------------------------------+
| weight_file_path         | The path of the model weights file, used by Caffe models                              |
+--------------------------+----------------------------------------------------------------------------------------+
| model_sha256_checksum | The SHA256 checksum of the model file |
+--------------------------+----------------------------------------------------------------------------------------+
| weight_sha256_checksum   | The SHA256 checksum of the weight file, used by Caffe models                          |
+--------------------------+----------------------------------------------------------------------------------------+
| input_nodes | The input node names, one or more strings |
+--------------------------+----------------------------------------------------------------------------------------+
| output_nodes | The output node names, one or more strings |
+--------------------------+----------------------------------------------------------------------------------------+
| input_shapes | The shapes of the input nodes, in NHWC order |
+--------------------------+----------------------------------------------------------------------------------------+
| output_shapes | The shapes of the output nodes, in NHWC order |
+--------------------------+----------------------------------------------------------------------------------------+
| runtime | The running device, one of CPU, GPU or DSP |
+--------------------------+----------------------------------------------------------------------------------------+
| limit_opencl_kernel_time | Whether to split the OpenCL kernel within 1 ms to keep the UI responsive, defaults to 0|
+--------------------------+----------------------------------------------------------------------------------------+
| dsp_mode                 | Controls the DSP precision and performance; the default 0 works for most cases        |
+--------------------------+----------------------------------------------------------------------------------------+
| obfuscate                | Whether to obfuscate the model operator names, defaults to 0                          |
+--------------------------+----------------------------------------------------------------------------------------+
| fast_conv | Whether to enable Winograd convolution, **will increase memory consumption** |
+--------------------------+----------------------------------------------------------------------------------------+
| input_files | Specify Numpy validation inputs. When not provided, [-1, 1] random values will be used |
+--------------------------+----------------------------------------------------------------------------------------+
Docker Images
=============
- Log in to `Xiaomi Docker
  Registry <http://docs.api.xiaomi.net/docker-registry/>`__
``docker login cr.d.xiaomi.net``
- Build with ``Dockerfile``
``docker build -t cr.d.xiaomi.net/mace/mace-dev .``
- Pull image from docker registry
``docker pull cr.d.xiaomi.net/mace/mace-dev``
- Create a container (set 'host' network to use ADB)
  ``docker run -it --rm -v /local/path:/container/path --net=host cr.d.xiaomi.net/mace/mace-dev /bin/bash``
How to build
============
Supported model formats
-----------------------
+----------------+----------------------------------------------------------------------------------------+
| Framework      | Support status                                                                         |
+================+========================================================================================+
| TensorFlow     | Version >= 1.4 is recommended; older versions may not achieve the best performance    |
|                | (TensorFlow is preferred, considering the upcoming Android NN)                         |
+----------------+----------------------------------------------------------------------------------------+
| Caffe          | Version >= 1.0 is recommended; lower versions may not be supported. Consider          |
|                | using TensorFlow instead                                                               |
+----------------+----------------------------------------------------------------------------------------+
| MXNet          | Not supported yet                                                                      |
+----------------+----------------------------------------------------------------------------------------+
| ONNX           | Not supported yet                                                                      |
+----------------+----------------------------------------------------------------------------------------+
Environment requirements
------------------------
``mace`` provides a docker image with the complete development and runtime environment; the image files can be found under ``./docker/``. Launch command:
.. code:: sh
sudo docker pull cr.d.xiaomi.net/mace/mace-dev
sudo docker run -it --rm --privileged -v /dev/bus/usb:/dev/bus/usb --net=host -v /local/path:/container/path cr.d.xiaomi.net/mace/mace-dev /bin/bash
If you prefer to set up the environment on your own machine, the requirements are as follows:
+---------------------+-----------------+---------------------------------------------------------------------------------------------------+
| Software            | Version         | Install command                                                                                   |
+=====================+=================+===================================================================================================+
| bazel | >= 0.5.4 | - |
+---------------------+-----------------+---------------------------------------------------------------------------------------------------+
| android-ndk | r12c | - |
+---------------------+-----------------+---------------------------------------------------------------------------------------------------+
| adb | >= 1.0.32 | apt install -y android-tools-adb |
+---------------------+-----------------+---------------------------------------------------------------------------------------------------+
| tensorflow | 1.4.0 | pip install tensorflow==1.4.0 |
+---------------------+-----------------+---------------------------------------------------------------------------------------------------+
| scipy | >= 1.0.0 | pip install scipy |
+---------------------+-----------------+---------------------------------------------------------------------------------------------------+
| jinja2 | >= 2.10 | pip install jinja2 |
+---------------------+-----------------+---------------------------------------------------------------------------------------------------+
| PyYaml | >= 3.12 | pip install pyyaml |
+---------------------+-----------------+---------------------------------------------------------------------------------------------------+
| docker(for caffe) | >= 17.09.0-ce | `install doc <https://docs.docker.com/install/linux/docker-ce/ubuntu/#set-up-the-repository>`__ |
+---------------------+-----------------+---------------------------------------------------------------------------------------------------+
Usage
-----
1. Check out the code at the latest tag
**Use the code at the latest tag whenever possible, and do not use the HEAD of the master branch directly.**
.. code:: sh
git clone git@v9.git.n.xiaomi.com:deep-computing/mace.git
# update
git fetch --all --tags --prune
# get latest tag version
tag_name=`git describe --abbrev=0 --tags`
# checkout to latest tag branch
git checkout -b ${tag_name} tags/${tag_name}
2. Model optimization
- TensorFlow
Applying a series of transformations to a trained TensorFlow model can improve
on-device inference speed. TensorFlow provides the official
`TensorFlow Graph Transform
Tool <https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/graph_transforms/README.md>`__
for model optimization (the tool is already provided in the Docker image, and can also be `downloaded <http://cnbj1-inner-fds.api.xiaomi.net/mace/tool/transform_graph>`__ directly or built from the official sources). The optimization commands for GPU and DSP models are listed below:
.. code:: sh
# GPU model:
./transform_graph \
--in_graph=tf_model.pb \
--out_graph=tf_model_opt.pb \
--inputs='input' \
--outputs='output' \
--transforms='strip_unused_nodes(type=float, shape="1,64,64,3")
strip_unused_nodes(type=float, shape="1,64,64,3")
remove_nodes(op=Identity, op=CheckNumerics)
fold_constants(ignore_errors=true)
fold_batch_norms
fold_old_batch_norms
strip_unused_nodes
sort_by_execution_order'
# DSP model:
./transform_graph \
--in_graph=tf_model.pb \
--out_graph=tf_model_opt.pb \
--inputs='input' \
--outputs='output' \
--transforms='strip_unused_nodes(type=float, shape="1,64,64,3")
strip_unused_nodes(type=float, shape="1,64,64,3")
remove_nodes(op=Identity, op=CheckNumerics)
fold_constants(ignore_errors=true)
fold_batch_norms
fold_old_batch_norms
backport_concatv2
quantize_weights(minimum_size=2)
quantize_nodes
strip_unused_nodes
sort_by_execution_order'
- Caffe
Only the latest Caffe version is supported; please upgrade old models with Caffe's own tools.
.. code:: bash
# Upgrade prototxt
$CAFFE_ROOT/build/tools/upgrade_net_proto_text MODEL.prototxt MODEL.new.prototxt
# Upgrade caffemodel
$CAFFE_ROOT/build/tools/upgrade_net_proto_binary MODEL.caffemodel MODEL.new.caffemodel
3. Build the model static library
Building the model library requires a target device, **and the static library must be built with a device of the target SoC attached.**
We provide the ``mace_tools.py`` tool to convert model files into a static library. Steps for using ``tools/mace_tools.py``:
3.2 Run the ``tools/mace_tools.py`` script
.. code:: sh
# print help message
# python tools/mace_tools.py --help
# --config path to the deployment file
# --output_dir output directory of the build results, defaults to `./build`
# --round number of times `examples/mace_run` runs the model, defaults to `1`
# --tuning whether to tune the OpenCL parameters, usually only used by developers, defaults to `true`
# --mode running mode, one of build/run/validate/merge/all/benchmark, defaults to `all`
# Only convert the models and build the static library
python tools/mace_tools.py --config=models/config.yaml --mode=build
# Measure the running time of the models
python tools/mace_tools.py --config=models/config.yaml --mode=run --round=1000
# Compare the results of running the converted models on MACE against running the original
# models with TensorFlow or Caffe directly; the similarity is measured by `cosine distance`.
# On OpenCL devices, a similarity >= `0.995` passes by default; on DSP devices it must reach `0.930`.
python tools/mace_tools.py --config=models/config.yaml --mode=validate --round=1000
# Merge several already-built models into one static library
# e.g. if 8 models were built but only 2 are needed, there is no need to rebuild;
# just edit the global deployment file and merge
python tools/mace_tools.py --config=models/config.yaml --mode=merge
# Run all of the above (can be used for speed tests; round=20 is recommended)
python tools/mace_tools.py --config=models/config.yaml --mode=all --round=1000
# Model benchmark: show the running time of each Op
python tools/mace_tools.py --config=models/config.yaml --mode=benchmark
# Check the memory usage of a running model (if there are several models, you may need to
# comment out all but one model's configuration)
python tools/mace_tools.py --config=models/config.yaml --mode=run --round=10000 &
adb shell dumpsys meminfo | grep mace_run
sleep 10
kill %1
4. Release
The previous steps produce the library files that contain the models. In the application code, only the following 3 groups of files need to be included (``./build/`` is the default output directory):
Header files (including mace.h and one header per model):
* ``./build/${project_name}/${target_abi}/include/mace/public/*.h``
Static libraries (including the MACE engine, OpenCL and the model libraries):
* ``./build/${project_name}/${target_abi}/*.a``
Dynamic library (only needed when some model uses the DSP runtime):
* ``./build/${project_name}/${target_abi}/libhexagon_controller.so``
Model data file (generated only when EMBED_MODEL_DATA=0):
* ``./build/${project_name}/data/${MODEL_TAG}.data``
Intermediate build files:
* ``./build/${project_name}/build/``
Tarball of the library files:
* ``./build/${project_name}/${project_name}.tar.gz``
5. Usage
See ``mace/examples/mace_run.cc`` for the complete usage flow; the key steps are listed below.
.. code:: cpp
// Include headers
#include "mace/public/mace.h"
#include "mace/public/{MODEL_TAG}.h"
// 0. Set up the internal storage
const std::string file_path = "/path/to/store/internal/files";
std::shared_ptr<KVStorageFactory> storage_factory(
    new FileStorageFactory(file_path));
ConfigKVStorageFactory(storage_factory);
//1. Load the model data from a file or from code; custom loading is also
//   possible (e.g. with your own decompression or decryption).
//   Pass nullptr if the model data is embedded in the code.
unsigned char *model_data = mace::MACE_MODEL_TAG::LoadModelData(FLAGS_model_data_file.c_str());
//2. Create the net object
NetDef net_def = mace::MACE_MODEL_TAG::CreateNet(model_data);
//3. Declare the device type (must match the runtime specified at build time)
DeviceType device_type = DeviceType::OPENCL;
//4. Define the arrays of input and output names
std::vector<std::string> input_names = {...};
std::vector<std::string> output_names = {...};
//5. Create the input and output objects
std::map<std::string, mace::MaceTensor> inputs;
std::map<std::string, mace::MaceTensor> outputs;
for (size_t i = 0; i < input_count; ++i) {
// Allocate input and output
int64_t input_size =
std::accumulate(input_shapes[i].begin(), input_shapes[i].end(), 1,
std::multiplies<int64_t>());
auto buffer_in = std::shared_ptr<float>(new float[input_size],
std::default_delete<float[]>());
// load input
...
inputs[input_names[i]] = mace::MaceTensor(input_shapes[i], buffer_in);
}
for (size_t i = 0; i < output_count; ++i) {
int64_t output_size =
std::accumulate(output_shapes[i].begin(), output_shapes[i].end(), 1,
std::multiplies<int64_t>());
auto buffer_out = std::shared_ptr<float>(new float[output_size],
std::default_delete<float[]>());
outputs[output_names[i]] = mace::MaceTensor(output_shapes[i], buffer_out);
}
//6. Create the MaceEngine object
mace::MaceEngine engine(&net_def, device_type, input_names, output_names);
//7. If the device type is OPENCL or HEXAGON, model_data can be released here
if (device_type == DeviceType::OPENCL || device_type == DeviceType::HEXAGON) {
  mace::MACE_MODEL_TAG::UnloadModelData(model_data);
}
//8. Run the model and get the results
engine.Run(inputs, &outputs);
Introduction
============
TODO: describe the concepts and workflow with a diagram.
![alt text](workflow.jpg "MiAI workflow")
TODO: describe the runtime.
Operator lists
==============
+----------------------------------+--------------+--------+-------------------------------------------------------+
| Operator | Android NN | Status | Remark |
+==================================+==============+========+=======================================================+
| ADD | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| AVERAGE\_POOL\_2D | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| BATCH\_NORM | | Y | Fusion with activation is supported |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| BIAS\_ADD | | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| CHANNEL\_SHUFFLE | | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| CONCATENATION | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| CONV\_2D | Y | Y | Fusion with BN and activation layer is supported |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| DEPTHWISE\_CONV\_2D | Y | Y | Only multiplier = 1 is supported; Fusion is supported |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| DEPTH\_TO\_SPACE | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| DEQUANTIZE | Y | | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| EMBEDDING\_LOOKUP | Y | | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| FLOOR | Y | | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| FULLY\_CONNECTED | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| GROUP\_CONV\_2D | | | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| HASHTABLE\_LOOKUP | Y | | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| L2\_NORMALIZATION | Y | | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| L2\_POOL\_2D | Y | | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| LOCAL\_RESPONSE\_NORMALIZATION | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| LOGISTIC | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| LSH\_PROJECTION | Y | | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| LSTM | Y | | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| MATMUL | | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| MAX\_POOL\_2D | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| MUL | Y | | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| PSROI\_ALIGN | | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| PRELU | | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| RELU | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| RELU1 | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| RELU6 | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| RELUX | | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| RESHAPE | Y | Y | Limited support |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| RESIZE\_BILINEAR | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| RNN | Y | | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| RPN\_PROPOSAL\_LAYER | | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| SOFTMAX | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| SPACE\_TO\_DEPTH | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| SVDF | Y | | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
| TANH | Y | Y | |
+----------------------------------+--------------+--------+-------------------------------------------------------+
MiAI Compute Engine Documentation
=================================
Welcome to MiAI Compute Engine documentation.
The main documentation is organized into the following sections:
Contents
--------
.. toctree::
:maxdepth: 1
:caption: Getting started
:name: sec-start
getting_started/introduction
getting_started/create_a_model_deployment
getting_started/how_to_build
getting_started/docker
getting_started/op_lists
.. toctree::
:maxdepth: 1
:caption: Development
:name: sec-devel
development/contributing
development/adding_a_new_op
development/memory_layout
.. toctree::
:maxdepth: 1
:caption: FAQ
:name: sec-faq
faq
Introduction
============
**MACE** - *Mobile(Mi) Accelerated Compute Engine Library* is a neural network acceleration engine for mobile devices developed in-house by Xiaomi.
## Highlights
1. Fast
* Optimized specifically for Xiaomi phone SoCs (Qualcomm, MTK, Pinecone), with GPU acceleration support (the DSP runtime is based on nnlib); faster than Qualcomm's SNPE framework on mainstream Qualcomm platforms (note: speed depends on the model structure; for depthwise conv2d and 1x1 convolutions MACE is on par with SNPE, while for 3x3 and generic convolutions it is faster. Also, on Qualcomm platforms SNPE is currently clearly faster than open source frameworks such as TensorFlow Lite, Caffe/Caffe2, ARM Compute Library, Tencent ncnn and Baidu MDL).
* Supports automatic tuning for different SoCs
* Model data is loaded via mmap, so startup is fast
2. Low memory usage
* MACE supports memory optimization based on computation graph dependencies; buffer reuse reduces runtime memory usage, especially for models with simple dependency structures
3. Small footprint
* MACE itself has no external dependencies, and the core code is under 1MB (models excluded)
4. Built-in model protection
* MACE supports model obfuscation and protection: models are compiled directly into executable code instead of data files, and the obfuscation of sensitive code is strengthened, making reverse engineering harder
5. Easy deployment
* The user API is simple, requiring only one header file; MACE is linked into the application as source code or a static library and introduces no extra dynamic libraries or model data files (the DSP runtime needs one extra dynamic library)
## Supported model formats
| Framework | Support status |
| ---------- |:-------:|
| TensorFlow | Version >= 1.4 is recommended; older versions may not achieve the best performance (TensorFlow is preferred, considering the upcoming Android NN) |
| Caffe | Version >= 1.0 is recommended; lower versions may not be supported. Consider using TensorFlow instead |
| MXNet | Not supported yet |
| ONNX | Not supported yet |
## Environment requirements
`mace` provides a docker image with the complete development and runtime environment; the image files can be found under `./docker/`. Launch command:
```sh
sudo docker pull cr.d.xiaomi.net/mace/mace-dev
sudo docker run -it --rm --privileged -v /dev/bus/usb:/dev/bus/usb --net=host -v /local/path:/container/path cr.d.xiaomi.net/mace/mace-dev /bin/bash
```
If you prefer to set up the environment on your own machine, the requirements are as follows:
| Software | Version | Install command |
| -------- |:--------------:|:---------------------:|
| bazel | >= 0.5.4 | - |
| android-ndk | r12c | - |
| adb | >= 1.0.32 | apt install -y android-tools-adb |
| tensorflow | 1.4.0 | pip install tensorflow==1.4.0 |
| scipy | >= 1.0.0 | pip install scipy |
| jinja2 | >= 2.10 | pip install jinja2 |
| PyYaml | >= 3.12 | pip install pyyaml |
| docker(for caffe) | >= 17.09.0-ce | [install doc](https://docs.docker.com/install/linux/docker-ce/ubuntu/#set-up-the-repository) |
## Source layout
```
|-- tools --> scripts for building and running mace
| |-- mace_tools.py
| |-- ...
|
|-- mace
| |-- benchmark
| |
| |-- codegen --> generated C++ code for models, OpenCL binaries and tuning data
| | |-- models
| | |-- opencl
| | |-- opencl_bin
| | |-- tuning
| |
| |-- core
| |
| |-- examples
| | |-- mace_run.cc --> example that runs a mace model
| | |-- ...
| |
| |-- kernels
| |
| |-- ops
| |
| |-- public --> the mace API
|
|-- docker --> Dockerfile for the mace development environment
```
## Usage
1\. Check out the code at the latest tag
**Use the code at the latest tag whenever possible, and do not use the HEAD of the master branch directly.**
```sh
git clone git@v9.git.n.xiaomi.com:deep-computing/mace.git
# update
git fetch --all --tags --prune
# get latest tag version
tag_name=`git describe --abbrev=0 --tags`
# checkout to latest tag branch
git checkout -b ${tag_name} tags/${tag_name}
```
2\. Model optimization
- TensorFlow
Applying a series of transformations to a trained TensorFlow model can improve on-device inference speed. TensorFlow provides the official
[TensorFlow Graph Transform Tool](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/graph_transforms/README.md)
for model optimization (we made some customizations on top of the official tool; it is already provided in the Docker image and can also be [downloaded](http://cnbj1-inner-fds.api.xiaomi.net/mace/tool/transform_graph) directly; `building it from the official sources is not supported for now`). The optimization commands for GPU and DSP models are listed below:
```sh
# GPU model:
./transform_graph \
--in_graph=tf_model.pb \
--out_graph=tf_model_opt.pb \
--inputs='input' \
--outputs='output' \
--transforms='strip_unused_nodes(type=float, shape="1,64,64,3")
strip_unused_nodes(type=float, shape="1,64,64,3")
remove_nodes(op=Identity, op=CheckNumerics)
fold_batch_norms
fold_old_batch_norms
strip_unused_nodes
sort_by_execution_order'
# DSP model:
./transform_graph \
--in_graph=tf_model.pb \
--out_graph=tf_model_opt.pb \
--inputs='input' \
--outputs='output' \
--transforms='strip_unused_nodes(type=float, shape="1,64,64,3")
strip_unused_nodes(type=float, shape="1,64,64,3")
remove_nodes(op=Identity, op=CheckNumerics)
fold_constants(ignore_errors=true)
fold_batch_norms
fold_old_batch_norms
backport_concatv2
quantize_weights(minimum_size=2)
quantize_nodes
strip_unused_nodes
sort_by_execution_order'
```
- Caffe
Only the latest Caffe version is supported; please upgrade old models with Caffe's own tools.
```bash
# Upgrade prototxt
$CAFFE_ROOT/build/tools/upgrade_net_proto_text MODEL.prototxt MODEL.new.prototxt
# Upgrade caffemodel
$CAFFE_ROOT/build/tools/upgrade_net_proto_binary MODEL.caffemodel MODEL.new.caffemodel
```
3\. Build the model static library
Building the model library requires a target device, ***and the static library must be built with a device of the target SoC attached.***
We provide the `mace_tools.py` tool to convert model files into a static library. Steps for using `tools/mace_tools.py`:
3\.1 Deployment file
The deployment file uses the yml format; the configuration keys are as follows:
```yaml
# The file name of this configuration will be used as the name of the generated library: libmace-${filename}.a
target_abis: [armeabi-v7a, arm64-v8a]
# The SoC id of the target device, which can be obtained with `adb shell getprop | grep ro.board.platform | cut -d [ -f3 | cut -d ] -f1`
target_socs: [msm8998]
embed_model_data: 1
models: # One deployment file can contain configurations of multiple models; the generated library will contain all of them
first_net: # The tag of the model, which is used to refer to the model when invoking it
platform: tensorflow
model_file_path: path/to/model64.pb # also support http:// and https://
model_sha256_checksum: 7f7462333406e7dea87222737590ebb7d94490194d2f21a7d72bafa87e64e9f9
input_nodes: input_node
output_nodes: output_node
input_shapes: 1,64,64,3
output_shapes: 1,64,64,2
runtime: gpu
limit_opencl_kernel_time: 0
dsp_mode: 0
obfuscate: 1
fast_conv: 0
input_files:
- path/to/input_files # support http://
second_net:
platform: caffe
model_file_path: path/to/model.prototxt
weight_file_path: path/to/weight.caffemodel
model_sha256_checksum: 05d92625809dc9edd6484882335c48c043397aed450a168d75eb8b538e86881a
weight_sha256_checksum: 05d92625809dc9edd6484882335c48c043397aed450a168d75eb8b538e86881a
input_nodes:
- input_node0
- input_node1
output_nodes:
- output_node0
- output_node1
input_shapes:
- 1,256,256,3
- 1,128,128,3
output_shapes:
- 1,256,256,2
- 1,1,1,2
runtime: cpu
limit_opencl_kernel_time: 1
dsp_mode: 0
obfuscate: 1
fast_conv: 0
input_files:
- path/to/input_files # support http://
```
The meaning of each configuration key is listed below:
| Configuration key | Description |
| ---------- |:--------------:|
| target_abis | The ABIs to run on, including armeabi-v7a and arm64-v8a for Android devices, and 'host' for the developer's workstation. Multiple ABIs can be specified at once |
| embed_model_data | Whether to embed the model weights into the code, defaults to 1 |
| platform | The source framework of the model, tensorflow or caffe |
| model_file_path | The path of the model file; http or https download links are also supported |
| weight_file_path | The path of the weight file; http or https download links are also supported (Caffe model) |
| model_sha256_checksum | The SHA256 checksum of the model file |
| weight_sha256_checksum | The SHA256 checksum of the weight file (Caffe model) |
| input_nodes | The input nodes of the optimized model or of the original framework model; multiple nodes are supported |
| output_nodes | The output nodes of the optimized model or of the original framework model; multiple nodes are supported |
| input_shapes | The input shapes of the model in NHWC format; multiple shapes are supported |
| output_shapes | The output shapes of the model in NHWC format; multiple shapes are supported |
| runtime | The device to run on, one of cpu, gpu and dsp |
| limit_opencl_kernel_time | Limit each OpenCL kernel work group to run within 1 ms; may hurt performance, disabled by default |
| dsp_mode | Configures different DSP computation modes for different precision and performance; the default value 0 usually works |
| obfuscate | Whether to obfuscate the names of the operators inside the model |
| fast_conv | Use the fastest convolution algorithm, **which may increase memory usage** |
| input_files | (optional) Model input files for result validation, which must match input_nodes; if not specified, random values in [-1,1] will be used |
3\.2 Run the `tools/mace_tools.py` script
```sh
# print help message
# python tools/mace_tools.py --help
# --config path to the deployment file
# --output_dir output directory of the build results, defaults to `./build`
# --round number of times `examples/mace_run` runs the model, defaults to `1`
# --tuning whether to tune the OpenCL parameters, usually only used by developers, defaults to `true`
# --mode running mode, one of build/run/validate/merge/all/benchmark, defaults to `all`
# Only convert the models and build the static library
python tools/mace_tools.py --config=models/config.yml --mode=build
# Measure the running time of the models
python tools/mace_tools.py --config=models/config.yml --mode=run --round=1000
# Compare the results of running the converted models on MACE against running the original
# models with TensorFlow or Caffe directly; the similarity is measured by `cosine distance`.
# On OpenCL devices, a similarity >= `0.995` passes by default; on DSP devices it must reach `0.930`.
python tools/mace_tools.py --config=models/config.yml --mode=validate --round=1000
# Merge several already-built models into one static library
# e.g. if 8 models were built but only 2 are needed, there is no need to rebuild;
# just edit the global deployment file and merge
python tools/mace_tools.py --config=models/config.yml --mode=merge
# Run all of the above (can be used for speed tests; round=20 is recommended)
python tools/mace_tools.py --config=models/config.yml --mode=all --round=1000
# Model benchmark: show the running time of each Op
python tools/mace_tools.py --config=models/config.yml --mode=benchmark
# Check the memory usage of a running model (if there are several models, you may need to
# comment out all but one model's configuration)
python tools/mace_tools.py --config=models/config.yml --mode=run --round=10000 &
adb shell dumpsys meminfo | grep mace_run
sleep 10
kill %1
```
4\. Release
The previous steps produce the library files that contain the models. In the application code, only the following 3 groups of files need to be included (`./build/` is the default output directory):
Header files (including mace.h and one header per model):
* `./build/${project_name}/${target_abi}/include/mace/public/*.h`
Static libraries (including the MACE engine, OpenCL and the model libraries):
* `./build/${project_name}/${target_abi}/*.a`
Dynamic library (only needed when some model uses the DSP runtime):
* `./build/${project_name}/${target_abi}/libhexagon_controller.so`
Model data file (generated only when EMBED_MODEL_DATA=0):
* `./build/${project_name}/data/${MODEL_TAG}.data`
Intermediate build files:
* `./build/${project_name}/build/`
Tarball of the library files:
* `./build/${project_name}/${project_name}.tar.gz`
5\. Usage
See `mace/examples/mace_run.cc` for the complete usage flow; the key steps are listed below.
```c++
// Include headers
#include "mace/public/mace.h"
#include "mace/public/{MODEL_TAG}.h"
// 0. Set up the internal storage
const std::string file_path = "/path/to/store/internal/files";
std::shared_ptr<KVStorageFactory> storage_factory(
    new FileStorageFactory(file_path));
ConfigKVStorageFactory(storage_factory);
//1. Load the model data from a file or from code; custom loading is also
//   possible (e.g. with your own decompression or decryption).
//   Pass nullptr if the model data is embedded in the code.
unsigned char *model_data = mace::MACE_MODEL_TAG::LoadModelData(FLAGS_model_data_file.c_str());
//2. Create the net object
NetDef net_def = mace::MACE_MODEL_TAG::CreateNet(model_data);
//3. Declare the device type
DeviceType device_type = DeviceType::GPU;
//4. Define the arrays of input and output names
std::vector<std::string> input_names = {...};
std::vector<std::string> output_names = {...};
//5. Create the input and output objects
std::map<std::string, mace::MaceTensor> inputs;
std::map<std::string, mace::MaceTensor> outputs;
for (size_t i = 0; i < input_count; ++i) {
// Allocate input and output
int64_t input_size =
std::accumulate(input_shapes[i].begin(), input_shapes[i].end(), 1,
std::multiplies<int64_t>());
auto buffer_in = std::shared_ptr<float>(new float[input_size],
std::default_delete<float[]>());
// load input
...
inputs[input_names[i]] = mace::MaceTensor(input_shapes[i], buffer_in);
}
for (size_t i = 0; i < output_count; ++i) {
int64_t output_size =
std::accumulate(output_shapes[i].begin(), output_shapes[i].end(), 1,
std::multiplies<int64_t>());
auto buffer_out = std::shared_ptr<float>(new float[output_size],
std::default_delete<float[]>());
outputs[output_names[i]] = mace::MaceTensor(output_shapes[i], buffer_out);
}
//6. Create the MaceEngine object
mace::MaceEngine engine(&net_def, device_type, input_names, output_names);
//7. If the device type is GPU or HEXAGON, model_data can be released here
if (device_type == DeviceType::GPU || device_type == DeviceType::HEXAGON) {
  mace::MACE_MODEL_TAG::UnloadModelData(model_data);
}
//8. Run the model and get the results
engine.Run(inputs, &outputs);
```
## Operator list
Operators are being continuously added and improved; contact us if you need new features.
| Operator | Android NN | Status | Remark |
| ------------- |:-------------:|:---------:|:-------------:|
| ADD | Y | Y | |
| AVERAGE_POOL_2D | Y | Y | |
| BATCH_NORM | | Y | Fusion with the activation layer is supported |
| BIAS_ADD | | Y | |
| CHANNEL_SHUFFLE | | | |
| CONCATENATION | Y | Y | |
| CONV_2D | Y | Y | Supports stride and dilations; fusion with batch norm and the activation layer is supported |
| DEPTHWISE_CONV_2D | Y | Y | Currently supports multiplier = 1 and fusion with batch norm and the activation layer |
| DEPTH_TO_SPACE | Y | | |
| DEQUANTIZE | Y | | |
| EMBEDDING_LOOKUP | Y | | |
| FLOOR | Y | | |
| FULLY_CONNECTED | Y | Y | |
| GROUP_CONV_2D | | | |
| HASHTABLE_LOOKUP | Y | | |
| L2_NORMALIZATION | Y | | |
| L2_POOL_2D | Y | | |
| LOCAL_RESPONSE_NORMALIZATION | Y | | |
| LOGISTIC | Y | Y | |
| LSH_PROJECTION | Y | | |
| LSTM | Y | | |
| MATMUL | | | |
| MAX_POOL_2D | Y | Y | |
| MUL | Y | | |
| PSROI_ALIGN | | | |
| PRELU | | Y | |
| RELU | Y | Y | |
| RELU1 | Y | Y | |
| RELU6 | Y | Y | |
| RELUX | | Y | |
| RESHAPE | Y | | |
| RESIZE_BILINEAR | Y | Y | |
| RNN | Y | | |
| RPN_PROPOSAL_LAYER | | | |
| SOFTMAX | Y | Y | |
| SPACE_TO_DEPTH | Y | | |
| SVDF | Y | | |
| TANH | Y | Y | |
## Performance comparison
To be added.