Commit 930c6c31 authored by L liuqi

Add introduction and advanced usage documents.

Parent b2441b02
......@@ -2,45 +2,150 @@ Introduction
============
Mobile AI Compute Engine (MACE) is a deep learning inference framework optimized for
mobile heterogeneous computing platforms. MACE covers common mobile computing devices (CPU, GPU and DSP)
and supplies tools and documentation to help users deploy neural network models to mobile devices.
MACE has been widely used in Xiaomi and has proven industry-leading performance and stability.
Framework
---------
The following figure shows the overall architecture.
.. image:: mace-arch.png
:scale: 40 %
:align: center
==========
MACE Model
==========
MACE defines a customized model format which is similar to
Caffe2. MACE models can be converted from models exported by TensorFlow
and Caffe, and more frameworks will be supported in the future. A YAML file
is used to describe the model deployment details; the next chapters give a
detailed guide on how to create this YAML file.

The MACE model format contains two parts: the model graph definition and
the model parameter tensors. The graph part utilizes Protocol Buffers
for serialization. All the model parameter tensors are concatenated
together into a continuous byte array, which we call the tensor data in
the following paragraphs. The tensor data offsets and lengths are recorded
in the model graph.

The models can be loaded in 3 ways:

1. Both the model graph and the tensor data are dynamically loaded externally
   (by default from the file system, but users are free to choose their own
   implementations, for example with compression or encryption). This
   approach provides the most flexibility but the weakest model protection.
2. Both the model graph and the tensor data are converted into C++ code and
   loaded by executing the compiled code. This approach provides the strongest
   model protection and the simplest deployment.
3. The model graph is converted into C++ code and constructed as in the second
   approach, while the tensor data is loaded externally as in the first approach.
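Which loading approach is used is controlled by the ``model_graph_format`` and
``model_data_format`` fields of the model deployment file described in the next
chapters. As a rough sketch (only ``file`` and ``code`` are accepted values),
the three approaches map onto these fields as follows:

.. code:: yaml

    # Approach 1: both graph and weights loaded externally at runtime
    model_graph_format: file
    model_data_format: file

    # Approach 2: both graph and weights compiled into the library
    model_graph_format: code
    model_data_format: code

    # Approach 3: graph compiled in, weights loaded externally
    model_graph_format: code
    model_data_format: file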
================
MACE Interpreter
================
The MACE Interpreter mainly parses the NN graph and manages the tensors in the graph.
=======
Runtime
=======
The CPU/GPU/DSP runtimes contain the operator (Op) implementations for the corresponding devices.
Work Flow
---------
The following figure shows the basic work flow of MACE.
.. image:: mace-work-flow.png
:scale: 60 %
:align: center
==================================
1. Configure model deployment file
==================================
The model deployment configuration file (``.yml``) describes the information of the model and the library;
MACE will build the library based on this file.
==================
2. Build libraries
==================
Build MACE dynamic or static libraries.
==================
3. Convert model
==================
Convert a TensorFlow or Caffe model to a MACE model.
===========
4.1. Deploy
===========
Integrate the MACE library into your application and run it with the MACE API.
==============
4.2. Run (CLI)
==============
Command line tools are provided to run models; they can be used to test run time, memory usage and correctness.
==============
4.3. Benchmark
==============
MACE supplies a benchmark tool to inspect the run time of every operator in the model.
Introduction
------------
Mobile AI Compute Engine (MACE) is a deep learning inference framework optimized for mobile heterogeneous computing devices.
MACE covers common mobile computing devices (CPU, GPU and DSP) and provides a complete tool chain and documentation,
so that users can conveniently deploy deep learning models on mobile devices. MACE has been widely used inside Xiaomi
and has been thoroughly validated to have industry-leading performance and stability.
Framework
---------
The following figure describes the basic framework of MACE.
.. image:: mace-arch.png
:scale: 60 %
:align: center
==============
MACE Model
==============
MACE defines its own model format (similar to Caffe2). Caffe and TensorFlow models can be converted
into MACE models with the tools provided by MACE.
=================
MACE Interpreter
=================
The MACE Interpreter is mainly responsible for parsing and running the neural network graph (DAG) and managing the tensors in the network.
=======
Runtime
=======
The CPU/GPU/DSP runtimes contain the operator implementations for each computing device.
Work Flow
------------
The following figure describes the basic work flow of using MACE.
.. image:: mace-work-flow-zh.png
:scale: 60 %
:align: center
==============================================
1. Configure the model deployment file (.yml)
==============================================
The model deployment file describes in detail the models to be deployed and the library to be generated;
MACE generates the corresponding library files based on this file.
==================================
2. Build the MACE library
==================================
Build the MACE static or shared library.
====================
3. Convert the model
====================
Convert a TensorFlow or Caffe model into a MACE model.
==================================
4.1. Deploy
==================================
Integrate the library files generated in the Build stage according to your use case, and then call the corresponding MACE APIs to run the model.
==================================
4.2. Run from the command line
==================================
MACE provides command line tools that can run models from the command line; they can be used to test a model's run time, memory usage and correctness.
==================================
4.3. Benchmark
==================================
MACE provides a command line benchmark tool that shows the run time of every operator in the model at a fine granularity.
......@@ -44,12 +44,10 @@ in one deployment file.
If more than one ABI is used, separate them by commas.
* - target_socs
- [optional] Build for specific SoCs.
* - model_graph_format
- model graph format, can be 'file' or 'code'. 'file' converts the model graph to a ProtoBuf file (.pb) and 'code' converts the model graph to C++ code.
* - model_data_format
- model data format, can be 'file' or 'code'. 'file' converts the model weights to a data file (.data) and 'code' converts the model weights to C++ code.
* - model_name
- model name, should be unique if there is more than one model.
**LIMIT: if model_graph_format is code, model_name will be used in C++ code, so model_name must comply with C++ naming rules.**
......@@ -164,8 +162,7 @@ There are two common advanced use cases: 1. convert model to CPP code. 2. tuning
* **3. Deployment**
* Link `libmace.a` and `${library_name}.a` to your target.
* Refer to \ ``mace/examples/example.cc``\ for full usage. The following lists the key steps.
.. code:: cpp
......
Create a model deployment file
==============================
The first step to deploy your models is to create a YAML model deployment
file.
One deployment file describes one case of model deployment.
Each file will generate one static library (if more than one ABI is specified,
there will be one static library for each ABI). The deployment file can contain
one or more models; for example, a smart camera application may contain face
recognition, object recognition, and voice recognition models, all of which can be
defined in one deployment file.
Example
----------
Here is an example deployment file used by an Android demo application.
.. literalinclude:: models/demo_app_models.yml
:language: yaml
Configurations
--------------------
.. list-table::
:header-rows: 1
* - library_name
- library name.
* - target_abis
- The target ABI to build, can be one or more of 'host', 'armeabi-v7a' or 'arm64-v8a'.
* - target_socs
- [optional] Build for specific SoCs if you only want to use the model on those SoCs.
* - model_graph_format
- MACE model graph format, one of ['file', 'code']. 'file' converts the model graph to a ProtoBuf (`.pb`) file and 'code' converts it to C++ code.
* - model_data_format
- MACE model data format, one of ['file', 'code']. 'file' converts the model weights to a `.data` file and 'code' converts them to C++ code.
* - model_name
- model name, should be unique if there are multiple models.
**LIMIT: if model_graph_format is code, model_name will be used in C++ code, so model_name must comply with C++ naming rules.**
* - platform
- The source framework, one of [tensorflow, caffe].
* - model_file_path
- The path of the model file, can be local or remote.
* - model_sha256_checksum
- The SHA256 checksum of the model file.
* - weight_file_path
- [optional] The path of the model weights file, used by Caffe model.
* - weight_sha256_checksum
- [optional] The SHA256 checksum of the weight file, used by Caffe model.
* - subgraphs
- subgraphs key. **DO NOT EDIT**
* - input_tensors
- The input tensor names (TensorFlow) or the top names of the input layers (Caffe); one or more strings.
* - output_tensors
- The output tensor names (TensorFlow) or the top names of the output layers (Caffe); one or more strings.
* - input_shapes
- The shapes of the input tensors, in NHWC order.
* - output_shapes
- The shapes of the output tensors, in NHWC order.
* - input_ranges
- The numerical range of the input tensors, default [-1, 1]. It is only used for testing.
* - validation_inputs_data
- [optional] Specify Numpy validation inputs. When not provided, [-1, 1] random values will be used.
* - runtime
- The running device, one of [cpu, gpu, dsp, cpu_gpu]. cpu_gpu contains CPU and GPU model definition so you can run the model on both CPU and GPU.
* - data_type
- [optional] The data type used for specified runtime. [fp16_fp32, fp32_fp32] for GPU, default is fp16_fp32. [fp32] for CPU. [uint8] for DSP.
* - limit_opencl_kernel_time
- [optional] Whether to split OpenCL kernels so that each runs within 1 ms, to keep the UI responsive; defaults to 0.
* - nnlib_graph_mode
- [optional] Controls the DSP precision and performance; the default 0 works for most cases.
* - obfuscate
- [optional] Whether to obfuscate the model operator names, defaults to 0.
* - winograd
- [optional] Whether to enable Winograd convolution, **will increase memory consumption**.
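The following is a minimal sketch of how these fields are typically arranged in a
deployment file. The library name, model name, paths, checksum and tensor names are
placeholders for illustration only; the nesting follows the ``demo_app_models.yml``
example above, which remains the authoritative, complete reference.

.. code:: yaml

    # Hypothetical deployment file sketch -- all names and paths are placeholders.
    library_name: my_library
    target_abis: [arm64-v8a]
    model_graph_format: code
    model_data_format: code
    models:
      my_model:                            # must be a valid C++ identifier when 'code' is used
        platform: tensorflow
        model_file_path: path/to/model.pb  # local or remote path
        model_sha256_checksum: "<sha256 of model.pb>"
        subgraphs:
          - input_tensors:
              - input
            input_shapes:
              - 1,224,224,3
            output_tensors:
              - output
            output_shapes:
              - 1,1001
        runtime: gpu
        data_type: fp16_fp32
        winograd: 0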
How to build
============
Supported Platforms
-------------------
.. list-table::
:header-rows: 1
* - Platform
- Explanation
* - TensorFlow
- >= 1.6.0.
* - Caffe
- >= 1.0.
Usage
--------
=======================================
1. Pull MACE source code
=======================================
.. code:: sh
git clone https://github.com/XiaoMi/mace.git
git fetch --all --tags --prune
# Checkout the latest tag (i.e. release version)
tag_name=`git describe --abbrev=0 --tags`
git checkout tags/${tag_name}
.. note::
It's highly recommended to use a release version instead of master branch.
============================
2. Model Preprocessing
============================
- TensorFlow
TensorFlow provides
`Graph Transform Tool <https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/graph_transforms/README.md>`__
to improve inference efficiency by making various optimizations like operator
folding and redundant node removal. It is strongly recommended to apply these
optimizations before the graph conversion step.
The following commands show the suggested graph transformations and
optimizations for different runtimes,
.. code:: sh
# CPU/GPU:
./transform_graph \
--in_graph=tf_model.pb \
--out_graph=tf_model_opt.pb \
--inputs='input' \
--outputs='output' \
--transforms='strip_unused_nodes(type=float, shape="1,64,64,3")
strip_unused_nodes(type=float, shape="1,64,64,3")
remove_nodes(op=Identity, op=CheckNumerics)
fold_constants(ignore_errors=true)
flatten_atrous_conv
fold_batch_norms
fold_old_batch_norms
strip_unused_nodes
sort_by_execution_order'
.. code:: sh
# DSP:
./transform_graph \
--in_graph=tf_model.pb \
--out_graph=tf_model_opt.pb \
--inputs='input' \
--outputs='output' \
--transforms='strip_unused_nodes(type=float, shape="1,64,64,3")
strip_unused_nodes(type=float, shape="1,64,64,3")
remove_nodes(op=Identity, op=CheckNumerics)
fold_constants(ignore_errors=true)
fold_batch_norms
fold_old_batch_norms
backport_concatv2
quantize_weights(minimum_size=2)
quantize_nodes
strip_unused_nodes
sort_by_execution_order'
- Caffe
The MACE converter only supports Caffe 1.0+; upgrade
your models with the Caffe built-in tools when necessary,
.. code:: bash
# Upgrade prototxt
$CAFFE_ROOT/build/tools/upgrade_net_proto_text MODEL.prototxt MODEL.new.prototxt
# Upgrade caffemodel
$CAFFE_ROOT/build/tools/upgrade_net_proto_binary MODEL.caffemodel MODEL.new.caffemodel
==============================
3. Build static/shared library
==============================
-----------------
3.1 Overview
-----------------
MACE can build either a static or a shared library (which is
specified by ``linkshared`` in the YAML model deployment file).
The following are two use cases.
* **Build well tuned library for specific SoCs**
When ``target_socs`` is specified in the YAML model deployment file (see the
snippet after this list), the build tool will enable automatic tuning for GPU
kernels. This usually takes some time to finish, depending on the complexity of your model.
.. note::
You should plug in device(s) with the corresponding SoC(s).
* **Build generic library for all SoCs**
When ``target_socs`` is not specified, the generated library is compatible
with general devices.
.. note::
There will be around a 1~10% performance drop for the GPU
runtime compared to a well-tuned library.
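As a sketch, per-SoC tuning is requested simply by listing the SoC(s) in the
deployment file; ``msm8998`` below is taken from the sample output file names in
the Deployment section and is only an example, so substitute the SoC of your own
device, or omit the field entirely to build a generic library:

.. code:: yaml

    # Tune GPU kernels for specific SoCs (the matching devices must be plugged in).
    target_socs: [msm8998]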
MACE provides a command line tool (``tools/converter.py``) for
model conversion, compiling, test run, benchmarking and correctness validation.
.. note::
1. ``tools/converter.py`` should be run at the root directory of this project.
2. When ``linkshared`` is set to ``1``, ``build_type`` should be ``proto``;
currently only Android devices are supported.
------------------------------------------
3.2 \ ``tools/converter.py``\ usage
------------------------------------------
**Commands**
* **build**
build library and test tools.
.. code:: sh
# Build library
python tools/converter.py build --config=models/config.yaml
* **run**
run the model(s).
.. code:: sh
# Test model run time
python tools/converter.py run --config=models/config.yaml --round=100
# Validate the correctness by comparing the results against the
# original model and framework, measured with cosine distance for similarity.
python tools/converter.py run --config=models/config.yaml --validate
# Check the memory usage of the model(**Just keep only one model in configuration file**)
python tools/converter.py run --config=models/config.yaml --round=10000 &
sleep 5
adb shell dumpsys meminfo | grep mace_run
kill %1
.. warning::
``run`` relies on the ``build`` command; you should ``run`` after ``build``.
* **benchmark**
benchmark and profile the model.
.. code:: sh
# Benchmark model, get detailed statistics of each Op.
python tools/converter.py benchmark --config=models/config.yaml
.. warning::
``benchmark`` relies on the ``build`` command; you should ``benchmark`` after ``build``.
**Common arguments**
.. list-table::
:header-rows: 1
* - option
- type
- default
- commands
- explanation
* - --omp_num_threads
- int
- -1
- ``run``/``benchmark``
- number of threads
* - --cpu_affinity_policy
- int
- 1
- ``run``/``benchmark``
- 0:AFFINITY_NONE/1:AFFINITY_BIG_ONLY/2:AFFINITY_LITTLE_ONLY
* - --gpu_perf_hint
- int
- 3
- ``run``/``benchmark``
- 0:DEFAULT/1:LOW/2:NORMAL/3:HIGH
* - --gpu_priority_hint
- int
- 3
- ``run``/``benchmark``
- 0:DEFAULT/1:LOW/2:NORMAL/3:HIGH
Use ``-h`` to get detailed help.
.. code:: sh
python tools/converter.py -h
python tools/converter.py build -h
python tools/converter.py run -h
python tools/converter.py benchmark -h
=============
4. Deployment
=============
The ``build`` command will generate the static/shared library, model files and
header files, and package them as
``build/${library_name}/libmace_${library_name}.tar.gz``.
- The generated ``static`` libraries are organized as follows,
.. code::
build/
└── mobilenet-v2-gpu
├── include
│   └── mace
│   └── public
│   ├── mace.h
│   └── mace_runtime.h
├── libmace_mobilenet-v2-gpu.tar.gz
├── lib
│   ├── arm64-v8a
│   │   └── libmace_mobilenet-v2-gpu.MI6.msm8998.a
│   └── armeabi-v7a
│   └── libmace_mobilenet-v2-gpu.MI6.msm8998.a
├── model
│   ├── mobilenet_v2.data
│   └── mobilenet_v2.pb
└── opencl
├── arm64-v8a
│   └── mobilenet-v2-gpu_compiled_opencl_kernel.MI6.msm8998.bin
└── armeabi-v7a
└── mobilenet-v2-gpu_compiled_opencl_kernel.MI6.msm8998.bin
- The generated ``shared`` libraries are organized as follows,
.. code::
build
└── mobilenet-v2-gpu
├── include
│   └── mace
│   └── public
│   ├── mace.h
│   └── mace_runtime.h
├── lib
│   ├── arm64-v8a
│   │   ├── libgnustl_shared.so
│   │   └── libmace.so
│   └── armeabi-v7a
│   ├── libgnustl_shared.so
│   └── libmace.so
├── model
│   ├── mobilenet_v2.data
│   └── mobilenet_v2.pb
└── opencl
├── arm64-v8a
│   └── mobilenet-v2-gpu_compiled_opencl_kernel.MI6.msm8998.bin
└── armeabi-v7a
└── mobilenet-v2-gpu_compiled_opencl_kernel.MI6.msm8998.bin
.. note::
1. DSP runtime depends on ``libhexagon_controller.so``.
2. ``${MODEL_TAG}.pb`` file will be generated only when ``build_type`` is ``proto``.
3. ``${library_name}_compiled_opencl_kernel.${device_name}.${soc}.bin`` will
be generated only when ``target_socs`` and ``gpu`` runtime are specified.
4. Generated shared library depends on ``libgnustl_shared.so``.
.. warning::
``${library_name}_compiled_opencl_kernel.${device_name}.${soc}.bin`` depends
on the OpenCL version of the device; you should maintain compatibility, or
configure a compiled-kernel cache store with ``ConfigKVStorageFactory``.
=========================================
5. How to use the library in your project
=========================================
Please refer to \ ``mace/examples/example.cc``\ for full usage. The following lists the key steps.
.. code:: cpp
// Include the headers
#include "mace/public/mace.h"
#include "mace/public/mace_runtime.h"
// If the build_type is code
#include "mace/public/mace_engine_factory.h"
// 0. Set pre-compiled OpenCL binary program file paths when available
if (device_type == DeviceType::GPU) {
mace::SetOpenCLBinaryPaths(opencl_binary_paths);
}
// 1. Set the compiled OpenCL kernel cache. This is used to reduce the
// initialization time, since compiling OpenCL kernels is slow. It is suggested
// to set this even when a pre-compiled OpenCL program file is provided,
// because an OpenCL version upgrade may also lead to kernel
// recompilations.
const std::string file_path = "path/to/opencl_cache_file";
std::shared_ptr<KVStorageFactory> storage_factory(
new FileStorageFactory(file_path));
ConfigKVStorageFactory(storage_factory);
// 2. Declare the device type (must be consistent with ``runtime`` in the configuration file)
DeviceType device_type = DeviceType::GPU;
// 3. Define the input and output tensor names.
std::vector<std::string> input_names = {...};
std::vector<std::string> output_names = {...};
// 4. Create MaceEngine instance
std::shared_ptr<mace::MaceEngine> engine;
MaceStatus create_engine_status;
// Create Engine from compiled code
create_engine_status =
CreateMaceEngineFromCode(model_name.c_str(),
nullptr,
input_names,
output_names,
device_type,
&engine);
// Create Engine from model file
create_engine_status =
CreateMaceEngineFromProto(model_pb_data,
model_data_file.c_str(),
input_names,
output_names,
device_type,
&engine);
if (create_engine_status != MaceStatus::MACE_SUCCESS) {
// Report error
}
// 5. Create Input and Output tensor buffers
std::map<std::string, mace::MaceTensor> inputs;
std::map<std::string, mace::MaceTensor> outputs;
for (size_t i = 0; i < input_count; ++i) {
// Allocate input and output
int64_t input_size =
std::accumulate(input_shapes[i].begin(), input_shapes[i].end(), 1,
std::multiplies<int64_t>());
auto buffer_in = std::shared_ptr<float>(new float[input_size],
std::default_delete<float[]>());
// Load input here
// ...
inputs[input_names[i]] = mace::MaceTensor(input_shapes[i], buffer_in);
}
for (size_t i = 0; i < output_count; ++i) {
int64_t output_size =
std::accumulate(output_shapes[i].begin(), output_shapes[i].end(), 1,
std::multiplies<int64_t>());
auto buffer_out = std::shared_ptr<float>(new float[output_size],
std::default_delete<float[]>());
outputs[output_names[i]] = mace::MaceTensor(output_shapes[i], buffer_out);
}
// 6. Run the model
MaceStatus status = engine->Run(inputs, &outputs);