Commit 930c6c31 authored by L liuqi

Add introduction and advanced usage documents.

Parent b2441b02
......@@ -2,45 +2,150 @@ Introduction
============
Mobile AI Compute Engine (MACE) is a deep learning inference framework optimized for
mobile heterogeneous computing platforms. MACE covers common mobile computing devices (CPU, GPU and DSP)
and supplies tools and documentation to help users deploy neural network models to mobile devices.
MACE has been widely used in Xiaomi and has proven industry-leading performance and stability.
Framework
---------
The following figure shows the overall architecture.
.. image:: mace-arch.png
:scale: 40 %
:align: center
==========
MACE Model
==========
MACE defines a customized model format which is similar to
Caffe2. MACE models can be converted from models exported by TensorFlow
and Caffe, and more frameworks will be supported in the future. A YAML file
is used to describe the model deployment details; the next chapters give a
detailed guide on how to create this YAML file.

The MACE model format contains two parts: the model graph definition and
the model parameter tensors. The graph part utilizes Protocol Buffers
for serialization. All the model parameter tensors are concatenated
together into a continuous byte array, which we call the tensor data in
the following paragraphs. The tensor data offsets and lengths are recorded
in the model graph.

The models can be loaded in 3 ways:

1. Both the model graph and the tensor data are dynamically loaded externally
   (by default from the file system, but users are free to choose their own
   implementations, for example with compression or encryption). This
   approach provides the most flexibility but the weakest model protection.
2. Both the model graph and the tensor data are converted into C++ code and
   loaded by executing the compiled code. This approach provides the strongest
   model protection and the simplest deployment.
3. The model graph is converted into C++ code and constructed as in the second
   approach, while the tensor data is loaded externally as in the first approach.
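Which loading approach is used is controlled by the ``model_graph_format`` and
``model_data_format`` fields of the model deployment file described in the next
chapters. As a rough sketch (only ``file`` and ``code`` are accepted values),
the three approaches map onto these fields as follows:

.. code:: yaml

    # Approach 1: both graph and weights loaded externally at runtime
    model_graph_format: file
    model_data_format: file

    # Approach 2: both graph and weights compiled into the library
    model_graph_format: code
    model_data_format: code

    # Approach 3: graph compiled in, weights loaded externally
    model_graph_format: code
    model_data_format: file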
================
MACE Interpreter
================
The MACE Interpreter mainly parses the NN graph and manages the tensors in the graph.
=======
Runtime
=======
The CPU/GPU/DSP runtimes contain the operator (Op) implementations for the corresponding devices.
Work Flow
---------
The following figure shows the basic work flow of MACE.
.. image:: mace-work-flow.png
:scale: 60 %
:align: center
==================================
1. Configure model deployment file
==================================
The model deployment configuration file (``.yml``) describes the information of the model and the library;
MACE will build the library based on this file.
==================
2. Build libraries
==================
Build MACE dynamic or static libraries.
==================
3. Convert model
==================
Convert a TensorFlow or Caffe model to a MACE model.
===========
4.1. Deploy
===========
Integrate the MACE library into your application and run it with the MACE API.
==============
4.2. Run (CLI)
==============
Command line tools are provided to run models; they can be used to test run time, memory usage and correctness.
==============
4.3. Benchmark
==============
MACE supplies a benchmark tool to inspect the run time of every operator in the model.
Introduction
------------
Mobile AI Compute Engine (MACE) is a deep learning inference framework optimized for mobile heterogeneous computing devices.
MACE covers common mobile computing devices (CPU, GPU and DSP) and provides a complete tool chain and documentation,
so that users can conveniently deploy deep learning models on mobile devices. MACE has been widely used inside Xiaomi
and has been thoroughly validated to have industry-leading performance and stability.
Framework
---------
The following figure describes the basic framework of MACE.
.. image:: mace-arch.png
:scale: 60 %
:align: center
==============
MACE Model
==============
MACE defines its own model format (similar to Caffe2). Caffe and TensorFlow models can be converted
into MACE models with the tools provided by MACE.
=================
MACE Interpreter
=================
The MACE Interpreter is mainly responsible for parsing and running the neural network graph (DAG) and managing the tensors in the network.
=======
Runtime
=======
The CPU/GPU/DSP runtimes contain the operator implementations for each computing device.
Work Flow
------------
The following figure describes the basic work flow of using MACE.
.. image:: mace-work-flow-zh.png
:scale: 60 %
:align: center
==============================================
1. Configure the model deployment file (.yml)
==============================================
The model deployment file describes in detail the models to be deployed and the library to be generated;
MACE generates the corresponding library files based on this file.
==================================
2. Build the MACE library
==================================
Build the MACE static or shared library.
====================
3. Convert the model
====================
Convert a TensorFlow or Caffe model into a MACE model.
==================================
4.1. Deploy
==================================
Integrate the library files generated in the Build stage according to your use case, and then call the corresponding MACE APIs to run the model.
==================================
4.2. Run from the command line
==================================
MACE provides command line tools that can run models from the command line; they can be used to test a model's run time, memory usage and correctness.
==================================
4.3. Benchmark
==================================
MACE provides a command line benchmark tool that shows the run time of every operator in the model at a fine granularity.
......@@ -44,12 +44,10 @@ in one deployment file.
If more than one ABI is used, separate them by commas.
* - target_socs
- [optional] Build for specific SoCs.
* - model_graph_format
- model graph format, can be 'file' or 'code'. 'file' converts the model graph to a ProtoBuf file (.pb) and 'code' converts the model graph to C++ code.
* - model_data_format
- model data format, can be 'file' or 'code'. 'file' converts the model weights to a data file (.data) and 'code' converts the model weights to C++ code.
* - model_name
- model name, should be unique if there is more than one model.
**LIMIT: if model_graph_format is code, model_name will be used in C++ code, so model_name must comply with C++ naming rules.**
......@@ -164,8 +162,7 @@ There are two common advanced use cases: 1. convert model to CPP code. 2. tuning
* **3. Deployment**
* Link `libmace.a` and `${library_name}.a` to your target.
* Refer to \ ``mace/examples/example.cc``\ for full usage. The following lists the key steps.
.. code:: cpp
......
Create a model deployment file
==============================
The first step to deploy your models is to create a YAML model deployment
file.
One deployment file describes one case of model deployment.
Each file will generate one static library (if more than one ABI is specified,
there will be one static library for each ABI). The deployment file can contain
one or more models; for example, a smart camera application may contain face
recognition, object recognition, and voice recognition models, all of which can be
defined in one deployment file.
Example
----------
Here is an example deployment file used by an Android demo application.
.. literalinclude:: models/demo_app_models.yml
:language: yaml
Configurations
--------------------
.. list-table::
:header-rows: 1
* - library_name
- library name.
* - target_abis
- The target ABI to build, can be one or more of 'host', 'armeabi-v7a' or 'arm64-v8a'.
* - target_socs
- [optional] Build for specific SoCs if you only want to use the model on those SoCs.
* - model_graph_format
- MACE model graph format, one of ['file', 'code']. 'file' converts the model graph to a ProtoBuf (`.pb`) file and 'code' converts it to C++ code.
* - model_data_format
- MACE model data format, one of ['file', 'code']. 'file' converts the model weights to a `.data` file and 'code' converts them to C++ code.
* - model_name
- model name, should be unique if there are multiple models.
**LIMIT: if model_graph_format is code, model_name will be used in C++ code, so model_name must comply with C++ naming rules.**
* - platform
- The source framework, one of [tensorflow, caffe].
* - model_file_path
- The path of the model file, can be local or remote.
* - model_sha256_checksum
- The SHA256 checksum of the model file.
* - weight_file_path
- [optional] The path of the model weights file, used by Caffe model.
* - weight_sha256_checksum
- [optional] The SHA256 checksum of the weight file, used by Caffe model.
* - subgraphs
- subgraphs key. **DO NOT EDIT**
* - input_tensors
- The input tensor names (TensorFlow) or the top names of the input layers (Caffe); one or more strings.
* - output_tensors
- The output tensor names (TensorFlow) or the top names of the output layers (Caffe); one or more strings.
* - input_shapes
- The shapes of the input tensors, in NHWC order.
* - output_shapes
- The shapes of the output tensors, in NHWC order.
* - input_ranges
- The numerical range of the input tensors, default [-1, 1]. It is only used for testing.
* - validation_inputs_data
- [optional] Specify Numpy validation inputs. When not provided, [-1, 1] random values will be used.
* - runtime
- The running device, one of [cpu, gpu, dsp, cpu_gpu]. cpu_gpu contains CPU and GPU model definition so you can run the model on both CPU and GPU.
* - data_type
- [optional] The data type used for specified runtime. [fp16_fp32, fp32_fp32] for GPU, default is fp16_fp32. [fp32] for CPU. [uint8] for DSP.
* - limit_opencl_kernel_time
- [optional] Whether to split OpenCL kernels so that each runs within 1 ms, to keep the UI responsive; defaults to 0.
* - nnlib_graph_mode
- [optional] Controls the DSP precision and performance; the default 0 works for most cases.
* - obfuscate
- [optional] Whether to obfuscate the model operator names, defaults to 0.
* - winograd
- [optional] Whether to enable Winograd convolution, **will increase memory consumption**.
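The following is a minimal sketch of how these fields are typically arranged in a
deployment file. The library name, model name, paths, checksum and tensor names are
placeholders for illustration only; the nesting follows the ``demo_app_models.yml``
example above, which remains the authoritative, complete reference.

.. code:: yaml

    # Hypothetical deployment file sketch -- all names and paths are placeholders.
    library_name: my_library
    target_abis: [arm64-v8a]
    model_graph_format: code
    model_data_format: code
    models:
      my_model:                            # must be a valid C++ identifier when 'code' is used
        platform: tensorflow
        model_file_path: path/to/model.pb  # local or remote path
        model_sha256_checksum: "<sha256 of model.pb>"
        subgraphs:
          - input_tensors:
              - input
            input_shapes:
              - 1,224,224,3
            output_tensors:
              - output
            output_shapes:
              - 1,1001
        runtime: gpu
        data_type: fp16_fp32
        winograd: 0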
How to build
============
Supported Platforms
-------------------
.. list-table::
:header-rows: 1
* - Platform
- Explanation
* - TensorFlow
- >= 1.6.0.
* - Caffe
- >= 1.0.
Usage
--------
=======================================
1. Pull MACE source code
=======================================
.. code:: sh
git clone https://github.com/XiaoMi/mace.git
git fetch --all --tags --prune
# Checkout the latest tag (i.e. release version)
tag_name=`git describe --abbrev=0 --tags`
git checkout tags/${tag_name}
.. note::
It's highly recommended to use a release version instead of master branch.
============================
2. Model Preprocessing
============================
- TensorFlow
TensorFlow provides
`Graph Transform Tool <https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/graph_transforms/README.md>`__
to improve inference efficiency by making various optimizations like operator
folding and redundant node removal. It is strongly recommended to apply these
optimizations before the graph conversion step.
The following commands show the suggested graph transformations and
optimizations for different runtimes,
.. code:: sh
# CPU/GPU:
./transform_graph \
--in_graph=tf_model.pb \
--out_graph=tf_model_opt.pb \
--inputs='input' \
--outputs='output' \
--transforms='strip_unused_nodes(type=float, shape="1,64,64,3")
strip_unused_nodes(type=float, shape="1,64,64,3")
remove_nodes(op=Identity, op=CheckNumerics)
fold_constants(ignore_errors=true)
flatten_atrous_conv
fold_batch_norms
fold_old_batch_norms
strip_unused_nodes
sort_by_execution_order'
.. code:: sh
# DSP:
./transform_graph \
--in_graph=tf_model.pb \
--out_graph=tf_model_opt.pb \
--inputs='input' \
--outputs='output' \
--transforms='strip_unused_nodes(type=float, shape="1,64,64,3")
strip_unused_nodes(type=float, shape="1,64,64,3")
remove_nodes(op=Identity, op=CheckNumerics)
fold_constants(ignore_errors=true)
fold_batch_norms
fold_old_batch_norms
backport_concatv2
quantize_weights(minimum_size=2)
quantize_nodes
strip_unused_nodes
sort_by_execution_order'
- Caffe
The MACE converter only supports Caffe 1.0+; upgrade
your models with the Caffe built-in tools when necessary,
.. code:: bash
# Upgrade prototxt
$CAFFE_ROOT/build/tools/upgrade_net_proto_text MODEL.prototxt MODEL.new.prototxt
# Upgrade caffemodel
$CAFFE_ROOT/build/tools/upgrade_net_proto_binary MODEL.caffemodel MODEL.new.caffemodel
==============================
3. Build static/shared library
==============================
-----------------
3.1 Overview
-----------------
MACE can build either a static or a shared library (which is
specified by ``linkshared`` in the YAML model deployment file).
The following are two use cases.
* **Build well tuned library for specific SoCs**
When ``target_socs`` is specified in the YAML model deployment file (see the
snippet after this list), the build tool will enable automatic tuning for GPU
kernels. This usually takes some time to finish, depending on the complexity of your model.
.. note::
You should plug in device(s) with the corresponding SoC(s).
* **Build generic library for all SoCs**
When ``target_socs`` is not specified, the generated library is compatible
with general devices.
.. note::
There will be around a 1~10% performance drop for the GPU
runtime compared to a well-tuned library.
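As a sketch, per-SoC tuning is requested simply by listing the SoC(s) in the
deployment file; ``msm8998`` below is taken from the sample output file names in
the Deployment section and is only an example, so substitute the SoC of your own
device, or omit the field entirely to build a generic library:

.. code:: yaml

    # Tune GPU kernels for specific SoCs (the matching devices must be plugged in).
    target_socs: [msm8998]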
MACE provides a command line tool (``tools/converter.py``) for
model conversion, compiling, test run, benchmarking and correctness validation.
.. note::
1. ``tools/converter.py`` should be run at the root directory of this project.
2. When ``linkshared`` is set to ``1``, ``build_type`` should be ``proto``;
currently only Android devices are supported.
------------------------------------------
3.2 \ ``tools/converter.py``\ usage
------------------------------------------
**Commands**
* **build**
build library and test tools.
.. code:: sh
# Build library
python tools/converter.py build --config=models/config.yaml
* **run**
run the model(s).
.. code:: sh
# Test model run time
python tools/converter.py run --config=models/config.yaml --round=100
# Validate the correctness by comparing the results against the
# original model and framework, measured with cosine distance for similarity.
python tools/converter.py run --config=models/config.yaml --validate
# Check the memory usage of the model(**Just keep only one model in configuration file**)
python tools/converter.py run --config=models/config.yaml --round=10000 &
sleep 5
adb shell dumpsys meminfo | grep mace_run
kill %1
.. warning::
``run`` relies on the ``build`` command; you should ``run`` after ``build``.
* **benchmark**
benchmark and profile the model.
.. code:: sh
# Benchmark model, get detailed statistics of each Op.
python tools/converter.py benchmark --config=models/config.yaml
.. warning::
``benchmark`` relies on the ``build`` command; you should ``benchmark`` after ``build``.
**Common arguments**
.. list-table::
:header-rows: 1
* - option
- type
- default
- commands
- explanation
* - --omp_num_threads
- int
- -1
- ``run``/``benchmark``
- number of threads
* - --cpu_affinity_policy
- int
- 1
- ``run``/``benchmark``
- 0:AFFINITY_NONE/1:AFFINITY_BIG_ONLY/2:AFFINITY_LITTLE_ONLY
* - --gpu_perf_hint
- int
- 3
- ``run``/``benchmark``
- 0:DEFAULT/1:LOW/2:NORMAL/3:HIGH
* - --gpu_priority_hint
- int
- 3
- ``run``/``benchmark``
- 0:DEFAULT/1:LOW/2:NORMAL/3:HIGH
Use ``-h`` to get detailed help.
.. code:: sh
python tools/converter.py -h
python tools/converter.py build -h
python tools/converter.py run -h
python tools/converter.py benchmark -h
=============
4. Deployment
=============
The ``build`` command will generate the static/shared library, model files and
header files, and package them as
``build/${library_name}/libmace_${library_name}.tar.gz``.
- The generated ``static`` libraries are organized as follows,
.. code::
build/
└── mobilenet-v2-gpu
├── include
│   └── mace
│   └── public
│   ├── mace.h
│   └── mace_runtime.h
├── libmace_mobilenet-v2-gpu.tar.gz
├── lib
│   ├── arm64-v8a
│   │   └── libmace_mobilenet-v2-gpu.MI6.msm8998.a
│   └── armeabi-v7a
│   └── libmace_mobilenet-v2-gpu.MI6.msm8998.a
├── model
│   ├── mobilenet_v2.data
│   └── mobilenet_v2.pb
└── opencl
├── arm64-v8a
│   └── mobilenet-v2-gpu_compiled_opencl_kernel.MI6.msm8998.bin
└── armeabi-v7a
└── mobilenet-v2-gpu_compiled_opencl_kernel.MI6.msm8998.bin
- The generated ``shared`` libraries are organized as follows,
.. code::
build
└── mobilenet-v2-gpu
├── include
│   └── mace
│   └── public
│   ├── mace.h
│   └── mace_runtime.h
├── lib
│   ├── arm64-v8a
│   │   ├── libgnustl_shared.so
│   │   └── libmace.so
│   └── armeabi-v7a
│   ├── libgnustl_shared.so
│   └── libmace.so
├── model
│   ├── mobilenet_v2.data
│   └── mobilenet_v2.pb
└── opencl
├── arm64-v8a
│   └── mobilenet-v2-gpu_compiled_opencl_kernel.MI6.msm8998.bin
└── armeabi-v7a
└── mobilenet-v2-gpu_compiled_opencl_kernel.MI6.msm8998.bin
.. note::
1. DSP runtime depends on ``libhexagon_controller.so``.
2. ``${MODEL_TAG}.pb`` file will be generated only when ``build_type`` is ``proto``.
3. ``${library_name}_compiled_opencl_kernel.${device_name}.${soc}.bin`` will
be generated only when ``target_socs`` and ``gpu`` runtime are specified.
4. Generated shared library depends on ``libgnustl_shared.so``.
.. warning::
``${library_name}_compiled_opencl_kernel.${device_name}.${soc}.bin`` depends
on the OpenCL version of the device; you should maintain compatibility, or
configure a compiled-kernel cache store with ``ConfigKVStorageFactory``.
=========================================
5. How to use the library in your project
=========================================
Please refer to \ ``mace/examples/example.cc``\ for full usage. The following lists the key steps.
.. code:: cpp
// Include the headers
#include "mace/public/mace.h"
#include "mace/public/mace_runtime.h"
// If the build_type is code
#include "mace/public/mace_engine_factory.h"
// 0. Set pre-compiled OpenCL binary program file paths when available
if (device_type == DeviceType::GPU) {
mace::SetOpenCLBinaryPaths(opencl_binary_paths);
}
// 1. Set the compiled OpenCL kernel cache. This is used to reduce the
// initialization time, since compiling OpenCL kernels is slow. It is suggested
// to set this even when a pre-compiled OpenCL program file is provided,
// because an OpenCL version upgrade may also lead to kernel
// recompilations.
const std::string file_path = "path/to/opencl_cache_file";
std::shared_ptr<KVStorageFactory> storage_factory(
new FileStorageFactory(file_path));
ConfigKVStorageFactory(storage_factory);
// 2. Declare the device type (must be consistent with ``runtime`` in the configuration file)
DeviceType device_type = DeviceType::GPU;
// 3. Define the input and output tensor names.
std::vector<std::string> input_names = {...};
std::vector<std::string> output_names = {...};
// 4. Create MaceEngine instance
std::shared_ptr<mace::MaceEngine> engine;
MaceStatus create_engine_status;
// Create Engine from compiled code
create_engine_status =
CreateMaceEngineFromCode(model_name.c_str(),
nullptr,
input_names,
output_names,
device_type,
&engine);
// Create Engine from model file
create_engine_status =
CreateMaceEngineFromProto(model_pb_data,
model_data_file.c_str(),
input_names,
output_names,
device_type,
&engine);
if (create_engine_status != MaceStatus::MACE_SUCCESS) {
// Report error
}
// 5. Create Input and Output tensor buffers
std::map<std::string, mace::MaceTensor> inputs;
std::map<std::string, mace::MaceTensor> outputs;
for (size_t i = 0; i < input_count; ++i) {
// Allocate input and output
int64_t input_size =
std::accumulate(input_shapes[i].begin(), input_shapes[i].end(), 1,
std::multiplies<int64_t>());
auto buffer_in = std::shared_ptr<float>(new float[input_size],
std::default_delete<float[]>());
// Load input here
// ...
inputs[input_names[i]] = mace::MaceTensor(input_shapes[i], buffer_in);
}
for (size_t i = 0; i < output_count; ++i) {
int64_t output_size =
std::accumulate(output_shapes[i].begin(), output_shapes[i].end(), 1,
std::multiplies<int64_t>());
auto buffer_out = std::shared_ptr<float>(new float[output_size],
std::default_delete<float[]>());
outputs[output_names[i]] = mace::MaceTensor(output_shapes[i], buffer_out);
}
// 6. Run the model
MaceStatus status = engine->Run(inputs, &outputs);