Unverified commit d47d48e3 authored by Artem Tsvetkov, committed by GitHub

Updated pooling kernel to work with embARC MLI Library 2.0 for ARC (#230)

* Updated pooling kernel to work with embARC MLI Library 2.0

* Update pooling.cc copyright.
Parent 46ce0bd3
@@ -4,6 +4,7 @@
* [dzakhar](https://github.com/dzakhar)
* [JaccovG](https://github.com/JaccovG)
* [gerbauz](https://github.com/gerbauz)
## Introduction
@@ -14,23 +15,23 @@ quantization).
## Usage
embARC MLI Library is used by default to speed up execution of some kernels for
asymmetrically quantized layers. This means that usual project generation for
embARC MLI Library is used to speed up execution of some kernels for
asymmetrically quantized layers and can be applied with the option `OPTIMIZED_KERNEL_DIR=arc_mli`.
This means that the usual project generation for an ARC-specific target implies
usage of embARC MLI.
For example:
```
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=arc_emsdp OPTIMIZED_KERNEL_DIR=arc_mli generate_person_detection_int8_make_project
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=arc_vpx OPTIMIZED_KERNEL_DIR=arc_mli generate_person_detection_int8_make_project
```
In case the MLI implementation can’t be used, kernels in this folder fall back to
TFLM reference implementations. For applications which may not benefit from the MLI
library, projects can be generated without these implementations by adding
`ARC_TAGS=no_arc_mli` in the command line, which can reduce overall code size:
library, projects can be generated without these implementations by **removing** `OPTIMIZED_KERNEL_DIR=arc_mli` from the command line, which can reduce overall code size:
```
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=arc_emsdp OPTIMIZED_KERNEL_DIR=arc_mli ARC_TAGS=no_arc_mli generate_person_detection_int8_make_project
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=arc_vpx generate_person_detection_int8_make_project
```
For the ARC EM SDP board, a pre-compiled MLI library is downloaded and used in the
@@ -39,16 +40,34 @@ and compiled during project generation phase. To build library from sources for
ARC EM SDP platform, add the `BUILD_ARC_MLI=true` option to the make command:
```
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=arc_emsdp OPTIMIZED_KERNEL_DIR=arc_mli BUILD_ARC_MLI=true generate_person_detection_int8_make_project
make -f tensorflow/lite/micro/tools/make/Makefile \
TARGET=arc_emsdp \
OPTIMIZED_KERNEL_DIR=arc_mli \
BUILD_ARC_MLI=true \
generate_person_detection_int8_make_project
```
---
### Optional (experimental features):
TFLM can be built using [embARC MLI Library 2.0](https://github.com/foss-for-synopsys-dwc-arc-processors/embarc_mli/tree/Release_2.0_EA) as an experimental feature.
To build TFLM using the embARC MLI Library 2.0, add the following tag to the command:
```
ARC_TAGS=mli20_experimental
```
In this case, generated projects will be placed in the <tcf_file_basename>_mli20_arc_default folder.
Some configurations may require a custom run-time library specified using the `BUILD_LIB_DIR` option. Please check the MLI Library 2.0 [documentation](https://github.com/foss-for-synopsys-dwc-arc-processors/embarc_mli/tree/Release_2.0_EA#build-configuration-options) for more details. The following option can be added:
```
BUILD_LIB_DIR=<path_to_buildlib>
```
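For reference, the options above can be combined into a single project generation command. The following combination is illustrative only, assuming a VPX target and the experimental MLI 2.0 tag:
```
make -f tensorflow/lite/micro/tools/make/Makefile \
TARGET=arc_vpx \
OPTIMIZED_KERNEL_DIR=arc_mli \
ARC_TAGS=mli20_experimental \
BUILD_LIB_DIR=<path_to_buildlib> \
generate_person_detection_int8_make_project
```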
---
If an application exclusively uses accelerated MLI kernel implementations, one
can strip out TFLM reference kernel implementations to reduce the code size of
the application. Build the application with the `MLI_ONLY=true` option in the
generated project (after the project has been generated):
```
cd tensorflow/lite/micro/tools/make/gen/arc_emsdp_arc/prj/person_detection_int8/make
cd tensorflow/lite/micro/tools/make/gen/arc_vpx_arc_default/prj/person_detection_int8/make
make app MLI_ONLY=true
```
@@ -59,35 +78,39 @@ used for some nodes and you need to revert to using TFLM reference kernels.
## Limitations
Currently, the MLI Library provides optimized implementation only for int8
(asymmetric) versions of the following kernels: 1. Convolution 2D – Per axis
quantization only, `dilation_ratio==1` 2. Depthwise Convolution 2D – Per axis
quantization only, `dilation_ratio==1` 3. Average Pooling 4. Max Pooling 5.
Fully Connected
Currently only
[/tensorflow/lite/micro/examples/person_detection](/tensorflow/lite/micro/examples/person_detection)
is quantized using this specification. Other examples can be executed on
ARC-based targets, but will only use reference kernels.
(asymmetric) versions of the following kernels:
1. Convolution 2D – Per axis
quantization only, `dilation_ratio==1`
2. Depthwise Convolution 2D – Per axis
quantization only, `dilation_ratio==1`
3. Average Pooling
4. Max Pooling
5. Fully Connected
## Scratch Buffers and Slicing
The following information applies only for ARC EM SDP and other targets with XY
The following information applies only for ARC EM SDP, VPX and other targets with XY or VCCM
memory. embARC MLI uses specific optimizations which assume node operands are
in XY memory and/or DCCM (Data Closely Coupled Memory). As operands might be
quite big and may not fit in available XY memory, special slicing logic is
in XY, VCCM memory and/or DCCM (Data Closely Coupled Memory). As operands might be
quite big and may not fit in available XY or VCCM memory, special slicing logic is
applied which allows kernel calculations to be split into multiple parts. For
this reason, internal static buffers are allocated in these X, Y and DCCM memory
this reason, internal static buffers are allocated in these X, Y, VCCM and DCCM memory
banks and used to execute sub-calculations.
All this is performed automatically and is invisible to the user. Half of the DCCM
memory bank and the full XY banks are occupied for MLI specific needs. If the
user needs space in XY memory for other tasks, these arrays can be reduced by
memory bank and the full XY banks, or 3/4 of the VCCM bank, are occupied for MLI-specific needs.
If the user needs space in XY or VCCM memory for other tasks, these arrays can be reduced by
setting specific sizes. For this, add the following option to the build command,
replacing **<size[a|b|c]>** with the required values:
**For EM:**
```
EXT_CFLAGS="-DSCRATCH_MEM_Z_SIZE=<size_a> -DSCRATCH_MEM_X_SIZE=<size_b> -DSCRATCH_MEM_Y_SIZE=<size_c>"
```
**For VPX:**
```
EXT_CFLAGS="-DSCRATCH_MEM_VEC_SIZE=<size_a>"
```
For example, the sizes of the arrays placed in DCCM and XCCM can be reduced to
32k and 8k respectively.
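A sketch of such a build command, assuming the sizes are given in bytes (the exact flag values here are illustrative):
```
make app EXT_CFLAGS="-DSCRATCH_MEM_Z_SIZE=32768 -DSCRATCH_MEM_X_SIZE=8192"
```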
......
/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.
/* Copyright 2021 The TensorFlow Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
@@ -27,8 +27,6 @@ limitations under the License.
#include "tensorflow/lite/micro/kernels/kernel_util.h"
namespace tflite {
namespace ops {
namespace micro {
namespace pooling {
namespace {
@@ -47,8 +45,8 @@ struct OpData {
bool is_mli_applicable;
// Tensors in MLI format.
mli_tensor* mli_in;
mli_tensor* mli_out;
mutable ops::micro::MliTensorInterface mli_in;
mutable ops::micro::MliTensorInterface mli_out;
mli_pool_cfg* cfg;
};
@@ -109,15 +107,15 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
}
if (data->is_mli_applicable) {
data->mli_in = static_cast<mli_tensor*>(
context->AllocatePersistentBuffer(context, sizeof(mli_tensor)));
data->mli_out = static_cast<mli_tensor*>(
context->AllocatePersistentBuffer(context, sizeof(mli_tensor)));
data->mli_in = ops::micro::MliTensorInterface(static_cast<mli_tensor*>(
context->AllocatePersistentBuffer(context, sizeof(mli_tensor))));
data->mli_out = ops::micro::MliTensorInterface(static_cast<mli_tensor*>(
context->AllocatePersistentBuffer(context, sizeof(mli_tensor))));
data->cfg = static_cast<mli_pool_cfg*>(
context->AllocatePersistentBuffer(context, sizeof(mli_pool_cfg)));
ops::micro::ConvertToMliTensor(input, data->mli_in);
ops::micro::ConvertToMliTensor(output, data->mli_out);
ops::micro::ConvertToMliTensor(input, &data->mli_in);
ops::micro::ConvertToMliTensor(output, &data->mli_out);
data->cfg->kernel_width = params->filter_width;
data->cfg->kernel_height = params->filter_height;
@@ -176,8 +174,8 @@ TfLiteStatus EvalMli(TfLiteContext* context, const TfLitePoolParams* params,
const MliPoolingType pooling_type) {
mli_pool_cfg cfg_local = *data.cfg;
ops::micro::MliTensorAttachBuffer<int8_t>(input, data.mli_in);
ops::micro::MliTensorAttachBuffer<int8_t>(output, data.mli_out);
ops::micro::MliTensorAttachBuffer<int8_t>(input, &data.mli_in);
ops::micro::MliTensorAttachBuffer<int8_t>(output, &data.mli_out);
const int height_dimension = 1;
int in_slice_height = 0;
@@ -186,18 +184,26 @@ TfLiteStatus EvalMli(TfLiteContext* context, const TfLitePoolParams* params,
// Tensors for data in fast (local) memory and config to copy data from
// external to local memory
mli_tensor in_local = *data.mli_in;
mli_tensor out_local = *data.mli_out;
mli_tensor in_local = *data.mli_in.MliTensor();
mli_tensor out_local = *data.mli_out.MliTensor();
ops::micro::MliTensorInterface in_local_interface(&in_local);
ops::micro::MliTensorInterface out_local_interface(&out_local);
mli_mov_cfg_t copy_config;
mli_mov_cfg_for_copy(&copy_config);
TF_LITE_ENSURE_STATUS(get_arc_scratch_buffer_for_pooling_tensors(
context, &in_local, &out_local));
bool in_is_local = in_local.data == data.mli_in->data;
bool out_is_local = out_local.data == data.mli_out->data;
context, &in_local_interface, &out_local_interface));
bool in_is_local =
in_local_interface.Data<int8_t>() == data.mli_in.Data<int8_t>();
bool out_is_local =
out_local_interface.Data<int8_t>() == data.mli_out.Data<int8_t>();
TF_LITE_ENSURE_STATUS(arc_scratch_buffer_calc_slice_size_io(
&in_local, &out_local, cfg_local.kernel_height, cfg_local.stride_height,
cfg_local.padding_top, cfg_local.padding_bottom, &in_slice_height,
&out_slice_height));
&in_local_interface, &out_local_interface, cfg_local.kernel_height,
cfg_local.stride_height, cfg_local.padding_top, cfg_local.padding_bottom,
&in_slice_height, &out_slice_height));
/* mli_in tensor contains batches of HWC tensors. so it is a 4 dimensional
tensor. because the mli kernel will process one HWC tensor at a time, the 4
@@ -206,10 +212,11 @@ TfLiteStatus EvalMli(TfLiteContext* context, const TfLitePoolParams* params,
for that the sliceHeight has been calculated. The tensor slicer is
configured that it will completely slice the nBatch dimension (0) and slice
the height dimension (1) in chunks of 'sliceHeight' */
TensorSlicer in_slice(data.mli_in, height_dimension, in_slice_height,
cfg_local.padding_top, cfg_local.padding_bottom,
overlap);
TensorSlicer out_slice(data.mli_out, height_dimension, out_slice_height);
ops::micro::TensorSlicer in_slice(data.mli_in.MliTensor(), height_dimension,
in_slice_height, cfg_local.padding_top,
cfg_local.padding_bottom, overlap);
ops::micro::TensorSlicer out_slice(data.mli_out.MliTensor(), height_dimension,
out_slice_height);
/* is_local indicates that the tensor is already in local memory,
so in that case the original tensor can be used,
@@ -418,6 +425,4 @@ TfLiteRegistration Register_MAX_POOL_2D() {
/*version=*/0};
}
} // namespace micro
} // namespace ops
} // namespace tflite
@@ -218,7 +218,11 @@ TF_LITE_MICRO_TEST(LocalAveragePoolTestInt1) {
const float output_max = 127;
int8_t output_data[3];
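// On __Xvdsp builds (e.g. VPX targets with VCCM) place the test data in vector
// memory; otherwise place it in the .Zdata bank.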
#ifdef __Xvdsp
#pragma Bss(".vecmem_data")
#else
#pragma Bss(".Zdata")
#endif
const int kInput1Shape[] = {4, 1, 2, 4, 1};
const int8_t kInput1Data[] = {1, 1, 1, 1, 1, 1, 1, 1};
const int kOutput1Shape[] = {4, 1, 1, 3, 1};
@@ -276,7 +280,11 @@ TF_LITE_MICRO_TEST(LocalAveragePoolTestInt2) {
const float output_max = 127;
int8_t output_data[45];
#ifdef __Xvdsp
#pragma Bss(".vecmem_data")
#else
#pragma Bss(".Zdata")
#endif
const int kInput2Shape[] = {4, 1, 6, 10, 1};
const int8_t kInput2Data[] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
@@ -339,7 +347,11 @@ TF_LITE_MICRO_TEST(LocalMaxPoolTestInt1) {
int stride_width = 1;
int stride_height = 1;
#ifdef __Xvdsp
#pragma Bss(".vecmem_data")
#else
#pragma Bss(".Zdata")
#endif
const int kInput1Shape[] = {4, 1, 2, 4, 1};
const int8_t kInput1Data[] = {1, 1, 1, 1, 1, 1, 1, 1};
const int kOutput1Shape[] = {4, 1, 1, 3, 1};
@@ -399,7 +411,11 @@ TF_LITE_MICRO_TEST(LocalMaxPoolTestInt2) {
int stride_width = 1;
int stride_height = 1;
#ifdef __Xvdsp
#pragma Bss(".vecmem_data")
#else
#pragma Bss(".Zdata")
#endif
const int kInput2Shape[] = {4, 1, 6, 10, 1};
const int8_t kInput2Data[] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
@@ -419,4 +435,4 @@ TF_LITE_MICRO_TEST(LocalMaxPoolTestInt2) {
kTfLitePaddingValid, kTfLiteActNone, output_data);
}
TF_LITE_MICRO_TESTS_END
TF_LITE_MICRO_TESTS_END
\ No newline at end of file