Unverified commit d47d48e3 authored by Artem Tsvetkov, committed by GitHub

Updated pooling kernel to work with embARC MLI Library 2.0 for ARC (#230)

* Updated pooling kernel to work with embARC MLI Library 2.0

* Update pooling.cc copyright.
Parent 46ce0bd3
@@ -4,6 +4,7 @@
* [dzakhar](https://github.com/dzakhar)
* [JaccovG](https://github.com/JaccovG)
* [gerbauz](https://github.com/gerbauz)
## Introduction
@@ -14,23 +15,23 @@ quantization).
## Usage
embARC MLI Library is used by default to speed up execution of some kernels for
asymmetrically quantized layers. This means that usual project generation for
embARC MLI Library is used to speed up execution of some kernels for
asymmetrically quantized layers and can be applied with the option `OPTIMIZED_KERNEL_DIR=arc_mli`.
This means that the usual project generation for an ARC-specific target implies
usage of embARC MLI.
For example:
```
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=arc_emsdp OPTIMIZED_KERNEL_DIR=arc_mli generate_person_detection_int8_make_project
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=arc_vpx OPTIMIZED_KERNEL_DIR=arc_mli generate_person_detection_int8_make_project
```
In case the MLI implementation can’t be used, kernels in this folder fall back to
TFLM reference implementations. For applications which may not benefit from the MLI
library, projects can be generated without these implementations by adding
`ARC_TAGS=no_arc_mli` in the command line, which can reduce overall code size:
library, projects can be generated without these implementations by **removing** `OPTIMIZED_KERNEL_DIR=arc_mli` from the command line, which can reduce overall code size:
```
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=arc_emsdp OPTIMIZED_KERNEL_DIR=arc_mli ARC_TAGS=no_arc_mli generate_person_detection_int8_make_project
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=arc_vpx generate_person_detection_int8_make_project
```
For the ARC EM SDP board, a pre-compiled MLI library is downloaded and used in the
@@ -39,16 +40,34 @@ and compiled during project generation phase. To build library from sources for
ARC EM SDP platform, add the `BUILD_ARC_MLI=true` option to the make command:
```
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=arc_emsdp OPTIMIZED_KERNEL_DIR=arc_mli BUILD_ARC_MLI=true generate_person_detection_int8_make_project
make -f tensorflow/lite/micro/tools/make/Makefile \
TARGET=arc_emsdp \
OPTIMIZED_KERNEL_DIR=arc_mli \
BUILD_ARC_MLI=true \
generate_person_detection_int8_make_project
```
---
### Optional (experimental features):
TFLM can be built using [embARC MLI Library 2.0](https://github.com/foss-for-synopsys-dwc-arc-processors/embarc_mli/tree/Release_2.0_EA) as an experimental feature.
To build TFLM using the embARC MLI Library 2.0, add the following tag to the command:
```
ARC_TAGS=mli20_experimental
```
In this case, generated projects will be placed in the <tcf_file_basename>_mli20_arc_default folder.
Some configurations may require a custom run-time library specified using the `BUILD_LIB_DIR` option. Please check the MLI Library 2.0 [documentation](https://github.com/foss-for-synopsys-dwc-arc-processors/embarc_mli/tree/Release_2.0_EA#build-configuration-options) for more details. The following option can be added:
```
BUILD_LIB_DIR=<path_to_buildlib>
```
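For reference, the options above can be combined into a single project generation command. The following combination is illustrative only, assuming a VPX target and the experimental MLI 2.0 tag:
```
make -f tensorflow/lite/micro/tools/make/Makefile \
TARGET=arc_vpx \
OPTIMIZED_KERNEL_DIR=arc_mli \
ARC_TAGS=mli20_experimental \
BUILD_LIB_DIR=<path_to_buildlib> \
generate_person_detection_int8_make_project
```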
---
If an application exclusively uses accelerated MLI kernel implementations, one
can strip out TFLM reference kernel implementations to reduce the code size of
the application. Build the application with the `MLI_ONLY=true` option in the
generated project (after the project has been generated):
```
cd tensorflow/lite/micro/tools/make/gen/arc_emsdp_arc/prj/person_detection_int8/make
cd tensorflow/lite/micro/tools/make/gen/arc_vpx_arc_default/prj/person_detection_int8/make
make app MLI_ONLY=true
```
@@ -59,35 +78,39 @@ used for some nodes and you need to revert to using TFLM reference kernels.
## Limitations
Currently, the MLI Library provides optimized implementation only for int8
(asymmetric) versions of the following kernels: 1. Convolution 2D – Per axis
quantization only, `dilation_ratio==1` 2. Depthwise Convolution 2D – Per axis
quantization only, `dilation_ratio==1` 3. Average Pooling 4. Max Pooling 5.
Fully Connected
Currently only
[/tensorflow/lite/micro/examples/person_detection](/tensorflow/lite/micro/examples/person_detection)
is quantized using this specification. Other examples can be executed on
ARC-based targets, but will only use reference kernels.
(asymmetric) versions of the following kernels:
1. Convolution 2D – Per axis
quantization only, `dilation_ratio==1`
2. Depthwise Convolution 2D – Per axis
quantization only, `dilation_ratio==1`
3. Average Pooling
4. Max Pooling
5. Fully Connected
## Scratch Buffers and Slicing
The following information applies only for ARC EM SDP and other targets with XY
The following information applies only for ARC EM SDP, VPX and other targets with XY or VCCM
memory. embARC MLI uses specific optimizations which assume node operands are
in XY memory and/or DCCM (Data Closely Coupled Memory). As operands might be
quite big and may not fit in available XY memory, special slicing logic is
in XY, VCCM memory and/or DCCM (Data Closely Coupled Memory). As operands might be
quite big and may not fit in available XY or VCCM memory, special slicing logic is
applied which allows kernel calculations to be split into multiple parts. For
this reason, internal static buffers are allocated in these X, Y and DCCM memory
this reason, internal static buffers are allocated in these X, Y, VCCM and DCCM memory
banks and used to execute sub-calculations.
All this is performed automatically and is invisible to the user. Half of the DCCM
memory bank and the full XY banks are occupied for MLI specific needs. If the
user needs space in XY memory for other tasks, these arrays can be reduced by
memory bank and the full XY banks, or 3/4 of the VCCM bank, are occupied for MLI-specific needs.
If the user needs space in XY or VCCM memory for other tasks, these arrays can be reduced by
setting specific sizes. For this, add the following option to the build command,
replacing **<size[a|b|c]>** with the required values:
**For EM:**
```
EXT_CFLAGS="-DSCRATCH_MEM_Z_SIZE=<size_a> -DSCRATCH_MEM_X_SIZE=<size_b> -DSCRATCH_MEM_Y_SIZE=<size_c>"
```
**For VPX:**
```
EXT_CFLAGS="-DSCRATCH_MEM_VEC_SIZE=<size_a>"
```
For example, the sizes of the arrays placed in DCCM and XCCM can be reduced to
32k and 8k respectively.
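A sketch of such a build command, assuming the sizes are given in bytes (the exact flag values here are illustrative):
```
make app EXT_CFLAGS="-DSCRATCH_MEM_Z_SIZE=32768 -DSCRATCH_MEM_X_SIZE=8192"
```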
......
/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.
/* Copyright 2021 The TensorFlow Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
@@ -27,8 +27,6 @@ limitations under the License.
#include "tensorflow/lite/micro/kernels/kernel_util.h"
namespace tflite {
namespace ops {
namespace micro {
namespace pooling {
namespace {
@@ -47,8 +45,8 @@ struct OpData {
bool is_mli_applicable;
// Tensors in MLI format.
mli_tensor* mli_in;
mli_tensor* mli_out;
mutable ops::micro::MliTensorInterface mli_in;
mutable ops::micro::MliTensorInterface mli_out;
mli_pool_cfg* cfg;
};
@@ -109,15 +107,15 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
}
if (data->is_mli_applicable) {
data->mli_in = static_cast<mli_tensor*>(
context->AllocatePersistentBuffer(context, sizeof(mli_tensor)));
data->mli_out = static_cast<mli_tensor*>(
context->AllocatePersistentBuffer(context, sizeof(mli_tensor)));
data->mli_in = ops::micro::MliTensorInterface(static_cast<mli_tensor*>(
context->AllocatePersistentBuffer(context, sizeof(mli_tensor))));
data->mli_out = ops::micro::MliTensorInterface(static_cast<mli_tensor*>(
context->AllocatePersistentBuffer(context, sizeof(mli_tensor))));
data->cfg = static_cast<mli_pool_cfg*>(
context->AllocatePersistentBuffer(context, sizeof(mli_pool_cfg)));
ops::micro::ConvertToMliTensor(input, data->mli_in);
ops::micro::ConvertToMliTensor(output, data->mli_out);
ops::micro::ConvertToMliTensor(input, &data->mli_in);
ops::micro::ConvertToMliTensor(output, &data->mli_out);
data->cfg->kernel_width = params->filter_width;
data->cfg->kernel_height = params->filter_height;
@@ -176,8 +174,8 @@ TfLiteStatus EvalMli(TfLiteContext* context, const TfLitePoolParams* params,
const MliPoolingType pooling_type) {
mli_pool_cfg cfg_local = *data.cfg;
ops::micro::MliTensorAttachBuffer<int8_t>(input, data.mli_in);
ops::micro::MliTensorAttachBuffer<int8_t>(output, data.mli_out);
ops::micro::MliTensorAttachBuffer<int8_t>(input, &data.mli_in);
ops::micro::MliTensorAttachBuffer<int8_t>(output, &data.mli_out);
const int height_dimension = 1;
int in_slice_height = 0;
@@ -186,18 +184,26 @@ TfLiteStatus EvalMli(TfLiteContext* context, const TfLitePoolParams* params,
// Tensors for data in fast (local) memory and config to copy data from
// external to local memory
mli_tensor in_local = *data.mli_in;
mli_tensor out_local = *data.mli_out;
mli_tensor in_local = *data.mli_in.MliTensor();
mli_tensor out_local = *data.mli_out.MliTensor();
ops::micro::MliTensorInterface in_local_interface(&in_local);
ops::micro::MliTensorInterface out_local_interface(&out_local);
mli_mov_cfg_t copy_config;
mli_mov_cfg_for_copy(&copy_config);
TF_LITE_ENSURE_STATUS(get_arc_scratch_buffer_for_pooling_tensors(
context, &in_local, &out_local));
bool in_is_local = in_local.data == data.mli_in->data;
bool out_is_local = out_local.data == data.mli_out->data;
context, &in_local_interface, &out_local_interface));
bool in_is_local =
in_local_interface.Data<int8_t>() == data.mli_in.Data<int8_t>();
bool out_is_local =
out_local_interface.Data<int8_t>() == data.mli_out.Data<int8_t>();
TF_LITE_ENSURE_STATUS(arc_scratch_buffer_calc_slice_size_io(
&in_local, &out_local, cfg_local.kernel_height, cfg_local.stride_height,
cfg_local.padding_top, cfg_local.padding_bottom, &in_slice_height,
&out_slice_height));
&in_local_interface, &out_local_interface, cfg_local.kernel_height,
cfg_local.stride_height, cfg_local.padding_top, cfg_local.padding_bottom,
&in_slice_height, &out_slice_height));
/* mli_in tensor contains batches of HWC tensors. so it is a 4 dimensional
tensor. because the mli kernel will process one HWC tensor at a time, the 4
@@ -206,10 +212,11 @@ TfLiteStatus EvalMli(TfLiteContext* context, const TfLitePoolParams* params,
for that the sliceHeight has been calculated. The tensor slicer is
configured that it will completely slice the nBatch dimension (0) and slice
the height dimension (1) in chunks of 'sliceHeight' */
TensorSlicer in_slice(data.mli_in, height_dimension, in_slice_height,
cfg_local.padding_top, cfg_local.padding_bottom,
overlap);
TensorSlicer out_slice(data.mli_out, height_dimension, out_slice_height);
ops::micro::TensorSlicer in_slice(data.mli_in.MliTensor(), height_dimension,
in_slice_height, cfg_local.padding_top,
cfg_local.padding_bottom, overlap);
ops::micro::TensorSlicer out_slice(data.mli_out.MliTensor(), height_dimension,
out_slice_height);
/* is_local indicates that the tensor is already in local memory,
so in that case the original tensor can be used,
@@ -418,6 +425,4 @@ TfLiteRegistration Register_MAX_POOL_2D() {
/*version=*/0};
}
} // namespace micro
} // namespace ops
} // namespace tflite
@@ -218,7 +218,11 @@ TF_LITE_MICRO_TEST(LocalAveragePoolTestInt1) {
const float output_max = 127;
int8_t output_data[3];
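// On __Xvdsp builds (e.g. VPX targets with VCCM) place the test data in vector
// memory; otherwise place it in the .Zdata bank.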
#ifdef __Xvdsp
#pragma Bss(".vecmem_data")
#else
#pragma Bss(".Zdata")
#endif
const int kInput1Shape[] = {4, 1, 2, 4, 1};
const int8_t kInput1Data[] = {1, 1, 1, 1, 1, 1, 1, 1};
const int kOutput1Shape[] = {4, 1, 1, 3, 1};
@@ -276,7 +280,11 @@ TF_LITE_MICRO_TEST(LocalAveragePoolTestInt2) {
const float output_max = 127;
int8_t output_data[45];
#ifdef __Xvdsp
#pragma Bss(".vecmem_data")
#else
#pragma Bss(".Zdata")
#endif
const int kInput2Shape[] = {4, 1, 6, 10, 1};
const int8_t kInput2Data[] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
@@ -339,7 +347,11 @@ TF_LITE_MICRO_TEST(LocalMaxPoolTestInt1) {
int stride_width = 1;
int stride_height = 1;
#ifdef __Xvdsp
#pragma Bss(".vecmem_data")
#else
#pragma Bss(".Zdata")
#endif
const int kInput1Shape[] = {4, 1, 2, 4, 1};
const int8_t kInput1Data[] = {1, 1, 1, 1, 1, 1, 1, 1};
const int kOutput1Shape[] = {4, 1, 1, 3, 1};
@@ -399,7 +411,11 @@ TF_LITE_MICRO_TEST(LocalMaxPoolTestInt2) {
int stride_width = 1;
int stride_height = 1;
#ifdef __Xvdsp
#pragma Bss(".vecmem_data")
#else
#pragma Bss(".Zdata")
#endif
const int kInput2Shape[] = {4, 1, 6, 10, 1};
const int8_t kInput2Data[] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
@@ -419,4 +435,4 @@ TF_LITE_MICRO_TEST(LocalMaxPoolTestInt2) {
kTfLitePaddingValid, kTfLiteActNone, output_data);
}
TF_LITE_MICRO_TESTS_END
TF_LITE_MICRO_TESTS_END
\ No newline at end of file