Unverified commit 563f0cc5, authored by Shuai Yuan, committed by GitHub

Fix Multi-threading incorrectness of NEON armv7 gemv_trans_mx1 (#1556)

* Add a data- and model-preparation script; update the corresponding docs: development_android and development_android_GPU

* Add a verified multi-threading-capable gemv_trans_mx1

* Add comments for Alloc and Free

* Optimize the reduction step of gemv_trans_mx1

* Add macros for gemv_trans_mx1

* Fix multi-threading incorrectness of armv7 NEON gemv_trans_mx1

* Add test_gemm_accuracy
Parent 04c139b9
@@ -10,9 +10,10 @@
Requirements: NDK 17 or later, CMake 3.0 or later
### Run the build
In the paddle-mobile root directory, run the following commands:
```shell
cd tools
sh build.sh android
@@ -25,13 +26,12 @@ sh build.sh android mobilenet googlenet
```
After the build finishes, the generated `so` files are located in the `build/release/` directory:
- The JNI headers are at [https://github.com/PaddlePaddle/paddle-mobile/tree/develop/src/io/jni](https://github.com/PaddlePaddle/paddle-mobile/tree/develop/src/io/jni)
- The C++ headers are at [https://github.com/PaddlePaddle/paddle-mobile/blob/develop/src/io/paddle_inference_api.h](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/src/io/paddle_inference_api.h)
The unit-test executables are located in the `test/build` directory.
If you run into environment problems, see the sections that follow.
@@ -39,26 +39,26 @@
##### Download the Android NDK
If Android Studio is installed on your machine, you can download and install the `NDK` directly from Android Studio; alternatively, download it yourself from [https://developer.android.com/ndk/](https://developer.android.com/ndk/), or fetch it with the following commands:
- macOS
```shell
wget https://dl.google.com/android/repository/android-ndk-r17b-darwin-x86_64.zip
unzip android-ndk-r17b-darwin-x86_64.zip
```
- Linux
```shell
wget https://dl.google.com/android/repository/android-ndk-r17b-linux-x86_64.zip
unzip android-ndk-r17b-linux-x86_64.zip
```
##### Set environment variables
The standalone toolchain bundled with the project locates the NDK through the `NDK_ROOT` environment variable, so it must be set:
```shell
export NDK_ROOT="path to ndk"
```
@@ -66,16 +66,17 @@
- macOS
On macOS, CMake can be installed with `homebrew`:
```shell
brew install cmake
```
- Linux
On Linux, CMake can be installed with `apt-get`:
```shell
apt-get install cmake
```
@@ -84,24 +85,29 @@ apt-get install cmake
To obtain a smaller library, you can build one that supports only specified model structures.
For example, run:
```shell
sh build.sh android googlenet
```
This produces a smaller library that supports only googlenet.
## Building in a Docker container
### 1. Install Docker
To install Docker, see the official documentation: [https://docs.docker.com/install/](https://docs.docker.com/install/)
### 2. Set up the build environment with Docker
First enter the paddle-mobile directory and run `docker build`.
Using Linux/Mac as an example (on Windows it is recommended to run this in the 'Docker Quickstart Terminal'):
```shell
$ docker build -t paddle-mobile:dev - < Dockerfile
```
Use `docker images` to see the newly created image:
```shell
$ docker images
REPOSITORY      TAG     IMAGE ID        CREATED        SIZE
paddle-mobile   dev     33b146787711    45 hours ago   372MB
```
@@ -109,7 +115,7 @@
### 3. Build with Docker
Enter the paddle-mobile directory and run `docker run`:
```shell
$ docker run -it --mount type=bind,source=$PWD,target=/paddle-mobile paddle-mobile:dev
root@5affd29d4fc5:/ # cd /paddle-mobile
# Generate the Makefile for the Android build
@@ -120,6 +126,7 @@ root@5affd29d4fc5:/ # rm CMakeCache.txt
root@5affd29d4fc5:/ # cmake -DCMAKE_TOOLCHAIN_FILE=tools/toolchains/arm-linux-gnueabi.cmake
```
### 4. Configure build options
Build options can be configured with `ccmake`:
```
@@ -148,40 +155,36 @@ root@5affd29d4fc5:/ # ccmake .
root@5affd29d4fc5:/ # make
```
### 6. Inspect the build artifacts
The build artifacts can be inspected on the host machine, under the paddle-mobile directory in `build` and `test/build`; they can be transferred to the device for execution with `adb` or `scp`.
## Testing
After the build completes, we provide an automated test script that helps push the models and library files required by the unit-test executables to an Android device.
* Run the script below; it downloads the [mobilenet and test_image_1x3x224x224_float (preprocessed NCHW file) files](http://mms-graph.bj.bcebos.com/paddle-mobile/opencl_test_src.zip) needed for testing, creates the model and image folders under the project's `test` directory, copies `mobilenet` into `paddle-mobile/test/models`, and copies `test_image_1x3x224x224_float` into `paddle-mobile/test/images`.
```shell
cd tools
sh ./prepare_images_and_models.sh
```
* Run the following commands to deploy the executables and the files needed for inference to the phone:
```shell
cd tools/android-debug-script
sh push2android.sh
```
* mobilenet CPU model inference results
Assuming the mobilenet and `test_image_1x3x224x224_float` files have been pushed to the phone, run the following commands to perform mobilenet CPU inference:
```shell
adb shell
cd /data/local/tmp/bin/
export LD_LIBRARY_PATH=.
./test-mobilenet
```
## paddle-mobile GPU development guide
For build-environment setup, see the `development_android.md` document.
1. Download paddle-mobile
```shell
git clone https://github.com/PaddlePaddle/paddle-mobile.git
adb pull /system/vendor/lib/libOpenCL.so paddle-mobile/third_party/opencl
# Edit paddle-mobile/CMakeLists.txt and make the following change:
# option(GPU_CL "opencl gpu" OFF) -> option(GPU_CL "opencl gpu" ON)
cd paddle-mobile/tools
sh build.sh android
```
2. Deploy the unit-test executables and models to the phone
Run the script below; it downloads the [mobilenet and test_image_1x3x224x224_float (preprocessed NCHW file) files](http://mms-graph.bj.bcebos.com/paddle-mobile/opencl_test_src.zip) needed for testing, creates the model and image folders under the project's `test` directory, copies `mobilenet` into `paddle-mobile/test/models`, and copies `test_image_1x3x224x224_float` into `paddle-mobile/test/images`.
```shell
cd tools
sh ./prepare_images_and_models.sh
```
Run the following commands to deploy the executables and the files needed for inference to the phone:
```shell
cd ../tools/android-debug-script
sh push2android.sh
```
3. Run the corresponding executable in `adb shell` (currently only mobilenet is supported; more network models will be supported later)
```shell
adb shell
cd /data/local/tmp/bin/
export LD_LIBRARY_PATH=.
./test-mobilenetgpu
```
4. mobilenet CPU model inference results
Run the following commands to perform mobilenet CPU inference:
```shell
adb shell
cd /data/local/tmp/bin/
export LD_LIBRARY_PATH=.
./test-mobilenet
```
5. Inference results
Phone model: Xiaomi Mi 6 (Snapdragon 835 CPU, Adreno 540 GPU)
@@ -78,8 +73,3 @@ export LD_LIBRARY_PATH=.
1 thread: 90 ms
2 threads: 50 ms
4 threads: 29 ms
@@ -57,9 +57,11 @@ void *Alloc(size_t size) {
  void *r = reinterpret_cast<void *>(reinterpret_cast<size_t>(p + offset) &
                                     (~(MALLOC_ALIGN - 1)));
  static_cast<void **>(r)[-1] = p;
  return r;  // if necessary, initialize the memory yourself (developer's responsibility)
}

// if you use this pointer again after Free, you must reset `ptr`
// to nullptr yourself (developer's responsibility)
void Free(void *ptr) {
  if (ptr) {
    free(static_cast<void **>(ptr)[-1]);
......
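The hunk above shows only the tail of `Alloc`, so for context, here is a minimal self-contained sketch of the aligned-allocation pattern these comments annotate. The `MALLOC_ALIGN` value and the `offset` arithmetic are assumptions for illustration (they are not visible in the diff); the pointer-stashing trick and the caller obligations mirror the comments above.

```cpp
#include <cstdlib>
#include <cstring>

// Sketch only: MALLOC_ALIGN and the offset arithmetic are assumed here.
// MALLOC_ALIGN must be a power of two for the mask below to work.
constexpr size_t MALLOC_ALIGN = 64;

void *Alloc(size_t size) {
  // Over-allocate: room for one stashed pointer plus worst-case alignment slack.
  const size_t offset = sizeof(void *) + MALLOC_ALIGN - 1;
  char *p = static_cast<char *>(std::malloc(offset + size));
  if (p == nullptr) return nullptr;
  // Round p + offset down to the nearest MALLOC_ALIGN boundary; the result
  // still leaves at least sizeof(void *) bytes of headroom below it.
  void *r = reinterpret_cast<void *>(reinterpret_cast<size_t>(p + offset) &
                                     (~(MALLOC_ALIGN - 1)));
  // Stash the raw malloc pointer in the slot just below the aligned address.
  static_cast<void **>(r)[-1] = p;
  return r;  // memory is uninitialized; the caller memsets it if needed
}

void Free(void *ptr) {
  if (ptr) {
    std::free(static_cast<void **>(ptr)[-1]);
  }
}

int main() {
  float *buf = static_cast<float *>(Alloc(1024 * sizeof(float)));
  if (buf != nullptr) {
    std::memset(buf, 0, 1024 * sizeof(float));  // caller-side initialization
  }
  Free(buf);
  buf = nullptr;  // caller resets the pointer after Free, per the comment above
  return 0;
}
```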
@@ -420,94 +420,149 @@ void sgemv_notrans_mx1(const int M, const int N, const float alpha,
void sgemv_trans_mx1(const int M, const int N, const float alpha,
                     const float *A, const int lda, const float *B,
                     const float beta, float *C) {
  // create buf_c to store the partial results computed by each thread
#ifdef _OPENMP
  int threads_num = omp_get_max_threads();
#else
  int threads_num = 1;
#endif  // _OPENMP
  float *buf_c = static_cast<float *>(
      paddle_mobile::memory::Alloc(sizeof(float) * threads_num * M));
  memset(buf_c, 0, threads_num * M * sizeof(float));

#pragma omp parallel for
  for (int n = 0; n < N - 3; n += 4) {
#ifdef _OPENMP
    const int tid = omp_get_thread_num();
#else
    const int tid = 0;
#endif  // _OPENMP
    register float *thread_buf_c = buf_c + tid * M;
    register const float *in0 = A + n * lda;
    register const float *in1 = in0 + lda;
    register const float *in2 = in1 + lda;
    register const float *in3 = in2 + lda;
    register float32x4_t _b = vld1q_f32(B + n);
    register float32x4_t _sum0;
    register int m = 0;
    for (; m < M - 3; m += 4) {
      float32x4_t _r0 = vld1q_f32(in0 + m);
      float32x4_t _r1 = vld1q_f32(in1 + m);
      float32x4_t _r2 = vld1q_f32(in2 + m);
      float32x4_t _r3 = vld1q_f32(in3 + m);
      float32x4_t _vbuff_c = vld1q_f32(thread_buf_c + m);
      _sum0 = vmulq_lane_f32(_r0, vget_low_f32(_b), 0);
      _sum0 = vmlaq_lane_f32(_sum0, _r1, vget_low_f32(_b), 1);
      _sum0 = vmlaq_lane_f32(_sum0, _r2, vget_high_f32(_b), 0);
      _sum0 = vmlaq_lane_f32(_sum0, _r3, vget_high_f32(_b), 1);
      _sum0 = vaddq_f32(_sum0, _vbuff_c);
      vst1q_f32(thread_buf_c + m, _sum0);
    }
    if (m < M) {
      float32x4_t _r0 = vld1q_f32(in0 + m);
      float32x4_t _r1 = vld1q_f32(in1 + m);
      float32x4_t _r2 = vld1q_f32(in2 + m);
      float32x4_t _r3 = vld1q_f32(in3 + m);
      float32x4_t _vbuff_c = vld1q_f32(thread_buf_c + m);
      _sum0 = vmulq_lane_f32(_r0, vget_low_f32(_b), 0);
      _sum0 = vmlaq_lane_f32(_sum0, _r1, vget_low_f32(_b), 1);
      _sum0 = vmlaq_lane_f32(_sum0, _r2, vget_high_f32(_b), 0);
      _sum0 = vmlaq_lane_f32(_sum0, _r3, vget_high_f32(_b), 1);
      _sum0 = vaddq_f32(_sum0, _vbuff_c);
      switch (M - m) {
        case 3:
          vst1q_lane_f32(thread_buf_c + m + 2, _sum0, 2);
        case 2:
          vst1_f32(thread_buf_c + m, vget_low_f32(_sum0));
          break;
        case 1:
          vst1q_lane_f32(thread_buf_c + m, _sum0, 0);
          break;
      }
    }
  }
  // remaining n
#pragma omp parallel for
  for (int n = (N & 0xfffffffc); n < N; ++n) {
#ifdef _OPENMP
    const int tid = omp_get_thread_num();
#else
    const int tid = 0;
#endif  // _OPENMP
    register float *thread_buf_c = buf_c + tid * M;
    register const float *in0 = A + n * lda;
    register float32x4_t _b = vld1q_dup_f32(B + n);
    register float32x4_t _sum0;
    register int m = 0;
    for (; m < M - 3; m += 4) {
      float32x4_t _r0 = vld1q_f32(in0 + m);
      float32x4_t _vbuff_c = vld1q_f32(thread_buf_c + m);
      _sum0 = vmulq_f32(_r0, _b);
      _sum0 = vaddq_f32(_sum0, _vbuff_c);
      vst1q_f32(thread_buf_c + m, _sum0);
    }
    for (; m < M; ++m) {
      thread_buf_c[m] += in0[m] * B[n];
    }
  }
  // reduction over buf_c: sum the per-thread partial results of A' * X
  // into C, then finish y := alpha * A' * X + beta * y
  register float32x4_t _valpha = vdupq_n_f32(alpha);
  if (beta == 0.f) {
#pragma omp parallel for
    for (int m = 0; m < M - 3; m += 4) {
      register float32x4_t _sum0 = vld1q_f32(buf_c + m);
      for (int tid = 1; tid < threads_num; ++tid) {
        _sum0 = vaddq_f32(_sum0, vld1q_f32(buf_c + tid * M + m));
      }
      vst1q_f32(C + m, vmulq_f32(_sum0, _valpha));
    }
#pragma omp parallel for
    for (int m = (M & 0xfffffffc); m < M; ++m) {
      register float _sum0 = *(buf_c + m);
      for (register int tid = 1; tid < threads_num; ++tid) {
        _sum0 += *(buf_c + tid * M + m);
      }
      C[m] = _sum0 * alpha;
    }
  } else {  // beta != 0.f
    register float32x4_t _vbeta = vdupq_n_f32(beta);
#pragma omp parallel for
    for (int m = 0; m < M - 3; m += 4) {
      register float32x4_t _sum0 = vld1q_f32(buf_c + m);
      for (register int tid = 1; tid < threads_num; ++tid) {
        _sum0 = vaddq_f32(_sum0, vld1q_f32(buf_c + tid * M + m));
      }
      float32x4_t _vc = vld1q_f32(C + m);
      vst1q_f32(C + m, vmlaq_f32(vmulq_f32(_sum0, _valpha), _vbeta, _vc));
    }
#pragma omp parallel for
    for (int m = (M & 0xfffffffc); m < M; ++m) {
      register float _sum0 = *(buf_c + m);
      for (register int tid = 1; tid < threads_num; ++tid) {
        _sum0 += *(buf_c + tid * M + m);
      }
      C[m] = _sum0 * alpha + beta * C[m];
    }
  }
  // free buf_c
  paddle_mobile::memory::Free(buf_c);
}
void sgemv_mx1(const bool trans, const int M, const int N, const float alpha,
......
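Stripped of the NEON intrinsics, the structure of this fix is: each OpenMP thread accumulates its share of A' * X into a private M-length slice of a scratch buffer, and a final reduction folds the slices into C while applying alpha and beta. Below is a minimal scalar sketch of the same pattern (plain C++ with OpenMP; the function name and the use of std::vector are illustrative, not paddle-mobile's API):

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Scalar sketch of the per-thread-buffer scheme used above:
// computes C = alpha * A' * B + beta * C, where column n of A'
// is stored at A + n * lda, matching the indexing in the NEON code.
void sgemv_trans_sketch(int M, int N, float alpha, const float *A, int lda,
                        const float *B, float beta, float *C) {
#ifdef _OPENMP
  const int threads_num = omp_get_max_threads();
#else
  const int threads_num = 1;
#endif
  std::vector<float> buf_c(static_cast<size_t>(threads_num) * M, 0.f);
#pragma omp parallel for
  for (int n = 0; n < N; ++n) {
#ifdef _OPENMP
    float *tbuf = buf_c.data() + omp_get_thread_num() * M;
#else
    float *tbuf = buf_c.data();
#endif
    // Each thread owns its slice, so no two threads write the same cell;
    // this is what removes the race on C in the old implementation.
    for (int m = 0; m < M; ++m) {
      tbuf[m] += A[n * lda + m] * B[n];
    }
  }
  // Reduction: fold the per-thread slices into C and apply alpha/beta.
  for (int m = 0; m < M; ++m) {
    float sum = 0.f;
    for (int tid = 0; tid < threads_num; ++tid) {
      sum += buf_c[static_cast<size_t>(tid) * M + m];
    }
    C[m] = (beta == 0.f) ? alpha * sum : alpha * sum + beta * C[m];
  }
}
```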
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
......
@@ -22,6 +22,7 @@ fi
IMAGES_DIR="/data/local/tmp/images"
adb shell mkdir ${IMAGES_DIR}
LIB_PATH="../../build/release/arm-v7a/build/*"
#LIB_PATH="../../build/release/arm-v8a/build/*"
adb push ${EXE_FILE} ${EXE_DIR}
for file in ${LIB_PATH}
do
......
#!/usr/bin/env bash
# declare download paths of images and models
PADDLE_MOBILE_ROOT="$(pwd)/../"
IMAGES_AND_MODELS="opencl_test_src"
IMAGES_AND_MODELS_PATH="http://mms-graph.bj.bcebos.com/paddle-mobile/${IMAGES_AND_MODELS}.zip"
# download and unzip the archive of images and models
mkdir -p ${PADDLE_MOBILE_ROOT}/download/
cd ${PADDLE_MOBILE_ROOT}/download/
wget -c ${IMAGES_AND_MODELS_PATH}
unzip -o ./${IMAGES_AND_MODELS}.zip
# create the models and images directories under test
mkdir -p ${PADDLE_MOBILE_ROOT}/test/models
mkdir -p ${PADDLE_MOBILE_ROOT}/test/images
# copy the test image and model into the test directories
cp ./${IMAGES_AND_MODELS}/input_3x224x224_banana ${PADDLE_MOBILE_ROOT}/test/images/
cp -r ./${IMAGES_AND_MODELS}/mobilenet ${PADDLE_MOBILE_ROOT}/test/models/