Unverified commit 563f0cc5, authored by Shuai Yuan, committed by GitHub

Fix Multi-threading incorrectness of NEON armv7 gemv_trans_mx1 (#1556)

* Add data and model preparation script; update the corresponding docs: development_android and development_android_GPU

* Add verified multi-threading support for gemv_trans_mx1

* add comments for alloc and free

* Optimize reduction process of gemv_trans_mx1

* add macro for gemv_trans_mx1

* Fix multi-threading incorrectness of armv7 NEON gemv_trans_mx1

* add test_gemm_accuracy
Parent 04c139b9
@@ -10,9 +10,10 @@
Requirements: NDK 17 or above, CMake 3.0 or above
### Run the build
In the paddle-mobile root directory, run the following commands:
```shell
cd tools
sh build.sh android
@@ -25,13 +26,12 @@ sh build.sh android mobilenet googlenet
```
After the build finishes, the generated `so` files are located in the `build/release/` directory:
- The JNI header files are at [https://github.com/PaddlePaddle/paddle-mobile/tree/develop/src/io/jni](https://github.com/PaddlePaddle/paddle-mobile/tree/develop/src/io/jni)
- The C++ header files are at [https://github.com/PaddlePaddle/paddle-mobile/blob/develop/src/io/paddle_inference_api.h](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/src/io/paddle_inference_api.h)

The unit-test executables are located in the `test/build` directory.
If you run into environment problems, see the sections below.
@@ -39,26 +39,26 @@ The C++ header files are at [https://github.com/PaddlePaddle/paddle-mobile/blob/develop/
##### Download the Android NDK
If Android Studio is installed on your computer, you can download and install the NDK directly from Android Studio. You can also download it yourself from [https://developer.android.com/ndk/](https://developer.android.com/ndk/), or fetch it with the following commands:

- macOS
```shell
wget https://dl.google.com/android/repository/android-ndk-r17b-darwin-x86_64.zip
unzip android-ndk-r17b-darwin-x86_64.zip
```
- Linux
```shell
wget https://dl.google.com/android/repository/android-ndk-r17b-linux-x86_64.zip
unzip android-ndk-r17b-linux-x86_64.zip
```
##### Set environment variables
The standalone toolchain bundled with the project locates the NDK through the `NDK_ROOT` environment variable, so it must be configured:
```shell
export NDK_ROOT="path to ndk"
```
@@ -66,16 +66,17 @@ export NDK_ROOT="path to ndk"
- macOS

On macOS, it can be installed with `homebrew`:
```shell
brew install cmake
```
- Linux

On Linux, it can be installed with `apt-get`:
```shell
apt-get install cmake
```
@@ -84,24 +85,29 @@ apt-get install cmake
If you want a smaller library, you can build a library that supports only specified model structures.
For example, run the following command:
```shell
sh build.sh android googlenet
```
This produces a smaller library that supports only googlenet.
## Building with a Docker container
### 1. Install Docker
For how to install Docker, see the official documentation: [https://docs.docker.com/install/](https://docs.docker.com/install/)
### 2. Set up the build environment with Docker
First enter the paddle-mobile directory and run `docker build`.
Using Linux/Mac as an example (on Windows, it is recommended to run this in the 'Docker Quickstart Terminal'):
```shell
$ docker build -t paddle-mobile:dev - < Dockerfile
```
Use `docker images` to see the newly built image:
```shell
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
paddle-mobile dev 33b146787711 45 hours ago 372MB
@@ -109,7 +115,7 @@ paddle-mobile dev 33b146787711 45 hours ago 372MB
### 3. Build with Docker
Enter the paddle-mobile directory and run `docker run`:
```shell
$ docker run -it --mount type=bind,source=$PWD,target=/paddle-mobile paddle-mobile:dev
root@5affd29d4fc5:/ # cd /paddle-mobile
# generate the Makefile for the Android build artifacts
@@ -120,6 +126,7 @@ root@5affd29d4fc5:/ # rm CMakeCache.txt
root@5affd29d4fc5:/ # cmake -DCMAKE_TOOLCHAIN_FILE=tools/toolchains/arm-linux-gnueabi.cmake
```
### 4. Set build options
Build options can be set with `ccmake`:
```
@@ -148,40 +155,36 @@ root@5affd29d4fc5:/ # ccmake .
root@5affd29d4fc5:/ # make
```
### 6. Inspect the build artifacts
The build artifacts can be inspected on the host machine, under `build` and `test/build` in the paddle-mobile directory; they can be transferred to the device with `adb` or `scp` for execution, as sketched below.
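For example, a minimal sketch of an `adb` transfer: the library path matches the arm-v7a layout used by `push2android.sh`, the device directory matches the one used by the tests below, and `test-mobilenet` stands in for whichever unit test you built.
```shell
# create the target directory on the device, then push the libraries and one unit test
adb shell mkdir -p /data/local/tmp/bin
for f in build/release/arm-v7a/build/*; do adb push "$f" /data/local/tmp/bin; done
adb push test/build/test-mobilenet /data/local/tmp/bin
```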
## Testing
After the build finishes, we provide automated test scripts that help push the models and library files needed by the unit tests to an Android device.
* Run the script below. It downloads the [mobilenet and test_image_1x3x224x224_float (a preprocessed NCHW file) files](http://mms-graph.bj.bcebos.com/paddle-mobile/opencl_test_src.zip) needed for testing, creates the model and image folders under the project's `test` directory, copies `mobilenet` into the `paddle-mobile/test/models` directory, and copies `test_image_1x3x224x224_float` into the `paddle-mobile/test/images` directory.
```shell
cd tools
sh ./prepare_images_and_models.sh
```
* Run the following commands to deploy the executables and the files needed for inference to the phone:
```shell
cd tools/android-debug-script
sh push2android.sh
```
* mobilenet CPU inference
Assuming the mobilenet and `test_image_1x3x224x224_float` files have already been pushed to the phone, run the following commands to run mobilenet inference on the CPU:
```shell
adb shell
cd /data/local/tmp/bin/
export LD_LIBRARY_PATH=.
./test-mobilenet
```
## paddle-mobile GPU development notes
For how to configure the build environment, see the `development_android.md` document.
1. Download paddle-mobile
```shell
git clone https://github.com/PaddlePaddle/paddle-mobile.git
adb pull /system/vendor/lib/libOpenCL.so paddle-mobile/third_party/opencl
# modify the paddle-mobile/CMakeLists.txt file as follows:
# option(GPU_CL "opencl gpu" OFF) -> option(GPU_CL "opencl gpu" ON)
cd paddle-mobile/tools
sh build.sh android
```
2. Deploy the unit-test executables and models to the phone
Run the script below. It downloads the [mobilenet and test_image_1x3x224x224_float (a preprocessed NCHW file) files](http://mms-graph.bj.bcebos.com/paddle-mobile/opencl_test_src.zip) needed for testing, creates the model and image folders under the project's `test` directory, copies `mobilenet` into the `paddle-mobile/test/models` directory, and copies `test_image_1x3x224x224_float` into the `paddle-mobile/test/images` directory.
```shell
cd tools
sh ./prepare_images_and_models.sh
```
Run the following commands to deploy the executables and the files needed for inference to the phone:
```shell
cd ../tools/android-debug-script
sh push2android.sh
```
3. In `adb shell`, run the corresponding executable (currently only mobilenet is supported; more network models will be supported later):
```shell
adb shell
cd /data/local/tmp/bin/
export LD_LIBRARY_PATH=.
./test-mobilenetgpu
```
4. mobilenet CPU inference
Run the following commands to run mobilenet inference on the CPU:
```shell
adb shell
cd /data/local/tmp/bin/
export LD_LIBRARY_PATH=.
./test-mobilenet
```
5. Inference results
Phone model: Xiaomi Mi 6 (CPU: Snapdragon 835, GPU: Adreno 540)
@@ -78,8 +73,3 @@ export LD_LIBRARY_PATH=.
1 thread: 90 ms
2 threads: 50 ms
4 threads: 29 ms
@@ -57,9 +57,11 @@ void *Alloc(size_t size) {
  void *r = reinterpret_cast<void *>(reinterpret_cast<size_t>(p + offset) &
                                     (~(MALLOC_ALIGN - 1)));
  // stash the original allocation pointer in the slot just below the aligned
  // address, so that Free() can recover it
  static_cast<void **>(r)[-1] = p;
  return r;  // if necessary, initialize the memory yourself (developer)
}

// if you use this pointer again after Free, you must set `ptr` to nullptr
// yourself (developer)
void Free(void *ptr) {
  if (ptr) {
    free(static_cast<void **>(ptr)[-1]);
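As a usage note, here is a minimal sketch of the calling pattern these comments describe — a hypothetical caller, mirroring how `sgemv_trans_mx1` below uses these helpers:
```cpp
#include <cstring>

// Alloc returns aligned but uninitialized memory, and Free does not reset the
// caller's pointer, so the caller zeroes the buffer and nulls the pointer.
void example_usage() {
  const size_t count = 1024;
  float *buf = static_cast<float *>(
      paddle_mobile::memory::Alloc(sizeof(float) * count));
  memset(buf, 0, sizeof(float) * count);  // initialize it yourself
  // ... use buf ...
  paddle_mobile::memory::Free(buf);
  buf = nullptr;  // avoid dangling use after Free
}
```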
@@ -420,94 +420,149 @@ void sgemv_notrans_mx1(const int M, const int N, const float alpha,
void sgemv_trans_mx1(const int M, const int N, const float alpha,
                     const float *A, const int lda, const float *B,
                     const float beta, float *C) {
  // Each thread accumulates its partial result into its own slice of buf_c;
  // writing directly into C from the parallel loops would race, because every
  // column n contributes to every C[m].
#ifdef _OPENMP
  int threads_num = omp_get_max_threads();
#else
  int threads_num = 1;
#endif  // _OPENMP
  float *buf_c = static_cast<float *>(
      paddle_mobile::memory::Alloc(sizeof(float) * threads_num * M));
  memset(buf_c, 0, threads_num * M * sizeof(float));

#pragma omp parallel for
  for (int n = 0; n < N - 3; n += 4) {
#ifdef _OPENMP
    const int tid = omp_get_thread_num();
#else
    const int tid = 0;
#endif  // _OPENMP
    register float *thread_buf_c = buf_c + tid * M;
    register const float *in0 = A + n * lda;
    register const float *in1 = in0 + lda;
    register const float *in2 = in1 + lda;
    register const float *in3 = in2 + lda;
    register float32x4_t _b = vld1q_f32(B + n);
    register float32x4_t _sum0;
    register int m = 0;
    for (; m < M - 3; m += 4) {
      float32x4_t _r0 = vld1q_f32(in0 + m);
      float32x4_t _r1 = vld1q_f32(in1 + m);
      float32x4_t _r2 = vld1q_f32(in2 + m);
      float32x4_t _r3 = vld1q_f32(in3 + m);
      float32x4_t _vbuff_c = vld1q_f32(thread_buf_c + m);
      _sum0 = vmulq_lane_f32(_r0, vget_low_f32(_b), 0);
      _sum0 = vmlaq_lane_f32(_sum0, _r1, vget_low_f32(_b), 1);
      _sum0 = vmlaq_lane_f32(_sum0, _r2, vget_high_f32(_b), 0);
      _sum0 = vmlaq_lane_f32(_sum0, _r3, vget_high_f32(_b), 1);
      _sum0 = vaddq_f32(_sum0, _vbuff_c);
      vst1q_f32(thread_buf_c + m, _sum0);
    }
    if (m < M) {
      float32x4_t _sum0 = vdupq_n_f32(0.0f);
      float32x4_t _r0 = vld1q_f32(in0 + m);
      float32x4_t _r1 = vld1q_f32(in1 + m);
      float32x4_t _r2 = vld1q_f32(in2 + m);
      float32x4_t _r3 = vld1q_f32(in3 + m);
      float32x4_t _vbuff_c = vld1q_f32(thread_buf_c + m);
      _sum0 = vmulq_lane_f32(_r0, vget_low_f32(_b), 0);
      _sum0 = vmlaq_lane_f32(_sum0, _r1, vget_low_f32(_b), 1);
      _sum0 = vmlaq_lane_f32(_sum0, _r2, vget_high_f32(_b), 0);
      _sum0 = vmlaq_lane_f32(_sum0, _r3, vget_high_f32(_b), 1);
      _sum0 = vaddq_f32(_sum0, _vbuff_c);
      switch (M - m) {
        case 3:
          vst1q_lane_f32(thread_buf_c + m + 2, _sum0, 2);
          // fall through to store the low two lanes
        case 2:
          vst1_f32(thread_buf_c + m, vget_low_f32(_sum0));
          break;
        case 1:
          vst1q_lane_f32(thread_buf_c + m, _sum0, 0);
          break;
      }
    }
  }
  // remaining n
#pragma omp parallel for
  for (int n = (N & 0xfffffffc); n < N; ++n) {
#ifdef _OPENMP
    const int tid = omp_get_thread_num();
#else
    const int tid = 0;
#endif  // _OPENMP
    register float *thread_buf_c = buf_c + tid * M;
    register const float *in0 = A + n * lda;
    register float32x4_t _b = vld1q_dup_f32(B + n);
    register float32x4_t _sum0;
    register int m = 0;
    for (; m < M - 3; m += 4) {
      float32x4_t _r0 = vld1q_f32(in0 + m);
      float32x4_t _vbuff_c = vld1q_f32(thread_buf_c + m);
      _sum0 = vmulq_f32(_r0, _b);
      _sum0 = vaddq_f32(_sum0, _vbuff_c);
      vst1q_f32(thread_buf_c + m, _sum0);
    }
    for (; m < M; ++m) {
      thread_buf_c[m] += in0[m] * B[n];
    }
  }
  // y := alpha * A' * X + beta * y
  // reduction: sum the per-thread partial results of A' * X into C
  register float32x4_t _valpha = vdupq_n_f32(alpha);
  if (beta == 0.f) {
#pragma omp parallel for
    for (int m = 0; m < M - 3; m += 4) {
      register float32x4_t _sum0 = vld1q_f32(buf_c + m);
      for (register int tid = 1; tid < threads_num; ++tid) {
        _sum0 = vaddq_f32(_sum0, vld1q_f32(buf_c + tid * M + m));
      }
      vst1q_f32(C + m, vmulq_f32(_sum0, _valpha));
    }
#pragma omp parallel for
    for (int m = (M & 0xfffffffc); m < M; ++m) {
      register float _sum0 = *(buf_c + m);
      for (register int tid = 1; tid < threads_num; ++tid) {
        _sum0 += *(buf_c + tid * M + m);
      }
      C[m] = _sum0 * alpha;
    }
  } else {  // beta != 0.f
    register float32x4_t _vbeta = vdupq_n_f32(beta);
#pragma omp parallel for
    for (int m = 0; m < M - 3; m += 4) {
      register float32x4_t _sum0 = vld1q_f32(buf_c + m);
      for (register int tid = 1; tid < threads_num; ++tid) {
        _sum0 = vaddq_f32(_sum0, vld1q_f32(buf_c + tid * M + m));
      }
      float32x4_t _vc = vld1q_f32(C + m);
      // C = alpha * sum + beta * C
      vst1q_f32(C + m, vmlaq_f32(vmulq_f32(_sum0, _valpha), _vbeta, _vc));
    }
#pragma omp parallel for
    for (int m = (M & 0xfffffffc); m < M; ++m) {
      register float _sum0 = *(buf_c + m);
      for (register int tid = 1; tid < threads_num; ++tid) {
        _sum0 += *(buf_c + tid * M + m);
      }
      C[m] = _sum0 * alpha + beta * C[m];
    }
  }
  // free buf_c
  paddle_mobile::memory::Free(buf_c);
}
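The pattern behind this fix generalizes beyond NEON. Below is a minimal sketch of the same idea in plain C++ with OpenMP — a reference rendering, not the library's code: the function name is illustrative, OpenMP is assumed to be enabled, and the matrix layout (`A[n * lda + m]`) matches the function above.
```cpp
#include <omp.h>

#include <vector>

// Sketch of y = alpha * A' * x + beta * y with per-thread buffers.
// Parallelizing over n and writing straight into y would race, since every
// column n touches every y[m]; each thread therefore accumulates into a
// private slice of buf, and a final reduction combines the slices.
void gemv_trans_reference(int M, int N, float alpha, const float *A, int lda,
                          const float *x, float beta, float *y) {
  const int threads = omp_get_max_threads();
  std::vector<float> buf(static_cast<size_t>(threads) * M, 0.f);
#pragma omp parallel for
  for (int n = 0; n < N; ++n) {
    float *tbuf = buf.data() + omp_get_thread_num() * M;
    for (int m = 0; m < M; ++m) {
      tbuf[m] += A[n * lda + m] * x[n];  // private slice: no data race
    }
  }
  for (int m = 0; m < M; ++m) {  // reduce the per-thread partial sums
    float sum = 0.f;
    for (int t = 0; t < threads; ++t) {
      sum += buf[t * M + m];
    }
    y[m] = (beta == 0.f) ? alpha * sum : alpha * sum + beta * y[m];
  }
}
```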
void sgemv_mx1(const bool trans, const int M, const int N, const float alpha,
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
@@ -22,6 +22,7 @@ fi
IMAGES_DIR="/data/local/tmp/images"
adb shell mkdir ${IMAGES_DIR}
LIB_PATH="../../build/release/arm-v7a/build/*"
#LIB_PATH="../../build/release/arm-v8a/build/*"
adb push ${EXE_FILE} ${EXE_DIR}
for file in ${LIB_PATH}
do
#!/usr/bin/env bash
# declare download paths of images and models
PADDLE_MOBILE_ROOT="$(pwd)/../"
IMAGES_AND_MODELS="opencl_test_src"
IMAGES_AND_MODELS_PATH="http://mms-graph.bj.bcebos.com/paddle-mobile/${IMAGES_AND_MODELS}.zip"
# download and unzip zip-files of images and models
mkdir ${PADDLE_MOBILE_ROOT}/download/
cd ${PADDLE_MOBILE_ROOT}/download/
wget -c ${IMAGES_AND_MODELS_PATH}
unzip -o ./${IMAGES_AND_MODELS}.zip
# create the models and images directories under test
mkdir ${PADDLE_MOBILE_ROOT}/test/models
mkdir ${PADDLE_MOBILE_ROOT}/test/images
# copy into the test directory
cp ./${IMAGES_AND_MODELS}/input_3x224x224_banana ${PADDLE_MOBILE_ROOT}/test/images/
cp -r ./${IMAGES_AND_MODELS}/mobilenet ${PADDLE_MOBILE_ROOT}/test/models/