diff --git a/doc/development_android.md b/doc/development_android.md
index d0da36aa82cfc9d4826f03bd2bdf1dd8f551965a..ac923557ef19ccb880667c68eda997264dc2cfe0 100644
--- a/doc/development_android.md
+++ b/doc/development_android.md
@@ -10,9 +10,10 @@
 Requirements: NDK 17 or later, cmake 3.0 or later

 ### Run the build
+
 In the paddle-mobile root directory, run the following commands:

-```
+```shell
 cd tools
 sh build.sh android
@@ -25,13 +26,12 @@
 sh build.sh android mobilenet googlenet
 ```
-After the build finishes, the generated so is located in the build/release/ directory
-
-The jni header files are at [https://github.com/PaddlePaddle/paddle-mobile/tree/develop/src/io/jni](https://github.com/PaddlePaddle/paddle-mobile/tree/develop/src/io/jni)
+After the build finishes, the generated `so` is located in the `build/release/` directory:

-The c++ header file is at [https://github.com/PaddlePaddle/paddle-mobile/blob/develop/src/io/paddle_inference_api.h](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/src/io/paddle_inference_api.h)
+- the jni header files are at [https://github.com/PaddlePaddle/paddle-mobile/tree/develop/src/io/jni](https://github.com/PaddlePaddle/paddle-mobile/tree/develop/src/io/jni)
+- the c++ header file is at [https://github.com/PaddlePaddle/paddle-mobile/blob/develop/src/io/paddle_inference_api.h](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/src/io/paddle_inference_api.h)

-The unit-test executables are located in the test/build directory.
+The unit-test executables are located in the `test/build` directory.

 If you run into environment problems, see the sections below.
@@ -39,26 +39,26 @@ The c++ header file is at [https://github.com/PaddlePaddle/paddle-mobile/blob/develop/
 ##### Download the Android NDK

-If Android Studio is installed on your machine, you can download and install the NDK directly from Android Studio
-
-Alternatively, you can download it yourself from [https://developer.android.com/ndk/](https://developer.android.com/ndk/), or fetch it with the following commands:
+If Android Studio is installed on your machine, you can download and install the `NDK` directly from Android Studio, or download it yourself from [https://developer.android.com/ndk/](https://developer.android.com/ndk/), or fetch it with the following commands:

 - Mac
+
-```
+```shell
 wget https://dl.google.com/android/repository/android-ndk-r17b-darwin-x86_64.zip
 unzip android-ndk-r17b-darwin-x86_64.zip
-
 ```
+
 - Linux
+
-```
+```shell
 wget https://dl.google.com/android/repository/android-ndk-r17b-linux-x86_64.zip
 unzip android-ndk-r17b-linux-x86_64.zip
 ```

 ##### Set the environment variable

-The standalone toolchain bundled with the project locates the NDK through the NDK_ROOT environment variable, so it must be configured:
+The standalone toolchain bundled with the project locates the NDK through the `NDK_ROOT` environment variable, so it must be configured:

-```
+```shell
 export NDK_ROOT="path to ndk"
 ```
@@ -66,16 +66,17 @@ export NDK_ROOT = "path to ndk"
 - Mac

-On the mac platform, cmake can be installed with homebrew
+On the mac platform, cmake can be installed with `homebrew`

-```
+```shell
 brew install cmake
-
 ```
+
 - Linux

-On linux, it can be installed with apt-get
-```
+On linux, it can be installed with `apt-get`
+
+```shell
 apt-get install cmake
 ```
@@ -84,24 +85,29 @@ apt-get install cmake
 If you want a smaller library, you can build one that supports only specific model structures.
 For example, running the following command:

-```
+```shell
 sh build.sh android googlenet
 ```
+
 produces a smaller library that supports googlenet.

 ## Building with a Docker container
+
 ### 1. Install docker
+
 To install docker, refer to the official documentation [https://docs.docker.com/install/](https://docs.docker.com/install/)
+
 ### 2. Set up the build environment with docker
+
 First enter the paddle-mobile directory and run `docker build`.
 Taking Linux/Mac as an example (on windows, it is recommended to run this in the 'Docker Quickstart Terminal'):

-```
+```shell
 $ docker build -t paddle-mobile:dev - < Dockerfile
 ```

 Use `docker images` to see the newly built image

-```
+```shell
 $ docker images
 REPOSITORY      TAG   IMAGE ID      CREATED       SIZE
 paddle-mobile   dev   33b146787711  45 hours ago  372MB
 ```

 ### 3. Build with docker
 Enter the paddle-mobile directory and run docker run

-```
+```shell
 $ docker run -it --mount type=bind,source=$PWD,target=/paddle-mobile paddle-mobile:dev
 root@5affd29d4fc5:/ # cd /paddle-mobile
 # generate the Makefile for the android build
 root@5affd29d4fc5:/ # rm CMakeCache.txt
 root@5affd29d4fc5:/ # cmake -DCMAKE_TOOLCHAIN_FILE=tools/toolchains/arm-android-neon.cmake
 # generate the Makefile for the linux build
 root@5affd29d4fc5:/ # rm CMakeCache.txt
 root@5affd29d4fc5:/ # cmake -DCMAKE_TOOLCHAIN_FILE=tools/toolchains/arm-linux-gnueabi.cmake
 ```
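+
+The image is expected to ship cmake 3.0 or later (see the requirements above). As an optional sanity check inside the container before generating a Makefile (a minimal sketch; the output depends on your image):
+
+```shell
+root@5affd29d4fc5:/ # cmake --version  # should report 3.0 or later
+root@5affd29d4fc5:/ # ls tools/toolchains/  # the toolchain files referenced above
+```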
 ### 4. Set build options
+
 Build options can be set with ccmake

 ```
@@ -148,40 +155,36 @@ root@5affd29d4fc5:/ # ccmake .
 root@5affd29d4fc5:/ # make
 ```

 ### 6. Inspect the build output
-The build output can be inspected on the host machine: under the paddle-mobile directory, in build and test/build; it can be transferred to the device and executed there with adb or scp
+
+The build output can be inspected on the host machine: under the paddle-mobile directory, in build and `test/build`; it can be transferred to the `device` and executed there with `adb` or `scp`

 ## Testing
+
 After the build finishes, we provide an automated test script that pushes the models and library files needed by the unit-test binaries to an Android device

-* Download the [mobilenet and test_image_1x3x224x224_float (preprocessed NCHW file) files](http://mms-graph.bj.bcebos.com/paddle-mobile/opencl_test_src.zip) needed for testing
+Run the script below. It downloads the [mobilenet and test_image_1x3x224x224_float (preprocessed NCHW file) files](http://mms-graph.bj.bcebos.com/paddle-mobile/opencl_test_src.zip) needed for testing, creates the model and image folders under the project's `test` directory, copies `mobilenet` into `paddle-mobile/test/models`, and copies `test_image_1x3x224x224_float` into `paddle-mobile/test/images`

-* Create the model and image folders
+```shell
+cd tools
+sh ./prepare_images_and_models.sh
 ```
-cd test
-mkdir models
-mkdir images
-```
-
-* Copy mobilenet into the paddle-mobile/test/models directory, and copy test_image_1x3x224x224_float into the paddle-mobile/test/images directory
 * Run the following commands to deploy the executables and the files needed for inference to the phone

-```
+```shell
 cd tools/android-debug-script
 sh push2android.sh
 ```

 * mobilenet cpu model inference results

-Assuming the mobilenet and test_image_1x3x224x224_float files have been pushed to the phone, run the following commands to run mobilenet cpu inference
+Assuming the mobilenet and `test_image_1x3x224x224_float` files have been pushed to the phone, run the following commands to run mobilenet cpu inference

-```
+```shell
 adb shell
 cd /data/local/tmp/bin/
 export LD_LIBRARY_PATH=.
 ./test-mobilenet
 ```
-
-
diff --git a/doc/development_android_GPU.md b/doc/development_android_GPU.md
index 03750260cf343692e52fd667cb797e27e7b6983d..2c8336104e96991ec99d0c0676c954293a919bb2 100644
--- a/doc/development_android_GPU.md
+++ b/doc/development_android_GPU.md
@@ -1,62 +1,57 @@
 ## paddle-mobile GPU Development Notes

-For build environment setup, refer to the development_android.md document
+For build environment setup, refer to the `development_android.md` document

 1. Download paddle-mobile

-```
+```shell
 git clone https://github.com/PaddlePaddle/paddle-mobile.git
 adb pull /system/vendor/lib/libOpenCL.so paddle-mobile/third_party/opencl
-Edit the paddle-mobile/CMakeLists.txt file and make the following change:
-option(GPU_CL "opencl gpu" OFF)->option(GPU_CL "opencl gpu" ON)
+# Edit the paddle-mobile/CMakeLists.txt file and make the following change:
+# option(GPU_CL "opencl gpu" OFF) -> option(GPU_CL "opencl gpu" ON)
 cd paddle-mobile/tools
-
 sh build.sh android
-
 ```

 2. Deploy the unit-test executables and models to the phone

-Download the mobilenet and test_image_1x3x224x224_float files needed for testing from: http://mms-graph.bj.bcebos.com/paddle-mobile/opencl_test_src.zip
-
-```
-cd ../test
-mkdir models
-mkdir images
+Run the script below. It downloads the [mobilenet and test_image_1x3x224x224_float (preprocessed NCHW file) files](http://mms-graph.bj.bcebos.com/paddle-mobile/opencl_test_src.zip) needed for testing, creates the model and image folders under the project's `test` directory, copies `mobilenet` into `paddle-mobile/test/models`, and copies `test_image_1x3x224x224_float` into `paddle-mobile/test/images`
+
+```shell
+cd tools
+sh ./prepare_images_and_models.sh
 ```
-Copy mobilenet into the paddle-mobile/test/models directory
-Copy test_image_1x3x224x224_float into the paddle-mobile/test/images directory

 Run the following commands to deploy the executables and the files needed for inference to the phone

-```
+```shell
 cd ../tools/android-debug-script
 sh push2android.sh
-
 ```

-3. Run the corresponding executable in adb shell (currently only mobilenet is supported; more network models will be supported later)
-```
+3. Run the corresponding executable in `adb shell` (currently only mobilenet is supported; more network models will be supported later)
+
+```shell
 adb shell
 cd /data/local/tmp/bin/
 export LD_LIBRARY_PATH=.
 ./test-mobilenetgpu
-
 ```
+
 4. mobilenet cpu model inference results

-Assuming the mobilenet and test_image_1x3x224x224_float files have been pushed to the phone, run the following commands to run mobilenet cpu inference
+Run the following commands to run mobilenet cpu inference

-```
+```shell
 adb shell
 cd /data/local/tmp/bin/
 export LD_LIBRARY_PATH=.
 ./test-mobilenet
-
 ```
+
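+To compare the GPU and CPU binaries back-to-back, assuming both test binaries and the test files are already on the device as prepared above, a one-liner such as this can be used:
+
+```shell
+adb shell "cd /data/local/tmp/bin && export LD_LIBRARY_PATH=. && ./test-mobilenetgpu && ./test-mobilenet"
+```
+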
 5. Inference results

 Phone model: Xiaomi 6 (CPU 835, GPU Adreno 540)
@@ -78,8 +73,3 @@ export LD_LIBRARY_PATH=.
 1 thread: 90ms
 2 threads: 50ms
 4 threads: 29ms
-
-
-
-
-
diff --git a/src/memory/t_malloc.cpp b/src/memory/t_malloc.cpp
index 2fb74d18809f174810866a990396bb0279d256f5..69756585027e2f994c28dc0b8c37d2ac598c2c5f 100644
--- a/src/memory/t_malloc.cpp
+++ b/src/memory/t_malloc.cpp
@@ -57,9 +57,11 @@ void *Alloc(size_t size) {
   void *r = reinterpret_cast<void *>(reinterpret_cast<size_t>(p + offset) &
                                      (~(MALLOC_ALIGN - 1)));
   static_cast<void **>(r)[-1] = p;
-  return r;
+  return r;  // the returned memory is uninitialized; initialize it yourself if needed
 }

+// if you want to use this ptr again after Free, you must reset `ptr` to
+// nullptr yourself; Free cannot do that for you
 void Free(void *ptr) {
   if (ptr) {
     free(static_cast<void **>(ptr)[-1]);
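The comments added to `Alloc` and `Free` above spell out the allocator's contract: the returned block is aligned but uninitialized, and `Free` cannot null out the caller's pointer. A minimal usage sketch of that contract (the header path, the `Example` function, and the buffer size are illustrative assumptions, not part of this patch):

```cpp
#include <cstring>

#include "memory/t_malloc.h"  // assumed header for paddle_mobile::memory::Alloc/Free

void Example() {
  const size_t kCount = 1024;  // illustrative size
  float *buf = static_cast<float *>(
      paddle_mobile::memory::Alloc(kCount * sizeof(float)));
  // Alloc returns uninitialized memory: zero it yourself if the
  // consumer expects defined contents.
  memset(buf, 0, kCount * sizeof(float));
  // ... use buf ...
  paddle_mobile::memory::Free(buf);
  buf = nullptr;  // reset the pointer so a stale value cannot be reused
}
```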
diff --git a/src/operators/math/gemm/gemm_kernel.h b/src/operators/math/gemm/gemm_kernel.h
index 2d2985a39c822b8bec7a090b04c9472cbd6b87f4..fcffd5ec86daf52e8e4a07dc6dead8766b1ba123 100644
--- a/src/operators/math/gemm/gemm_kernel.h
+++ b/src/operators/math/gemm/gemm_kernel.h
@@ -420,94 +420,149 @@ void sgemv_notrans_mx1(const int M, const int N, const float alpha,
 void sgemv_trans_mx1(const int M, const int N, const float alpha,
                      const float *A, const int lda, const float *B,
                      const float beta, float *C) {
-  float32x4_t _valpha = vdupq_n_f32(alpha);
-  if (beta == 0.f) {
-    float32x4_t vzero = vdupq_n_f32(0.f);
-    for (int m = 0; m < M - 3; m += 4) {
-      vst1q_f32(C + m, vzero);
-    }
-    for (int m = (M & 0xfffffffc); m < M; ++m) {
-      C[m] = 0.f;
-    }
-  } else {
-    float32x4_t vbeta = vdupq_n_f32(beta);
-    for (int m = 0; m < M - 3; m += 4) {
-      float32x4_t _vc = vld1q_f32(C + m);
-      _vc = vmulq_f32(_vc, vbeta);
-      vst1q_f32(C + m, _vc);
-    }
-    for (int m = (M & 0xfffffffc); m < M; ++m) {
-      C[m] *= beta;
-    }
-  }
+  // create buf_c to hold each thread's partial result
+#ifdef _OPENMP
+  int threads_num = omp_get_max_threads();
+#else
+  int threads_num = 1;
+#endif  // _OPENMP
+  float *buf_c = static_cast<float *>(
+      paddle_mobile::memory::Alloc(sizeof(float) * threads_num * M));
+  memset(buf_c, 0, threads_num * M * sizeof(float));

 #pragma omp parallel for
   for (int n = 0; n < N - 3; n += 4) {
-    const float *in0 = A + n * lda;
-    const float *in1 = in0 + lda;
-    const float *in2 = in1 + lda;
-    const float *in3 = in2 + lda;
-    float32x4_t _b = vld1q_f32(B + n);
-    float32x4_t _sum0;
-    int m = 0;
+#ifdef _OPENMP
+    const int tid = omp_get_thread_num();
+#else
+    const int tid = 0;
+#endif  // _OPENMP
+    register float *thread_buf_c = buf_c + tid * M;
+    register const float *in0 = A + n * lda;
+    register const float *in1 = in0 + lda;
+    register const float *in2 = in1 + lda;
+    register const float *in3 = in2 + lda;
+    register float32x4_t _b = vld1q_f32(B + n);
+    register float32x4_t _sum0;
+    register int m = 0;
     for (; m < M - 3; m += 4) {
       float32x4_t _r0 = vld1q_f32(in0 + m);
       float32x4_t _r1 = vld1q_f32(in1 + m);
       float32x4_t _r2 = vld1q_f32(in2 + m);
       float32x4_t _r3 = vld1q_f32(in3 + m);
-      float32x4_t _vc = vld1q_f32(C + m);
+      float32x4_t _vbuff_c = vld1q_f32(thread_buf_c + m);
       _sum0 = vmulq_lane_f32(_r0, vget_low_f32(_b), 0);
       _sum0 = vmlaq_lane_f32(_sum0, _r1, vget_low_f32(_b), 1);
       _sum0 = vmlaq_lane_f32(_sum0, _r2, vget_high_f32(_b), 0);
       _sum0 = vmlaq_lane_f32(_sum0, _r3, vget_high_f32(_b), 1);
-      _sum0 = vmulq_f32(_sum0, _valpha);
-      _sum0 = vaddq_f32(_sum0, _vc);
-      vst1q_f32(C + m, _sum0);
+      _sum0 = vaddq_f32(_sum0, _vbuff_c);
+
+      vst1q_f32(thread_buf_c + m, _sum0);
     }
     if (m < M) {
+      float32x4_t _sum0 = vdupq_n_f32(0.0f);
       float32x4_t _r0 = vld1q_f32(in0 + m);
       float32x4_t _r1 = vld1q_f32(in1 + m);
       float32x4_t _r2 = vld1q_f32(in2 + m);
       float32x4_t _r3 = vld1q_f32(in3 + m);
-      float32x4_t _vc = vld1q_f32(C + m);
+      float32x4_t _vbuff_c = vld1q_f32(thread_buf_c + m);
       _sum0 = vmulq_lane_f32(_r0, vget_low_f32(_b), 0);
       _sum0 = vmlaq_lane_f32(_sum0, _r1, vget_low_f32(_b), 1);
       _sum0 = vmlaq_lane_f32(_sum0, _r2, vget_high_f32(_b), 0);
       _sum0 = vmlaq_lane_f32(_sum0, _r3, vget_high_f32(_b), 1);
-      _sum0 = vmulq_f32(_sum0, _valpha);
-      _sum0 = vaddq_f32(_sum0, _vc);
+      _sum0 = vaddq_f32(_sum0, _vbuff_c);
       switch (M - m) {
         case 3:
-          vst1q_lane_f32(C + m + 2, _sum0, 2);
+          vst1q_lane_f32(thread_buf_c + m + 2, _sum0, 2);
         case 2:
-          vst1_f32(C + m, vget_low_f32(_sum0));
+          vst1_f32(thread_buf_c + m, vget_low_f32(_sum0));
           break;
         case 1:
-          vst1q_lane_f32(C + m, _sum0, 0);
+          vst1q_lane_f32(thread_buf_c + m, _sum0, 0);
           break;
       }
     }
   }
+
+  // remaining n
 #pragma omp parallel for
   for (int n = (N & 0xfffffffc); n < N; ++n) {
-    const float *in0 = A + n * lda;
-    float32x4_t _b = vld1q_dup_f32(B + n);
-    float32x4_t _sum0;
-    int m = 0;
+#ifdef _OPENMP
+    const int tid = omp_get_thread_num();
+#else
+    const int tid = 0;
+#endif  // _OPENMP
+    register float *thread_buf_c = buf_c + tid * M;
+    register const float *in0 = A + n * lda;
+    register float32x4_t _b = vld1q_dup_f32(B + n);
+    register float32x4_t _sum0;
+    register int m = 0;
     for (; m < M - 3; m += 4) {
       float32x4_t _r0 = vld1q_f32(in0 + m);
-      _sum0 = vld1q_f32(C + m);
-      _r0 = vmulq_f32(_r0, _b);
-      _r0 = vmulq_f32(_valpha, _r0);
-      _sum0 = vaddq_f32(_sum0, _r0);
-      vst1q_f32(C + m, _sum0);
+      float32x4_t _vbuff_c = vld1q_f32(thread_buf_c + m);
+      _sum0 = vmulq_f32(_r0, _b);
+      _sum0 = vaddq_f32(_sum0, _vbuff_c);
+      vst1q_f32(thread_buf_c + m, _sum0);
     }
     for (; m < M; ++m) {
-      C[m] += alpha * (in0[m] * B[n]);
+      thread_buf_c[m] += in0[m] * B[n];
     }
   }
+
+  // y := alpha * A' * X + beta * y
+  // reduction: sum the per-thread partial results in buf_c into C,
+  // then apply alpha and beta
+  register float32x4_t _valpha = vdupq_n_f32(alpha);
+  if (beta == 0.f) {
+    #pragma omp parallel for
+    for (int m = 0; m < M - 3; m += 4) {
+      register float32x4_t _sum0 = vld1q_f32(buf_c + m);
+      for (int tid = 1; tid < threads_num; ++tid) {
+        _sum0 += vld1q_f32(buf_c + tid * M + m);
+      }
+      vst1q_f32(C + m, _sum0 * _valpha);
+    }
+    #pragma omp parallel for
+    for (int m = (M & 0xfffffffc); m < M; ++m) {
+      register float _sum0 = *(buf_c + m);
+      for (register int tid = 1; tid < threads_num; ++tid) {
+        _sum0 += *(buf_c + tid * M + m);
+      }
+      C[m] = _sum0 * alpha;
+    }
+  } else {  // beta != 0.f
+    register float32x4_t _vbeta = vdupq_n_f32(beta);
+    #pragma omp parallel for
+    for (int m = 0; m < M - 3; m += 4) {
+      register float32x4_t _sum0 = vld1q_f32(buf_c + m);
+      for (register int tid = 1; tid < threads_num; ++tid) {
+        _sum0 += vld1q_f32(buf_c + tid * M + m);
+      }
+      float32x4_t _vc = vld1q_f32(C + m);
+      vst1q_f32(C + m, _sum0 * _valpha + _vbeta * _vc);
+    }
+    #pragma omp parallel for
+    for (int m = (M & 0xfffffffc); m < M; ++m) {
+      register float _sum0 = *(buf_c + m);
+      for (register int tid = 1; tid < threads_num; ++tid) {
+        _sum0 += *(buf_c + tid * M + m);
+      }
+      C[m] = _sum0 * alpha + beta * C[m];
+    }
+  }
+
+  // free buf_c
+  paddle_mobile::memory::Free(buf_c);
 }

 void sgemv_mx1(const bool trans, const int M, const int N, const float alpha,
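Stripped of the NEON intrinsics, the new `sgemv_trans_mx1` follows a simple pattern: each thread accumulates its share of columns into a private slice of `buf_c`, and a final reduction folds the slices into `C` while applying `alpha` and `beta`. A scalar sketch of the same scheme, for reference only (`sgemv_trans_ref` is an illustrative name, not part of this patch):

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// y := alpha * A' * x + beta * y, with one private buffer per thread so
// that no two threads ever write the same C[m] concurrently.
void sgemv_trans_ref(int M, int N, float alpha, const float *A, int lda,
                     const float *B, float beta, float *C) {
  int threads = 1;
#ifdef _OPENMP
  threads = omp_get_max_threads();
#endif
  std::vector<float> buf(static_cast<size_t>(threads) * M, 0.f);
#pragma omp parallel for
  for (int n = 0; n < N; ++n) {
    int tid = 0;
#ifdef _OPENMP
    tid = omp_get_thread_num();
#endif
    float *local = buf.data() + tid * M;       // this thread's private slice
    const float *col = A + n * lda;            // column n of A, i.e. row n of A'
    for (int m = 0; m < M; ++m) {
      local[m] += col[m] * B[n];
    }
  }
  // reduction: fold the per-thread partials into C, applying alpha/beta
  for (int m = 0; m < M; ++m) {
    float sum = 0.f;
    for (int t = 0; t < threads; ++t) {
      sum += buf[t * M + m];
    }
    C[m] = (beta == 0.f) ? alpha * sum : alpha * sum + beta * C[m];
  }
}
```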
diff --git a/test/common/test_gemm_accuracy.cpp b/test/common/test_gemm_accuracy.cpp
index 174459d3f58e82b85b5b189e8da6c0c9cb980a13..fc1041bde0b2bc78d809435a5487052c3db95a5d 100644
--- a/test/common/test_gemm_accuracy.cpp
+++ b/test/common/test_gemm_accuracy.cpp
@@ -1,11 +1,8 @@
 /* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
-
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at
-
 http://www.apache.org/licenses/LICENSE-2.0
-
 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
diff --git a/tools/android-debug-script/push2android.sh b/tools/android-debug-script/push2android.sh
index 68cbc6cf858ed9fbf7f1fd2522cd897309e31f78..a367bb6a29ad0c48f915ad0e67385811df4d1012 100644
--- a/tools/android-debug-script/push2android.sh
+++ b/tools/android-debug-script/push2android.sh
@@ -22,6 +22,7 @@ fi
 IMAGES_DIR="/data/local/tmp/images"
 adb shell mkdir ${IMAGES_DIR}
 LIB_PATH="../../build/release/arm-v7a/build/*"
+# LIB_PATH="../../build/release/arm-v8a/build/*"  # uncomment for arm-v8a builds
 adb push ${EXE_FILE} ${EXE_DIR}
 for file in ${LIB_PATH}
 do
diff --git a/tools/prepare_images_and_models.sh b/tools/prepare_images_and_models.sh
new file mode 100755
index 0000000000000000000000000000000000000000..6f224778d9014a661940eae0cb6bb375846dc204
--- /dev/null
+++ b/tools/prepare_images_and_models.sh
@@ -0,0 +1,20 @@
+#!/usr/bin/env bash
+
+# declare the download paths of the images and models
+PADDLE_MOBILE_ROOT="$(pwd)/../"
+IMAGES_AND_MODELS="opencl_test_src"
+IMAGES_AND_MODELS_PATH="http://mms-graph.bj.bcebos.com/paddle-mobile/${IMAGES_AND_MODELS}.zip"
+
+# download and unzip the images-and-models archive
+mkdir ${PADDLE_MOBILE_ROOT}/download/
+cd ${PADDLE_MOBILE_ROOT}/download/
+wget -c ${IMAGES_AND_MODELS_PATH}
+unzip -o ./${IMAGES_AND_MODELS}.zip
+
+# create the models and images directories under test
+mkdir ${PADDLE_MOBILE_ROOT}/test/models
+mkdir ${PADDLE_MOBILE_ROOT}/test/images
+
+# copy the test inputs into the test directories
+cp ./${IMAGES_AND_MODELS}/input_3x224x224_banana ${PADDLE_MOBILE_ROOT}/test/images/
+cp -r ./${IMAGES_AND_MODELS}/mobilenet ${PADDLE_MOBILE_ROOT}/test/models/
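The script derives `PADDLE_MOBILE_ROOT` from the current working directory, so it should be run from `tools/`, exactly as the updated documents above describe:

```shell
cd tools
sh ./prepare_images_and_models.sh
# afterwards the test inputs are under ../test/images and ../test/models
```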