Commit 692d6aeb authored by BUG1989

initial support for the TIM-VX NPU backend

Parent 89e0b3d6
......@@ -52,4 +52,7 @@ build*/
.vscode
# 3rd party files depend on ACL
3rdparty/
\ No newline at end of file
3rdparty/
# 3rd party files depend on TIM-VX
src/dev/tim-vx/src/
......@@ -83,6 +83,7 @@ option(TENGINE_ENABLE_ACL "Build with Arm Compute Library(ACL) support" OFF)
option(TENGINE_ENABLE_VULKAN "Build with Vulkan GPU compute support" OFF)
option(TENGINE_ENABLE_TENSORRT "Build with nVIDIA TensorRT support" OFF)
option(TENGINE_ENABLE_CUDABACKEND "Build with nVIDIA cuda support" OFF)
option(TENGINE_ENABLE_TIM_VX "Build with VSI Tensor Interface Module for OpenVX support" OFF)
# add_definitions(-DCONFIG_DISABLE_PARAM_ACCESS)
# add_definitions(-DCONFIG_INTERN_ALLOCATOR)
......
......@@ -77,6 +77,7 @@ Tengine Lite references and draws on the following projects:
- [ACL](https://github.com/ARM-software/ComputeLibrary)
- [stb](https://github.com/nothings/stb)
- [convertmodel](https://convertmodel.com)
- [TIM-VX](https://github.com/VeriSilicon/TIM-VX)
## License
......
......@@ -45,7 +45,7 @@ The core code of Tengine Lite consists of 4 modules:
### Model Convert tool
- [Pre-compiled version](https://github.com/OAID/Tengine-Convert-Tools/releases/download/v0.1/tm_convert_tool): Pre-compiled model convert tool is provided on Linux system;
- [Pre-compiled version](https://github.com/OAID/Tengine/releases/download/lite-v1.2/convert_tool.zip): Pre-compiled model convert tool is provided on Linux system;
- [Online Convert tool](https://convertmodel.com/#outputFormat=tengine): Based on WebAssembly (the models are converted locally by browsers, no private data will be uploaded);
- [Source Compilation](https://github.com/OAID/Tengine-Convert-Tools): Refer to **Tengine-Convert-Tools** project, convert tool could be built by users.
......@@ -66,11 +66,13 @@ Tengine Lite got ideas and developed based on these projects:
- [MegEngine](https://github.com/MegEngine/MegEngine)
- [ONNX](https://github.com/onnx/onnx)
- [ncnn](https://github.com/Tencent/ncnn)
- [FeatherCNN](https://github.com/Tencent/FeatherCNN)
- [MNN](https://github.com/alibaba/MNN)
- [Paddle Lite](https://github.com/PaddlePaddle/Paddle-Lite)
- [ACL](https://github.com/ARM-software/ComputeLibrary)
- [stb](https://github.com/nothings/stb)
- [convertmodel](https://convertmodel.com)
- [TIM-VX](https://github.com/VeriSilicon/TIM-VX)
## License
......
doc/architecture.png (binary image updated: 97.9 KB → 103.1 KB)
# Tengine Lite VeriSilicon TIM-VX User Manual
## Brief
TIM-VX is a software integration module provided by VeriSilicon to facilitate the deployment of neural networks on OpenVX-enabled ML accelerators.
Tengine Lite supports integration with VeriSilicon's TIM-VX library to run CNN inference on the Khadas VIM3 (Amlogic A311D).
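In the demo added by this commit, the NPU is selected through Tengine's standard C API by attaching the "TIMVX" device to a context before the graph is created. A minimal sketch of that flow (the helper name is illustrative; error handling and input setup are trimmed, the complete example is `examples/tm_classification_timvx.c`):
```c
#include <stdio.h>
#include "tengine_c_api.h"

int run_on_timvx(const char* model_file)
{
    if (init_tengine() != 0)
        return -1;

    /* create an empty context and attach the TIM-VX device to it */
    context_t timvx_context = create_context("timvx", 1);
    if (add_context_device(timvx_context, "TIMVX") < 0)
        return -1;

    /* graphs created on this context are offloaded to the NPU where supported */
    graph_t graph = create_graph(timvx_context, "tengine", model_file);
    if (NULL == graph)
        return -1;

    /* ... set input shape/buffer, prerun_graph_multithread(), run_graph() ... */

    postrun_graph(graph);
    destroy_graph(graph);
    release_tengine();
    return 0;
}
```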
## Build
At present, the following steps are only supported on the Khadas VIM3.
### TIM-VX NPU Library
#### Download the TIM-VX source code
```bash
$ git clone https://github.com/VeriSilicon/TIM-VX.git
```
#### Download the prebuilt SDK for the A311D
```bash
$ wget -c https://github.com/VeriSilicon/TIM-VX/releases/download/v1.1.28/aarch64_A311D_D312513_A294074_R311680_T312233_O312045.tgz
$ tar zxvf aarch64_A311D_D312513_A294074_R311680_T312233_O312045.tgz
$ mv aarch64_A311D_D312513_A294074_R311680_T312233_O312045 prebuild-sdk-a311d
```
### Tengine Lite
#### Download Tengine Lite
```bash
$ git clone https://github.com/OAID/Tengine.git tengine-lite
$ cd tengine-lite
```
#### Prepare dependency files
```bash
$ cd <tengine-lite-root-dir>
$ mkdir -p ./3rdparty/tim-vx/lib/aarch64
$ mkdir -p ./3rdparty/tim-vx/include
$ cp -rf ../TIM-VX/include/* ./3rdparty/tim-vx/include/
$ cp -rf ../TIM-VX/src ./src/dev/tim-vx/
$ cp -rf ../prebuild-sdk-a311d/include/* ./3rdparty/tim-vx/include/
$ cp -rf ../prebuild-sdk-a311d/lib/*.so ./3rdparty/tim-vx/lib/aarch64/
```
#### Build Tengine Lite
```bash
$ mkdir build && cd build
$ cmake -DTENGINE_ENABLE_TIM_VX=ON -DTENGINE_ENABLE_TIM_VX_INTEGRATION=ON ..
$ make -j4
$ make install
```
## Demo
#### Dependent libraries
```
3rdparty/tim-vx/lib/
├── libOpenVX.so.1
├── libVSC.so
├── libGAL.so
├── libArchModelSw.so
└── libNNArchPerf.so
build-tim-vx-arm64/install/lib/
└── libtengine-lite.so
```
On the Khadas VIM3, these prebuilt libraries need to replace the corresponding ones under the board's /lib/ path.
#### Set uint8 inference mode
The TIM-VX library requires a uint8 quantized network model, so the runtime precision must be set to uint8:
```c
/* set runtime options */
struct options opt;
opt.num_thread = num_thread;
opt.cluster = TENGINE_CLUSTER_ALL;
opt.precision = TENGINE_MODE_UINT8;
opt.affinity = 0;
```
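Because the graph then runs in uint8, the float image produced by the usual preprocessing has to be quantized with the input tensor's scale and zero point (read back via `get_tensor_quant_param()` after prerun). A minimal sketch of that conversion (the helper name is illustrative), matching what the demo's `get_input_uint8_data()` does:
```c
#include <math.h>
#include <stdint.h>

/* quantize a preprocessed float buffer to uint8 using the input tensor's
   asymmetric quantization parameters, clamping to the valid [0, 255] range */
static void quantize_input(const float* fp32, uint8_t* u8, int count,
                           float input_scale, int input_zero_point)
{
    for (int i = 0; i < count; i++)
    {
        int v = (int)round(fp32[i] / input_scale + (float)input_zero_point);
        if (v > 255) v = 255;
        if (v < 0)   v = 0;
        u8[i] = (uint8_t)v;
    }
}
```
Output tensors go the other way: the uint8 result is dequantized with the output tensor's scale and zero point before the top-k scores are printed.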
#### Result
```
[khadas@Khadas tengine-lite]# ./tm_classification_timvx -m squeezenet_uint8.tmfile -i cat.jpg -r 1 -s 0.017,0.017,0.017 -r 10
Tengine plugin allocator TIMVX is registered.
Image height not specified, use default 227
Image width not specified, use default 227
Mean value not specified, use default 104.0, 116.7, 122.7
tengine-lite library version: 1.2-dev
TIM-VX prerun.
model file : squeezenet_uint8.tmfile
image file : cat.jpg
img_h, img_w, scale[3], mean[3] : 227 227 , 0.017 0.017 0.017, 104.0 116.7 122.7
Repeat 10 times, thread 1, avg time 2.95 ms, max_time 3.42 ms, min_time 2.76 ms
--------------------------------------
34.786182, 278
33.942883, 287
33.732056, 280
32.045452, 277
30.780502, 282
```
......@@ -7,7 +7,8 @@
- [ ] fix the Float32 bugs of Vulkan
- [ ] support the model type of PaddlePaddle
- [x] support the model type of OneFlow
- [ ] open-source the plugin implementation of NPU (A311D)
- [x] open-source the plugin implementation of NPU (A311D)
- [x] open-source the plugin implementation of CUDA
- [x] open-source the plugin implementation of TensorRT
- [ ] open-source the plugin implementation of NNIE
- [x] add more test cases
......@@ -25,6 +25,7 @@ tengine_example(tm_classification_int8 tm_classification_int8.c)
tengine_example(tm_classification_uint8 tm_classification_uint8.c)
tengine_example(tm_classification_vulkan tm_classification_vulkan.c)
tengine_example(tm_classification_acl tm_classification_acl.c)
tengine_example(tm_classification_timvx tm_classification_timvx.c)
tengine_example(tm_classification_trt tm_classification_trt.cpp)
tengine_example(tm_classification_cuda tm_classification_cuda.cpp)
tengine_example(tm_mobilenet_ssd tm_mobilenet_ssd.c)
......
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* License); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/*
* Copyright (c) 2020, OPEN AI LAB
* Author: qtang@openailab.com
*/
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <float.h>
#include "common.h"
#include "tengine_c_api.h"
#include "tengine_operations.h"
#define DEFAULT_IMG_H 227
#define DEFAULT_IMG_W 227
#define DEFAULT_SCALE1 1.f
#define DEFAULT_SCALE2 1.f
#define DEFAULT_SCALE3 1.f
#define DEFAULT_MEAN1 104.007
#define DEFAULT_MEAN2 116.669
#define DEFAULT_MEAN3 122.679
#define DEFAULT_LOOP_COUNT 1
#define DEFAULT_THREAD_COUNT 1
#define DEFAULT_CPU_AFFINITY 255
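/* quantize the preprocessed float image to uint8 with the input tensor's scale and zero point, clamping to [0, 255] */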
void get_input_uint8_data(const char* image_file, uint8_t* input_data, int img_h, int img_w, float* mean, float* scale,
float input_scale, int zero_point)
{
image img = imread_process(image_file, img_w, img_h, mean, scale);
float* image_data = ( float* )img.data;
for (int i = 0; i < img_w * img_h * 3; i++)
{
int udata = (round)(image_data[i] / input_scale + (float)zero_point);
if (udata > 255)
udata = 255;
else if (udata < 0)
udata = 0;
input_data[i] = udata;
}
free_image(img);
}
int tengine_classify(const char* model_file, const char* image_file, int img_h, int img_w, float* mean, float* scale,
int loop_count, int num_thread, int affinity)
{
/* set runtime options */
struct options opt;
opt.num_thread = num_thread;
opt.cluster = TENGINE_CLUSTER_ALL;
opt.precision = TENGINE_MODE_UINT8;
opt.affinity = affinity;
/* initialize tengine */
if (init_tengine() != 0)
{
fprintf(stderr, "Initial tengine failed.\n");
return -1;
}
fprintf(stderr, "tengine-lite library version: %s\n", get_tengine_version());
/* create VeriSilicon TIM-VX backend */
context_t timvx_context = create_context("timvx", 1);
int rtt = add_context_device(timvx_context, "TIMVX");
if (0 > rtt)
{
fprintf(stderr, " add_context_device VSI DEVICE failed.\n");
return -1;
}
/* create graph, load tengine model xxx.tmfile */
graph_t graph = create_graph(timvx_context, "tengine", model_file);
if (NULL == graph)
{
fprintf(stderr, "Create graph failed.\n");
fprintf(stderr, "errno: %d \n", get_tengine_errno());
return -1;
}
/* set the input shape to initial the graph, and prerun graph to infer shape */
int img_size = img_h * img_w * 3;
int dims[] = {1, 3, img_h, img_w}; // nchw
uint8_t* input_data = ( uint8_t* )malloc(img_size);
tensor_t input_tensor = get_graph_input_tensor(graph, 0, 0);
if (input_tensor == NULL)
{
fprintf(stderr, "Get input tensor failed\n");
return -1;
}
if (set_tensor_shape(input_tensor, dims, 4) < 0)
{
fprintf(stderr, "Set input tensor shape failed\n");
return -1;
}
if (set_tensor_buffer(input_tensor, input_data, img_size) < 0)
{
fprintf(stderr, "Set input tensor buffer failed\n");
return -1;
}
/* prerun graph, set work options(num_thread, cluster, precision) */
if (prerun_graph_multithread(graph, opt) < 0)
{
fprintf(stderr, "Prerun multithread graph failed.\n");
return -1;
}
/* prepare process input data, set the data mem to input tensor */
float input_scale = 0.f;
int input_zero_point = 0;
get_tensor_quant_param(input_tensor, &input_scale, &input_zero_point, 1);
get_input_uint8_data(image_file, input_data, img_h, img_w, mean, scale, input_scale, input_zero_point);
/* run graph */
double min_time = DBL_MAX;
double max_time = DBL_MIN;
double total_time = 0.;
for (int i = 0; i < loop_count; i++)
{
double start = get_current_time();
if (run_graph(graph, 1) < 0)
{
fprintf(stderr, "Run graph failed\n");
return -1;
}
double end = get_current_time();
double cur = end - start;
total_time += cur;
if (min_time > cur)
min_time = cur;
if (max_time < cur)
max_time = cur;
}
fprintf(stderr, "\nmodel file : %s\n", model_file);
fprintf(stderr, "image file : %s\n", image_file);
fprintf(stderr, "img_h, img_w, scale[3], mean[3] : %d %d , %.3f %.3f %.3f, %.1f %.1f %.1f\n", img_h, img_w,
scale[0], scale[1], scale[2], mean[0], mean[1], mean[2]);
fprintf(stderr, "Repeat %d times, thread %d, avg time %.2f ms, max_time %.2f ms, min_time %.2f ms\n", loop_count,
num_thread, total_time / loop_count, max_time, min_time);
fprintf(stderr, "--------------------------------------\n");
/* get the result of classification */
tensor_t output_tensor = get_graph_output_tensor(graph, 0, 0);
uint8_t* output_u8 = ( uint8_t* )get_tensor_buffer(output_tensor);
int output_size = get_tensor_buffer_size(output_tensor);
/* dequant */
float output_scale = 0.f;
int output_zero_point = 0;
get_tensor_quant_param(output_tensor, &output_scale, &output_zero_point, 1);
float* output_data = ( float* )malloc(output_size * sizeof(float));
for (int i = 0; i < output_size; i++)
output_data[i] = (( float )output_u8[i] - ( float )output_zero_point) * output_scale;
print_topk(output_data, output_size, 5);
fprintf(stderr, "--------------------------------------\n");
/* release tengine */
free(input_data);
free(output_data);
postrun_graph(graph);
destroy_graph(graph);
release_tengine();
return 0;
}
void show_usage()
{
fprintf(
stderr,
"[Usage]: [-h]\n [-m model_file] [-i image_file]\n [-g img_h,img_w] [-s scale[0],scale[1],scale[2]] [-w "
"mean[0],mean[1],mean[2]] [-r loop_count] [-t thread_count] [-a cpu_affinity]\n");
fprintf(
stderr,
"\nmobilenet example: \n ./classification -m /path/to/mobilenet.tmfile -i /path/to/img.jpg -g 224,224 -s "
"0.017,0.017,0.017 -w 104.007,116.669,122.679\n");
}
int main(int argc, char* argv[])
{
int loop_count = DEFAULT_LOOP_COUNT;
int num_thread = DEFAULT_THREAD_COUNT;
int cpu_affinity = DEFAULT_CPU_AFFINITY;
char* model_file = NULL;
char* image_file = NULL;
float img_hw[2] = {0.f};
int img_h = 0;
int img_w = 0;
float mean[3] = {-1.f, -1.f, -1.f};
float scale[3] = {0.f, 0.f, 0.f};
int res;
while ((res = getopt(argc, argv, "m:i:l:g:s:w:r:t:a:h")) != -1)
{
switch (res)
{
case 'm':
model_file = optarg;
break;
case 'i':
image_file = optarg;
break;
case 'g':
split(img_hw, optarg, ",");
img_h = ( int )img_hw[0];
img_w = ( int )img_hw[1];
break;
case 's':
split(scale, optarg, ",");
break;
case 'w':
split(mean, optarg, ",");
break;
case 'r':
loop_count = atoi(optarg);
break;
case 't':
num_thread = atoi(optarg);
break;
case 'a':
cpu_affinity = atoi(optarg);
break;
case 'h':
show_usage();
return 0;
default:
break;
}
}
/* check files */
if (model_file == NULL)
{
fprintf(stderr, "Error: Tengine model file not specified!\n");
show_usage();
return -1;
}
if (image_file == NULL)
{
fprintf(stderr, "Error: Image file not specified!\n");
show_usage();
return -1;
}
if (!check_file_exist(model_file) || !check_file_exist(image_file))
return -1;
if (img_h == 0)
{
img_h = DEFAULT_IMG_H;
fprintf(stderr, "Image height not specified, use default %d\n", img_h);
}
if (img_w == 0)
{
img_w = DEFAULT_IMG_W;
fprintf(stderr, "Image width not specified, use default %d\n", img_w);
}
if (scale[0] == 0.f || scale[1] == 0.f || scale[2] == 0.f)
{
scale[0] = DEFAULT_SCALE1;
scale[1] = DEFAULT_SCALE2;
scale[2] = DEFAULT_SCALE3;
fprintf(stderr, "Scale value not specified, use default %.1f, %.1f, %.1f\n", scale[0], scale[1], scale[2]);
}
if (mean[0] == -1.0 || mean[1] == -1.0 || mean[2] == -1.0)
{
mean[0] = DEFAULT_MEAN1;
mean[1] = DEFAULT_MEAN2;
mean[2] = DEFAULT_MEAN3;
fprintf(stderr, "Mean value not specified, use default %.1f, %.1f, %.1f\n", mean[0], mean[1], mean[2]);
}
if (tengine_classify(model_file, image_file, img_h, img_w, mean, scale, loop_count, num_thread, cpu_affinity) < 0)
return -1;
return 0;
}
......@@ -172,8 +172,6 @@ int tengine_classify(const char* model_file, const char* image_file, int img_h,
/* release tengine */
free(input_data);
free(output_data);
release_graph_tensor(input_tensor);
release_graph_tensor(output_tensor);
postrun_graph(graph);
destroy_graph(graph);
release_tengine();
......
......@@ -200,6 +200,87 @@ if (TENGINE_ENABLE_TENSORRT)
file(GLOB_RECURSE TENGINE_BACKEND_TENSORRT_OPS "${CMAKE_CURRENT_SOURCE_DIR}/dev/tensorrt/op/*.cpp")
endif ()
if (TENGINE_ENABLE_TIM_VX)
if (${TENGINE_TARGET_PROCESSOR} MATCHES "ARM")
set(TIM_VX_ARCH "aarch64")
elseif (${TENGINE_TARGET_PROCESSOR} MATCHES "X86")
set(TIM_VX_ARCH "x86_64")
else()
message(FATAL_ERROR "Tengine: Unsupported processor: ${TENGINE_TARGET_PROCESSOR}")
endif()
if (TENGINE_ENABLE_TIM_VX_INTEGRATION)
set(VSI_TIM_NAME "tim_vx_internal")
set(VSI_TIM_VX_BASE "${CMAKE_CURRENT_SOURCE_DIR}/dev/tim-vx/src/tim/vx")
aux_source_directory(${VSI_TIM_VX_BASE} VSI_TIM_VX_SRC)
aux_source_directory(${VSI_TIM_VX_BASE}/ops VSI_TIM_OPS_SRC)
aux_source_directory(${VSI_TIM_VX_BASE}/internal/src VSI_TIM_INTERNAL_SRC)
aux_source_directory(${VSI_TIM_VX_BASE}/internal/src/kernel VSI_TIM_INTERNAL_KERNEL)
aux_source_directory(${VSI_TIM_VX_BASE}/internal/src/kernel/cl VSI_TIM_INTERNAL_KERNEL_CL)
aux_source_directory(${VSI_TIM_VX_BASE}/internal/src/kernel/cpu VSI_TIM_INTERNAL_KERNEL_CPU)
aux_source_directory(${VSI_TIM_VX_BASE}/internal/src/kernel/evis VSI_TIM_INTERNAL_KERNEL_EVIS)
aux_source_directory(${VSI_TIM_VX_BASE}/internal/src/kernel/vx VSI_TIM_INTERNAL_KERNEL_VX)
aux_source_directory(${VSI_TIM_VX_BASE}/internal/src/ops VSI_TIM_INTERNAL_OPS)
aux_source_directory(${VSI_TIM_VX_BASE}/internal/src/client VSI_TIM_INTERNAL_CLIENT)
aux_source_directory(${VSI_TIM_VX_BASE}/internal/src/libnnext VSI_TIM_INTERNAL_LIBNNEXT)
aux_source_directory(${VSI_TIM_VX_BASE}/internal/src/libnnext/ops/kernel VSI_TIM_INTERNAL_LIBNNEXT_OPS_KERNEL)
aux_source_directory(${VSI_TIM_VX_BASE}/internal/src/quantization VSI_TIM_INTERNAL_QUANTIZATION)
aux_source_directory(${VSI_TIM_VX_BASE}/internal/src/custom/ops VSI_TIM_INTERNAL_CUSTOM_OPS)
aux_source_directory(${VSI_TIM_VX_BASE}/internal/src/custom/ops/kernel VSI_TIM_INTERNAL_CUSTOM_OPS_KERNEL)
aux_source_directory(${VSI_TIM_VX_BASE}/internal/src/utils VSI_TIM_INTERNAL_UTILS)
list(APPEND VSI_TIM_VX_ALL_SRC
${VSI_TIM_VX_SRC}
${VSI_TIM_OPS_SRC}
${VSI_TIM_INTERNAL_SRC}
${VSI_TIM_INTERNAL_KERNEL}
${VSI_TIM_INTERNAL_KERNEL_CL}
${VSI_TIM_INTERNAL_KERNEL_CPU}
${VSI_TIM_INTERNAL_KERNEL_EVIS}
${VSI_TIM_INTERNAL_KERNEL_VX}
${VSI_TIM_INTERNAL_OPS}
${VSI_TIM_INTERNAL_CLIENT}
${VSI_TIM_INTERNAL_LIBNNEXT}
${VSI_TIM_INTERNAL_LIBNNEXT_OPS_KERNEL}
${VSI_TIM_INTERNAL_QUANTIZATION}
${VSI_TIM_INTERNAL_CUSTOM_OPS}
${VSI_TIM_INTERNAL_CUSTOM_OPS_KERNEL}
${VSI_TIM_INTERNAL_UTILS}
)
#message("VSI_TIM_VX_ALL_SRC=${VSI_TIM_VX_ALL_SRC}")
add_library(${VSI_TIM_NAME} STATIC ${VSI_TIM_VX_ALL_SRC})
target_link_directories(${VSI_TIM_NAME} PUBLIC ${CMAKE_SOURCE_DIR}/3rdparty/tim-vx/lib/${TIM_VX_ARCH})
target_link_libraries(${VSI_TIM_NAME} PRIVATE CLC GAL OpenVX OpenVXU VSC ArchModelSw NNArchPerf)
target_include_directories(${VSI_TIM_NAME} PRIVATE ${CMAKE_SOURCE_DIR}/3rdparty/tim-vx/include)
target_include_directories(${VSI_TIM_NAME} PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/dev/tim-vx/include)
target_include_directories(${VSI_TIM_NAME} PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/dev/tim-vx/include/tim/vx)
target_include_directories(${VSI_TIM_NAME} PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/dev/tim-vx/src/tim/vx)
target_include_directories(${VSI_TIM_NAME} PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/dev/tim-vx/src/tim/vx/internal/include)
set_target_properties(${VSI_TIM_NAME} PROPERTIES CXX_STANDARD_REQUIRED 14)
set_target_properties(${VSI_TIM_NAME} PROPERTIES CXX_STANDARD 14)
set(VSI_TIM_OVXLIB_API_ATTR "__attribute__\(\(visibility\(\"default\"\)\)\)")
target_compile_definitions(${VSI_TIM_NAME} PRIVATE "-DOVXLIB_API=${VSI_TIM_OVXLIB_API_ATTR}")
target_compile_options(${VSI_TIM_NAME} PRIVATE $<$<OR:$<COMPILE_LANGUAGE:C>,$<COMPILE_LANGUAGE:CXX>>:-fPIC>)
target_compile_options(${VSI_TIM_NAME} PRIVATE $<$<OR:$<COMPILE_LANGUAGE:C>,$<COMPILE_LANGUAGE:CXX>>:-O0>)
target_compile_options(${VSI_TIM_NAME} PRIVATE $<$<OR:$<COMPILE_LANGUAGE:C>,$<COMPILE_LANGUAGE:CXX>>:-g>)
endif()
list(APPEND TENGINE_INCLUDE_DIRS_PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/dev/tim-vx)
list(APPEND TENGINE_INCLUDE_DIRS_PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/dev/tim-vx/op)
list(APPEND TENGINE_INCLUDE_DIRS_PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/dev/tim-vx/include)
list(APPEND TENGINE_INCLUDE_DIRS_PRIVATE ${CMAKE_SOURCE_DIR}/3rdparty/tim-vx/include)
list(APPEND TENGINE_TIM_VX_LIB_DIRS ${CMAKE_SOURCE_DIR}/3rdparty/tim-vx/lib/${TIM_VX_ARCH})
file(GLOB TENGINE_BACKEND_TIM_VX_BASE "${CMAKE_CURRENT_SOURCE_DIR}/dev/tim-vx/*.cc")
file(GLOB TENGINE_BACKEND_TIM_VX_OPS "${CMAKE_CURRENT_SOURCE_DIR}/dev/tim-vx/op/*.cc")
endif ()
# add nVIDIA cudabackend support
if (TENGINE_ENABLE_CUDABACKEND)
enable_language(CUDA)
......@@ -299,7 +380,9 @@ if (${TENGINE_TARGET_PROCESSOR} MATCHES "ARM")
${TENGINE_BACKEND_TENSORRT_BASE}
${TENGINE_BACKEND_TENSORRT_OPS}
${TENGINE_BACKEND_CUDABACKEND_BASE}
${TENGINE_BACKEND_CUDABACKEND_OPS})
${TENGINE_BACKEND_CUDABACKEND_OPS}
${TENGINE_BACKEND_TIM_VX_BASE}
${TENGINE_BACKEND_TIM_VX_OPS})
elseif (${TENGINE_TARGET_PROCESSOR} MATCHES "X86")
add_library(${CMAKE_PROJECT_NAME} SHARED
${TENGINE_LIB_SRCS} ${TENGINE_FRONT_END_SRCS}
......@@ -313,7 +396,9 @@ elseif (${TENGINE_TARGET_PROCESSOR} MATCHES "X86")
${TENGINE_BACKEND_TENSORRT_BASE}
${TENGINE_BACKEND_TENSORRT_OPS}
${TENGINE_BACKEND_CUDABACKEND_BASE}
${TENGINE_BACKEND_CUDABACKEND_OPS})
${TENGINE_BACKEND_CUDABACKEND_OPS}
${TENGINE_BACKEND_TIM_VX_BASE}
${TENGINE_BACKEND_TIM_VX_OPS})
elseif (${TENGINE_TARGET_PROCESSOR} MATCHES "MIPS")
add_definitions(-mips64r2)
add_definitions(-mabi=64)
......@@ -336,8 +421,10 @@ else()
endif()
if (NOT TENGINE_FORCE_SKIP_OPENMP)
TENGINE_USE_LIB_OPENMP(${CMAKE_PROJECT_NAME})
endif()
TENGINE_USE_LIB_OPENMP(${CMAKE_PROJECT_NAME})
# show linking libraries
if(TENGINE_VERBOSE)
message (STATUS "TENGINE: 'TENGINE_LINKING_LIBRARIES_PRIVATE' is ${TENGINE_LINKING_LIBRARIES_PRIVATE}.")
......@@ -386,6 +473,15 @@ if (TENGINE_ENABLE_TENSORRT)
list(APPEND TENGINE_LINKING_LIBRARIES_PRIVATE cudart)
endif()
if (TENGINE_ENABLE_TIM_VX)
target_link_directories(${CMAKE_PROJECT_NAME} PUBLIC ${TENGINE_TIM_VX_LIB_DIRS})
if (TENGINE_ENABLE_TIM_VX_INTEGRATION)
list(APPEND TENGINE_LINKING_LIBRARIES_PRIVATE ${VSI_TIM_NAME})
else()
list(APPEND TENGINE_LINKING_LIBRARIES_PRIVATE tim-vx)
endif()
endif()
if (TENGINE_ENABLE_CUDABACKEND)
target_compile_options(${CMAKE_PROJECT_NAME} PRIVATE $<$<COMPILE_LANGUAGE:CUDA>: ${TENGINE_COMPILE_DEFINITION_CUDA_PRIVATE}>)
target_compile_options(${CMAKE_PROJECT_NAME} PRIVATE $<$<COMPILE_LANGUAGE:CUDA>: ${TENGINE_COMPILE_OPTIONS_CUDA_PRIVATE}>)
......
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* License); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/*
* Copyright (c) 2021, Open AI Lab
* Author: hhchen@openailab.com
*/
#include "timvx_executor.hpp"
extern "C"
{
#include "tengine_op.h"
#include "eltwise_param.h"
}
bool VXEngine::AddEltwisSumNode(struct ir_node* ir_node)
{
TLOG_INFO("Tengine TIM-VX: Support OP(%d) OP_RELU.\n", ir_node->idx);
struct ir_graph* ir_graph = ir_node->graph;
std::vector<std::shared_ptr<tim::vx::Tensor> > add_in_tensor(ir_node->input_num);
for (int i = 0; i < ir_node->input_num; i++)
{
struct ir_tensor* input_tensor = get_ir_graph_tensor(ir_graph, ir_node->input_tensors[i]);
add_in_tensor[i] = this->vx_tensor_map[input_tensor->idx];
fprintf(stderr,"\nadd_in_tensor.shape()\n");
for (int j = 0; j < 4; j++)
{
fprintf(stderr,"%d ",add_in_tensor[i]->GetShape()[j]);
}
}
struct ir_tensor* output_tensor = get_ir_graph_tensor(ir_graph, ir_node->output_tensors[0]);
fprintf(stderr,"\nadd_out_tensor.shape()\n");
for (int j = 0; j < 4; j++)
{
fprintf(stderr,"%d ",this->vx_tensor_map[output_tensor->idx]->GetShape()[j]);
}
eltwise_param* param = (eltwise_param*)ir_node->op.param_mem;
switch (param->type)
{
case ELT_SUM:
{
auto eltsum = graph->CreateOperation<tim::vx::ops::Add>();
(*eltsum)
.BindInputs(add_in_tensor)
.BindOutputs({ this->vx_tensor_map[output_tensor->idx] });
break;
}
default:
break;
}
return true;
}
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* License); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/*
* Copyright (c) 2021, Open AI Lab
* Author: hhchen@openailab.com
*/
#include "timvx_executor.hpp"
extern "C"
{
#include "tengine_op.h"
}
bool VXEngine::AddClipNode(struct ir_node* ir_node)
{
TLOG_INFO("Tengine TIM-VX: Support OP(%d) OP_RELU.\n", ir_node->idx);
struct ir_graph* ir_graph = ir_node->graph;
struct ir_tensor* input_tensor = get_ir_graph_tensor(ir_graph, ir_node->input_tensors[0]);
struct ir_tensor* output_tensor = get_ir_graph_tensor(ir_graph, ir_node->output_tensors[0]);
auto relu = this->graph->CreateOperation<tim::vx::ops::Relu6>();
(*relu).BindInput( this->vx_tensor_map[input_tensor->idx] )
.BindOutput({ this->vx_tensor_map[output_tensor->idx] });
return true;
}
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* License); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/*
* Copyright (c) 2021, Open AI Lab
* Author: hhchen@openailab.com
*/
#include "timvx_executor.hpp"
extern "C"
{
#include "tengine_op.h"
#include "concat_param.h"
}
bool VXEngine::AddConcatNode(struct ir_node* ir_node)
{
TLOG_INFO("Tengine TIM-VX: Support OP(%d) OP_CONCAT.\n", ir_node->idx);
struct ir_graph* ir_graph = ir_node->graph;
std::vector<std::shared_ptr<tim::vx::Tensor> > concat_in_tensor(ir_node->input_num);
for (int i = 0; i < ir_node->input_num; i++)
{
struct ir_tensor* input_tensor = get_ir_graph_tensor(ir_graph, ir_node->input_tensors[i]);
concat_in_tensor[i] = this->vx_tensor_map[input_tensor->idx];
}
struct concat_param* param = (struct concat_param*)ir_node->op.param_mem;
struct ir_tensor* output_tensor = get_ir_graph_tensor(ir_graph, ir_node->output_tensors[0]);
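// convert the NCHW concat axis into TIM-VX's reversed (innermost-first) dimension order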
auto concat = graph->CreateOperation<tim::vx::ops::Concat>(output_tensor->dim_num - param->axis - 1, ir_node->input_num);
(*concat)
.BindInputs(concat_in_tensor)
.BindOutputs({ this->vx_tensor_map[output_tensor->idx] });
return true;
}
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* License); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/*
* Copyright (c) 2021, Open AI Lab
* Author: hhchen@openailab.com
*/
#include "timvx_executor.hpp"
extern "C"
{
#include "tengine_op.h"
#include "convolution_param.h"
}
bool VXEngine::AddConvolutionNode(struct ir_node* ir_node)
{
TLOG_INFO("Tengine TIM-VX: Support OP(%d) OP_CONV.\n", ir_node->idx);
struct ir_graph* ir_graph = ir_node->graph;
struct conv_param* param = (struct conv_param*)ir_node->op.param_mem;
struct ir_tensor* input_tensor = get_ir_graph_tensor(ir_graph, ir_node->input_tensors[0]);
struct ir_tensor* weight_tensor = get_ir_graph_tensor(ir_graph, ir_node->input_tensors[1]);
struct ir_tensor* output_tensor = get_ir_graph_tensor(ir_graph, ir_node->output_tensors[0]);
tim::vx::PadType padtype;
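// detect SAME padding: check whether the given pads equal the symmetric padding implied by out = ceil(in / stride), for H and W separately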
int h = input_tensor->dims[2];
int out_h = (h - 1) / param->stride_h + 1;
int total_len_h = (out_h - 1) * param->stride_h + param->kernel_h;
int pad_num_h = total_len_h - h;
int pad_h0 = 0;
if (param->pad_h0 == pad_num_h / 2 && param->pad_h1 == pad_num_h - pad_num_h / 2)
{
pad_h0 = -1;
}
int w = input_tensor->dims[3];
int out_w = (w - 1) / param->stride_w + 1;
int total_len_w = (out_w - 1) * param->stride_w + param->kernel_w;
int pad_num_w = total_len_w - w;
int pad_w0 = 0;
if (param->pad_w0 == pad_num_w / 2 && param->pad_w1 == pad_num_w - pad_num_w / 2)
{
pad_w0 = -1;
}
if (pad_h0 == -1 && pad_w0 == -1)
{
TLOG_INFO("Log:tim::vx::PadType::SAME\n");
padtype = tim::vx::PadType::SAME;
}
else if(param->pad_h0 == 0 && param->pad_w0 == 0)
{
TLOG_INFO("Log:tim::vx::PadType::VALID\n");
padtype = tim::vx::PadType::VALID;
}
int multiplier = 0;
if (param->group == weight_tensor->dims[0])
multiplier = 1;
auto conv = this->graph->CreateOperation<tim::vx::ops::Conv2d>(
weight_tensor->dims[0], padtype,
std::array<uint32_t, 2>({ (unsigned int)param->kernel_h, (unsigned int)param->kernel_w }),
std::array<uint32_t, 2>({ (unsigned int)param->stride_h, (unsigned int)param->stride_w }),
std::array<uint32_t, 2>({ (unsigned int)param->dilation_h, (unsigned int)param->dilation_w }),
multiplier);
if (param->activation >= 0)
{
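// a fused activation cannot be attached to the Conv2d op itself, so route the conv result through a transient tensor and append a Relu/Relu6 node behind it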
tim::vx::Quantization tmp_quant(tim::vx::QuantType::ASYMMETRIC,
output_tensor->scale, output_tensor->zero_point);
tim::vx::ShapeType vx_shape;
std::vector<uint32_t> perm;
for (int i = output_tensor->dim_num - 1; i >= 0; i--)
{
vx_shape.push_back(output_tensor->dims[i]);
perm.push_back(output_tensor->dims[i]);
}
tim::vx::TensorSpec tmp_spec(tim::vx::DataType::UINT8, vx_shape,
tim::vx::TensorAttribute::TRANSIENT,
tmp_quant);
TLOG_INFO("Log:0append relu\n");
auto tmp_output = this->graph->CreateTensor(tmp_spec);
if (ir_node->input_num > 2)
{
TLOG_INFO("Log:Use Bias\n");
struct ir_tensor* bias_tensor = get_ir_graph_tensor(ir_graph, ir_node->input_tensors[2]);
(*conv)
.BindInputs({ this->vx_tensor_map[input_tensor->idx], this->vx_tensor_map[weight_tensor->idx], this->vx_tensor_map[bias_tensor->idx] })
.BindOutputs({ tmp_output });
}
else
{
(*conv)
.BindInputs({ this->vx_tensor_map[input_tensor->idx], this->vx_tensor_map[weight_tensor->idx] })
.BindOutputs({ tmp_output });
}
// this->vx_tensor_map[output_tensor->idx] = tmp_output;
if (param->activation == 0)
{
TLOG_INFO("Log:1.1append relu\n");
auto relu = this->graph->CreateOperation<tim::vx::ops::Relu>();
(*relu).BindInput( tmp_output )
.BindOutput({ this->vx_tensor_map[output_tensor->idx] });
}
else if (param->activation == 6)
{
TLOG_INFO("Log:2append relu6\n");
auto relu = this->graph->CreateOperation<tim::vx::ops::Relu6>();
(*relu).BindInput({ tmp_output })
.BindOutput({ this->vx_tensor_map[output_tensor->idx] });
}
}
else
{
if (ir_node->input_num > 2)
{
TLOG_INFO("Log:Use Bias\n");
struct ir_tensor* bias_tensor = get_ir_graph_tensor(ir_graph, ir_node->input_tensors[2]);
(*conv)
.BindInputs({ this->vx_tensor_map[input_tensor->idx], this->vx_tensor_map[weight_tensor->idx], this->vx_tensor_map[bias_tensor->idx] })
.BindOutputs({ this->vx_tensor_map[output_tensor->idx] });
}
else
{
(*conv)
.BindInputs({ this->vx_tensor_map[input_tensor->idx], this->vx_tensor_map[weight_tensor->idx] })
.BindOutputs({ this->vx_tensor_map[output_tensor->idx] });
}
}
return true;
}
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* License); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/*
* Copyright (c) 2021, Open AI Lab
* Author: hhchen@openailab.com
*/
#include "timvx_executor.hpp"
extern "C"
{
#include "tengine_op.h"
#include "concat_param.h"
}
bool VXEngine::AddDropoutNode(struct ir_node* ir_node)
{
TLOG_INFO("Tengine TIM-VX: Support OP(%d) OP_DROPOUT.\n", ir_node->idx);
struct ir_graph* ir_graph = ir_node->graph;
struct ir_tensor* input_tensor = get_ir_graph_tensor(ir_graph, ir_node->input_tensors[0]);
struct ir_tensor* output_tensor = get_ir_graph_tensor(ir_graph, ir_node->output_tensors[0]);
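// dropout is an identity at inference time, so lower it to a Reshape onto the output shape (dims reversed for TIM-VX)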
std::vector<uint32_t> perm;
for (int i = output_tensor->dim_num - 1; i >= 0; i--)
{
perm.push_back(output_tensor->dims[i]);
}
auto flatten = graph->CreateOperation<tim::vx::ops::Reshape>(perm);
(*flatten)
.BindInputs({ this->vx_tensor_map[input_tensor->idx] })
.BindOutputs({ this->vx_tensor_map[output_tensor->idx] });
return true;
}
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* License); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/*
* Copyright (c) 2021, Open AI Lab
* Author: hhchen@openailab.com
*/
#include "timvx_executor.hpp"
extern "C"
{
#include "tengine_op.h"
#include "convolution_param.h"
}
bool VXEngine::AddFullyConnectionNode(struct ir_node* ir_node)
{
TLOG_INFO("Tengine TIM-VX: Support OP(%d) OP_FC.\n", ir_node->idx);
struct ir_graph* ir_graph = ir_node->graph;
struct ir_tensor* input_tensor = get_ir_graph_tensor(ir_graph, ir_node->input_tensors[0]);
struct ir_tensor* weight_tensor = get_ir_graph_tensor(ir_graph, ir_node->input_tensors[1]);
struct ir_tensor* output_tensor = get_ir_graph_tensor(ir_graph, ir_node->output_tensors[0]);
auto fc = graph->CreateOperation<tim::vx::ops::FullyConnected>(
2, weight_tensor->dims[0]);
if (ir_node->input_num > 2)
{
TLOG_INFO("Log:Use Bias\n");
struct ir_tensor* bias_tensor = get_ir_graph_tensor(ir_graph, ir_node->input_tensors[2]);
(*fc)
.BindInputs({this->vx_tensor_map[input_tensor->idx], this->vx_tensor_map[weight_tensor->idx], this->vx_tensor_map[bias_tensor->idx]})
.BindOutputs({ this->vx_tensor_map[output_tensor->idx] });
}
else
{
(*fc)
.BindInputs({ this->vx_tensor_map[input_tensor->idx], this->vx_tensor_map[weight_tensor->idx] })
.BindOutputs({ this->vx_tensor_map[output_tensor->idx] });
}
return true;
}
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* License); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/*
* Copyright (c) 2021, Open AI Lab
* Author: hhchen@openailab.com
*/
#include "timvx_executor.hpp"
extern "C"
{
#include "tengine_op.h"
}
bool VXEngine::AddFlattenNode(struct ir_node* ir_node)
{
TLOG_INFO("Tengine TIM-VX: Support OP(%d) OP_FLATTEN.\n", ir_node->idx);
struct ir_graph* ir_graph = ir_node->graph;
struct ir_tensor* input_tensor = get_ir_graph_tensor(ir_graph, ir_node->input_tensors[0]);
struct ir_tensor* output_tensor = get_ir_graph_tensor(ir_graph, ir_node->output_tensors[0]);
std::vector<uint32_t> perm;
for (int i = output_tensor->dim_num - 1; i >= 0; i--)
{
perm.push_back(output_tensor->dims[i]);
}
auto flatten = graph->CreateOperation<tim::vx::ops::Reshape>(perm);
(*flatten)
.BindInputs({ this->vx_tensor_map[input_tensor->idx] })
.BindOutputs({ this->vx_tensor_map[output_tensor->idx] });
return true;
}
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* License); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/*
* Copyright (c) 2021, Open AI Lab
* Author: hhchen@openailab.com
*/
#include "timvx_executor.hpp"
extern "C"
{
#include "tengine_op.h"
#include "pooling_param.h"
}
bool VXEngine::AddPoolingNode(struct ir_node* ir_node)
{
TLOG_INFO("Tengine TIM-VX: Support OP(%d) OP_POOL.\n", ir_node->idx);
struct ir_graph* ir_graph = ir_node->graph;
struct pool_param* param = (struct pool_param*)ir_node->op.param_mem;
struct ir_tensor* input_tensor = get_ir_graph_tensor(ir_graph, ir_node->input_tensors[0]);
struct ir_tensor* output_tensor = get_ir_graph_tensor(ir_graph, ir_node->output_tensors[0]);
tim::vx::PoolType pooltype;
if (param->pool_method == 0)
{
pooltype = tim::vx::PoolType::MAX;
}
else
{
pooltype = tim::vx::PoolType::AVG;
}
tim::vx::PadType padtype;
int h = input_tensor->dims[2];
int out_h = (h - 1) / param->stride_h + 1;
int total_len_h = (out_h - 1) * param->stride_h + param->kernel_h;
int pad_num_h = total_len_h - h;
int pad_h0 = 0;
if (param->pad_h0 == pad_num_h / 2 && param->pad_h1 == pad_num_h - pad_num_h / 2)
{
pad_h0 = -1;
}
int w = input_tensor->dims[3];
int out_w = (w - 1) / param->stride_w + 1;
int total_len_w = (out_w - 1) * param->stride_w + param->kernel_w;
int pad_num_w = total_len_w - w;
int pad_w0 = 0;
if (param->pad_w0 == pad_num_w / 2 && param->pad_w1 == pad_num_w - pad_num_w / 2)
{
pad_w0 = -1;
}
if (pad_h0 == -1 && pad_w0 == -1)
{
TLOG_INFO("Log:tim::vx::PadType::SAME\n");
padtype = tim::vx::PadType::SAME;
}
else if(param->pad_h0 == 0 && param->pad_w0 == 0)
{
TLOG_INFO("Log:tim::vx::PadType::VALID\n");
padtype = tim::vx::PadType::VALID;
}
auto pool = graph->CreateOperation<tim::vx::ops::Pool2d>(
pooltype, padtype,
std::array<uint32_t, 2>({ (unsigned int)param->kernel_h, (unsigned int)param->kernel_w}),
std::array<uint32_t, 2>({(unsigned int)param->stride_h, (unsigned int)param->stride_w}));
(*pool).BindInputs({ this->vx_tensor_map[input_tensor->idx] })
.BindOutputs({ this->vx_tensor_map[output_tensor->idx] });
return true;
}
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* License); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/*
* Copyright (c) 2021, Open AI Lab
* Author: hhchen@openailab.com
*/
#include "timvx_executor.hpp"
extern "C"
{
#include "tengine_op.h"
}
bool VXEngine::AddReluNode(struct ir_node* ir_node)
{
TLOG_INFO("Tengine TIM-VX: Support OP(%d) OP_RELU.\n", ir_node->idx);
struct ir_graph* ir_graph = ir_node->graph;
struct ir_tensor* input_tensor = get_ir_graph_tensor(ir_graph, ir_node->input_tensors[0]);
struct ir_tensor* output_tensor = get_ir_graph_tensor(ir_graph, ir_node->output_tensors[0]);
auto relu = this->graph->CreateOperation<tim::vx::ops::Relu>();
(*relu).BindInput( this->vx_tensor_map[input_tensor->idx] )
.BindOutput({ this->vx_tensor_map[output_tensor->idx] });
return true;
}
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* License); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/*
* Copyright (c) 2021, Open AI Lab
* Author: lswang@openailab.com
*/
extern "C"
{
#include "vector.h"
#include "nn_device.h"
#include "tengine_ir.h"
#include "tengine_log.h"
#include "tengine_errno.h"
#include "dev_allocator.h"
#include "tengine_c_api.h"
}
#include "timvx_device.hpp"
#include "timvx_limit.hpp"
#include "timvx_graph.hpp"
extern "C"
{
int timvx_describe(struct dev_allocator* allocator, struct vector* allowed_ops, struct vector* blocked_ops, struct vector* precision);
int timvx_evaluation(struct dev_allocator* allocator, struct subgraph* sub_graph, struct vector* evolution_tensors, struct vector* evolution_nodes);
int timvx_allocate(struct dev_allocator* allocator, struct subgraph* sub_graph);
int timvx_release(struct dev_allocator* allocator, struct subgraph* sub_graph);
}
int timvx_describe(struct dev_allocator* allocator, struct vector* allowed_ops, struct vector* blocked_ops, struct vector* precision)
{
(void)allocator;
for (int op_type : timvx_supported_ops)
{
push_vector_data(allowed_ops, &op_type);
}
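/* every built-in op that is not listed in timvx_supported_ops is reported back as blocked */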
for (int i = 0, j = 0; i < OP_BUILTIN_LAST; i++)
{
int op_type = timvx_supported_ops[j];
if (op_type != i)
{
push_vector_data(blocked_ops, &i);
}
else
{
if (j < sizeof(timvx_supported_ops) / sizeof(timvx_supported_ops[0]))
j++;
}
}
int precision_var = TENGINE_DT_UINT8;
push_vector_data(precision, &precision_var);
precision_var = TENGINE_DT_FP16;
push_vector_data(precision, &precision_var);
precision_var = TENGINE_DT_FP32;
push_vector_data(precision, &precision_var);
return 0;
}
int timvx_evaluation(struct dev_allocator* allocator, struct subgraph* sub_graph, struct vector* evolution_tensors, struct vector* evolution_nodes)
{
// nothing to do for the TIM-VX device
(void)allocator;
(void)sub_graph;
(void)evolution_tensors;
(void)evolution_nodes;
return 0;
}
int timvx_allocate(struct dev_allocator* allocator, struct subgraph* sub_graph)
{
if (nullptr == allocator)
{
set_tengine_errno(EBADSLT);
return -1;
}
if (0 != strcmp(TIMVX_DEV_NAME, allocator->name))
{
set_tengine_errno(EBADSLT);
return -1;
}
/* set the correct input wait count: INPUT tensor is always ready */
sub_graph->input_wait_count = 0;
for (int i = 0; i < sub_graph->input_num; i++)
{
struct ir_tensor* tensor = get_ir_graph_tensor(sub_graph->graph, sub_graph->input_tensor_list[i]);
if (tensor->tensor_type == TENSOR_TYPE_VAR)
sub_graph->input_wait_count++;
}
return 0;
}
int timvx_release(struct dev_allocator* allocator, struct subgraph* sub_graph)
{
(void)sub_graph;
if (nullptr == allocator || 0 != strcmp(TIMVX_DEV_NAME, allocator->name))
{
return -1;
}
return 0;
}
extern "C"
{
static struct timvx_device timvx_dev = {
.base = {
.name = TIMVX_DEV_NAME,
.init = timvx_dev_init,
.prerun = timvx_dev_prerun,
.run = timvx_dev_run,
.postrun = timvx_dev_postrun,
.async_run = nullptr,
.async_wait = nullptr,
.release = timvx_dev_release,
.release_exec_graph = nullptr,},
.load_graph = nullptr,
.load_ir_graph = nullptr,
.unload_graph = nullptr,
};
static struct dev_allocator timvx_allocator = {
.name = TIMVX_DEV_NAME,
.describe = timvx_describe,
.evaluation = timvx_evaluation,
.allocate = timvx_allocate,
.release = timvx_release,
};
int register_timvx_device(void)
{
TLOG_INFO("Tengine plugin device %s is registered.\n", timvx_dev.base.name);
return register_nn_device(&timvx_dev.base);
}
#ifdef STANDLONE_MODE
void register_timvx_allocator(void)
#else
static void register_timvx_allocator(void)
#endif
{
TLOG_INFO("Tengine plugin allocator %s is registered.\n", timvx_allocator.name);
init_allocator_registry(&timvx_allocator);
}
#ifndef STANDLONE_MODE
REGISTER_NN_DEVICE(&timvx_dev.base);
REGISTER_DEV_ALLOCATOR(register_timvx_allocator);
#endif
}
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* License); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/*
* Copyright (c) 2021, Open AI Lab
* Author: lswang@openailab.com
*/
#ifndef __TIMVX_DEVICE_H__
#define __TIMVX_DEVICE_H__
#define TIMVX_DEV_NAME "TIMVX"
extern "C"
{
#include "tengine_c_api.h"
struct timvx_device
{
struct nn_device base;
int (*load_graph)(struct timvx_device* dev);
int (*load_ir_graph)(struct timvx_device* dev);
int (*unload_graph)(struct timvx_device* dev);
};
DLLEXPORT int register_timvx_device(void);
}
#endif
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* License); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/*
* Copyright (c) 2021, Open AI Lab
* Author: lswang@openailab.com
*/
#include "timvx_executor.hpp"
#include "timvx_helper.hpp"
extern "C"
{
#include "tengine_op.h"
#include "tengine_log.h"
}
#define DEFAULT_DEVICE_ID 0
#define DEFAULT_MAX_BATCH 128
VXEngine::VXEngine()
{
this->context = tim::vx::Context::Create();
this->graph = context->CreateGraph();
};
void VXEngine::VXTensorMap(struct ir_graph* ir_graph, int ir_tensor_idx, int spec_type)
{
auto iter = this->vx_tensor_map.find(ir_tensor_idx);
TLOG_INFO("Log:ir_tensor_idx %d\n",ir_tensor_idx);
TLOG_INFO("Log:#### 001 %d\n",spec_type);
if (this->vx_tensor_map.end() == iter)
{
struct ir_tensor* ir_tensor = get_ir_graph_tensor(ir_graph, ir_tensor_idx);
unsigned int* Dims = (unsigned int*)ir_tensor->dims;
tim::vx::DataType datatype;
switch(ir_tensor->data_type)
{
case (1):
TLOG_INFO("Log:tim::vx::DataType::FLOAT16\n");
datatype = tim::vx::DataType::FLOAT16;
break;
case (3):
TLOG_INFO("Log:tim::vx::DataType::UINT8\n");
datatype = tim::vx::DataType::UINT8;
break;
case (4):
TLOG_INFO("Log:tim::vx::DataType::INT32\n");
datatype = tim::vx::DataType::INT32;
break;
default:
fprintf(stderr,"Don't support this date type(%d)\n",ir_tensor->data_type);
break;
}
tim::vx::ShapeType vx_shape;
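// TIM-VX stores shapes with the innermost dimension first, so the NCHW dims are pushed in reverse order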
TLOG_INFO("Log:ir_tensor->dim_num %d\n",ir_tensor->dim_num);
struct ir_node* ir_node = get_ir_graph_node(ir_graph, ir_tensor->producer);
if (ir_node->op.op_type == OP_FC && ir_node->output_tensors[0] == ir_tensor_idx)
{
for (int i = 1; i >= 0; i--)
{
vx_shape.push_back(Dims[i]);
}
}
else
{
for (int i = ir_tensor->dim_num - 1; i >= 0; i--)
{
vx_shape.push_back(Dims[i]);
}
}
tim::vx::Quantization vx_quant(tim::vx::QuantType::ASYMMETRIC, ir_tensor->scale,
ir_tensor->zero_point);
std::shared_ptr<tim::vx::Tensor> vx_tensor;
TLOG_INFO("Log:#### 010 %d\n",spec_type);
if (spec_type == SPEC_TYPE_OUTPUT)
{
tim::vx::TensorSpec vx_spec(datatype, vx_shape,
tim::vx::TensorAttribute::OUTPUT, vx_quant);
vx_tensor = this->graph->CreateTensor(vx_spec);
}
else if (spec_type == SPEC_TYPE_DWCONV)
{
TLOG_INFO("Log:#### 111 SPEC_TYPE_DWCONV\n");
vx_shape[ir_tensor->dim_num - 2] = vx_shape[ir_tensor->dim_num - 1];
vx_shape[ir_tensor->dim_num - 1] = 1;
tim::vx::TensorSpec vx_spec(datatype, vx_shape,
tim::vx::TensorAttribute::CONSTANT, vx_quant);
vx_tensor = this->graph->CreateTensor(vx_spec, ir_tensor->data);
}
else if (ir_tensor->tensor_type == TENSOR_TYPE_INPUT )
{
tim::vx::TensorSpec vx_spec(datatype, vx_shape,
tim::vx::TensorAttribute::INPUT, vx_quant);
vx_tensor = this->graph->CreateTensor(vx_spec);
}
else if (ir_tensor->tensor_type == TENSOR_TYPE_VAR)
{
tim::vx::TensorSpec vx_spec(datatype, vx_shape,
tim::vx::TensorAttribute::TRANSIENT, vx_quant);
vx_tensor = this->graph->CreateTensor(vx_spec);
}
else if (ir_tensor->tensor_type == TENSOR_TYPE_CONST)
{
tim::vx::TensorSpec vx_spec(datatype, vx_shape,
tim::vx::TensorAttribute::CONSTANT, vx_quant);
vx_tensor = this->graph->CreateTensor(vx_spec, ir_tensor->data);
}
this->vx_tensor_map[ir_tensor_idx] = vx_tensor;
}
TLOG_INFO("\n");
}
int VXEngine::Build(struct subgraph* subgraph)
{
struct ir_graph* ir_graph = subgraph->graph;
for (int i = 0; i < subgraph->node_num; i++)
{
uint16_t node_id = subgraph->node_list[i];
struct ir_node* ir_node = get_ir_graph_node(ir_graph, node_id);
auto op_type = ir_node->op.op_type;
switch (op_type)
{
case OP_CLIP:
this->AddClipNode(ir_node);
break;
case OP_CONCAT:
this->AddConcatNode(ir_node);
break;
case OP_CONST:
case OP_INPUT:
continue;
case OP_CONV:
this->AddConvolutionNode(ir_node);
break;
case OP_DROPOUT:
this->AddDropoutNode(ir_node);
break;
case OP_ELTWISE:
this->AddEltwisSumNode(ir_node);
break;
case OP_FC:
this->AddFullyConnectionNode(ir_node);
break;
case OP_FLATTEN:
this->AddFlattenNode(ir_node);
break;
// case OP_PERMUTE:
// this->AddPermuteNode(ir_graph, ir_node);
// break;
case OP_POOL:
this->AddPoolingNode(ir_node);
break;
case OP_RELU:
this->AddReluNode(ir_node);
break;
// case OP_RESHAPE:
// this->AddReshapeNode(ir_graph, ir_node);
// break;
// case OP_SLICE:
// this->AddSliceNode(ir_graph, ir_node);
// break;
// case OP_SOFTMAX:
// this->AddSoftmaxNode(ir_graph, ir_node);
default:
fprintf(stderr, "Tengine TIM-VX: Cannot support OP(%d).\n", ir_node->idx);
break;
}
}
return 0;
}
int VXEngine::VXEnginePreRun(struct subgraph* subgraph)
{
struct ir_graph* ir_graph = subgraph->graph;
/* Add TIM-VX Tensor */
TLOG_INFO("Log:subgraph->node_num %d\n", subgraph->node_num);
for (uint8_t i = 0; i < subgraph->output_num; i++)
{
int ir_tensor_idx = subgraph->output_tensor_list[i];
this->VXTensorMap(ir_graph, ir_tensor_idx, SPEC_TYPE_OUTPUT);
}
for (int i = 0; i < subgraph->node_num; i++)
{
uint16_t node_id = subgraph->node_list[i];
struct ir_node* ir_node = get_ir_graph_node(ir_graph, node_id);
if (ir_node->op.op_type == OP_CONV)
{
struct conv_param* conv_param = ( struct conv_param* )ir_node->op.param_mem;
if (conv_param->group == conv_param->output_channel)
{
TLOG_INFO("Log:#### 000 SPEC_TYPE_DWCONV\n");
this->VXTensorMap(ir_graph, ir_node->input_tensors[1], SPEC_TYPE_DWCONV);
}
}
}
for (int i = 0; i < subgraph->node_num; i++)
{
uint16_t node_id = subgraph->node_list[i];
struct ir_node* ir_node = get_ir_graph_node(ir_graph, node_id);
for (int j = 0; j < ir_node->input_num; j++)
{
int ir_tensor_idx = ir_node->input_tensors[j];
this->VXTensorMap(ir_graph, ir_tensor_idx, 0);
}
for (int j = 0; j < ir_node->output_num; j++)
{
int ir_tensor_idx = ir_node->output_tensors[j];
this->VXTensorMap(ir_graph, ir_tensor_idx, 0);
}
}
/* Add TIM-VX Node */
this->Build(subgraph);
// fprintf(stderr,"subgraph->node_num %d\n",subgraph->node_num);
if (subgraph->node_num > 0)
{
if (!this->graph->Compile()) {
std::cout << "\nCompile graph fail." << std::endl;
return -1;
}
}
return 0;
};
int VXEngine::VXEngineRun(struct subgraph* subgraph)
{
struct ir_graph* ir_graph = subgraph->graph;
/* upload data */
// fprintf(stderr,"subgraph->input_num %d\n",subgraph->input_num);
if (subgraph->input_num > 0)
{
for (uint8_t i = 0; i < subgraph->input_num; i++)
{
int ir_tensor_idx = subgraph->input_tensor_list[i];
struct ir_tensor* ir_tensor = get_ir_graph_tensor(ir_graph, ir_tensor_idx);
if (!this->vx_tensor_map[ir_tensor_idx]->CopyDataToTensor(ir_tensor->data, ir_tensor->elem_num * ir_tensor->elem_size)) {
std::cout << "Copy input data fail." << std::endl;
return -1;
}
}
if (!this->graph->Run()) {
std::cout << "Run graph fail." << std::endl;
return -1;
}
/* download data */
for (uint8_t i = 0; i < subgraph->output_num; i++)
{
int ir_tensor_idx = subgraph->output_tensor_list[i];
struct ir_tensor* ir_tensor = get_ir_graph_tensor(ir_graph, ir_tensor_idx);
if (ir_tensor->data == NULL)
{
TLOG_INFO("Log:download data is NULL\n");
uint8_t* u8data = (uint8_t*)malloc(ir_tensor->elem_size * ir_tensor->elem_num);
ir_tensor->data = u8data;
}
if (!this->vx_tensor_map[ir_tensor_idx]->CopyDataFromTensor(ir_tensor->data))
{
TLOG_INFO("Log:Copy input data fail\n");
return -1;
}
}
}
return 0;
}
void VXEngine::VXEnginePostRun()
{
};
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* License); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/*
* Copyright (c) 2021, Open AI Lab
* Author: lswang@openailab.com
*/
#ifndef __TIMVX_TIMVX_EXECUTOR_HPP__
#define __TIMVX_TIMVX_EXECUTOR_HPP__
extern "C"
{
#include "tengine_ir.h"
#include "tengine_log.h"
}
#include <map>
#include <algorithm>
#include <iomanip>
#include <iostream>
#include <tuple>
#include <vector>
#include "tim/vx/context.h"
#include "tim/vx/graph.h"
#include "tim/vx/operation.h"
#include "tim/vx/ops/activations.h"
#include "tim/vx/ops/concat.h"
#include "tim/vx/ops/conv2d.h"
#include "tim/vx/ops/elementwise.h"
#include "tim/vx/ops/fullyconnected.h"
#include "tim/vx/ops/pool2d.h"
#include "tim/vx/ops/reshape.h"
#include "tim/vx/ops/softmax.h"
#include "tim/vx/tensor.h"
#include "convolution_param.h"
#define SPEC_TYPE_OUTPUT 1
#define SPEC_TYPE_DWCONV 2
typedef std::map<uint32_t, std::shared_ptr<tim::vx::Tensor>> dict_irt2vxt;
class VXEngine
{
public:
VXEngine();
~VXEngine() = default;
int VXEnginePreRun(struct subgraph* subgraph);
int VXEngineRun(struct subgraph* subgraph);
void VXEnginePostRun();
private:
int Build(struct subgraph* subgraph);
void VXTensorMap(struct ir_graph* ir_graph, int ir_tensor_idx, int spec_type);
bool AddClipNode(struct ir_node* ir_node);
bool AddConcatNode(struct ir_node* ir_node);
bool AddConvolutionNode(struct ir_node* ir_node);
bool AddDropoutNode(struct ir_node* ir_node);
bool AddEltwisSumNode(struct ir_node* ir_node);
bool AddFlattenNode(struct ir_node* ir_node);
bool AddFullyConnectionNode(struct ir_node* node);
bool AddPoolingNode(struct ir_node* ir_node);
bool AddReluNode(struct ir_node* ir_node);
public:
std::shared_ptr<tim::vx::Context> context;
std::shared_ptr<tim::vx::Graph> graph;
std::shared_ptr<tim::vx::Operation> ops;
private:
dict_irt2vxt vx_tensor_map;
dict_irt2vxt vx_node_map;
};
#endif
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* License); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/*
* Copyright (c) 2021, Open AI Lab
* Author: lswang@openailab.com
*/
#include "timvx_graph.hpp"
#include "timvx_executor.hpp"
extern "C"
{
#include "nn_device.h"
}
int timvx_dev_init(struct nn_device* dev)
{
(void)dev;
return 0;
}
int timvx_dev_prerun(struct nn_device* dev, struct subgraph* subgraph, int num_thread, int cpu_affinity, int mode)
{
fprintf(stderr,"TIM-VX prerun.\n");
subgraph->exec_graph = new VXEngine;
auto engine = (VXEngine*)subgraph->exec_graph;
return engine->VXEnginePreRun(subgraph);
}
int timvx_dev_run(struct nn_device* dev, struct subgraph* subgraph)
{
auto engine = (VXEngine*)subgraph->exec_graph;
return engine->VXEngineRun(subgraph);
}
int timvx_dev_postrun(struct nn_device* dev, struct subgraph* subgraph)
{
auto engine = (VXEngine*)subgraph->exec_graph;
engine->VXEnginePostRun();
delete engine;
return 0;
}
int timvx_dev_release(struct nn_device* dev)
{
(void)dev;
return 0;
}
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* License); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/*
* Copyright (c) 2021, Open AI Lab
* Author: lswang@openailab.com
*/
#ifndef __TIMVX_TIMVX_GRAPH_HPP__
#define __TIMVX_TIMVX_GRAPH_HPP__
extern "C"
{
#include "nn_device.h"
#include "tengine_ir.h"
int timvx_dev_init(struct nn_device* dev);
int timvx_dev_prerun(struct nn_device* dev, struct subgraph* subgraph, int num_thread, int cpu_affinity, int mode);
int timvx_dev_run(struct nn_device* dev, struct subgraph* subgraph);
int timvx_dev_postrun(struct nn_device* dev, struct subgraph* subgraph);
int timvx_dev_release(struct nn_device* dev);
}
#endif
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* License); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/*
* Copyright (c) 2021, Open AI Lab
* Author: lswang@openailab.com
*/
#pragma once
#include <memory>
#include <sstream>
#include <ctime>
#include <iomanip>
#include <iostream>
#include <ostream>
#include <string>
#ifdef _MSC_VER
#define FN_NAME __FUNCTION__
#else
#define FN_NAME __func__
#endif
#if (!defined(__ANDROID__) && defined(__aarch64__)) || defined(__QNX__)
#define ENABLE_DLA_API 1
#endif
#define CHECK(status) \
do \
{ \
auto ret = (status); \
if (ret != 0) \
{ \
Log(Loglevel, "TensorRT Engine", "Cuda failure: %d", ret); \
abort(); \
} \
} while (0)
constexpr long double operator"" _GiB(long double val)
{
return val * (1 << 30);
}
constexpr long double operator"" _MiB(long double val) { return val * (1 << 20); }
constexpr long double operator"" _KiB(long double val) { return val * (1 << 10); }
// These are necessary if we want to be able to write 1_GiB instead of 1.0_GiB.
// Since the return type is signed, -1_GiB will work as expected.
constexpr long long int operator"" _GiB(long long unsigned int val) { return val * (1 << 30); }
constexpr long long int operator"" _MiB(long long unsigned int val) { return val * (1 << 20); }
constexpr long long int operator"" _KiB(long long unsigned int val) { return val * (1 << 10); }
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* License); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/*
* Copyright (c) 2021, Open AI Lab
* Author: hhchen@openailab.com
*/
#pragma once
extern "C"
{
#include "tengine_op.h"
}
const int timvx_supported_ops[] = {
OP_CLIP,
OP_CONCAT,
OP_CONST,
OP_CONV,
OP_DROPOUT,
OP_ELTWISE,
OP_FC,
OP_FLATTEN,
OP_INPUT,
// OP_PERMUTE,
OP_POOL,
OP_RELU,
OP_RESHAPE,
OP_SLICE,
OP_SOFTMAX
};