diff --git a/doc/fluid/advanced_guide/inference_deployment/inference/native_infer.md b/doc/fluid/advanced_guide/inference_deployment/inference/native_infer.md index 0917528e9393b12e55462201502763fd4f20eeed..a360b0cd774f311c4dbaabd8c6575f033a6b2e91 100644 --- a/doc/fluid/advanced_guide/inference_deployment/inference/native_infer.md +++ b/doc/fluid/advanced_guide/inference_deployment/inference/native_infer.md @@ -174,7 +174,7 @@ float *output_d = output_t->data(PaddlePlace::kGPU, &output_size); ## C++预测样例编译测试 1. 下载或编译paddle预测库,参考[安装与编译C++预测库](./build_and_install_lib_cn.html)。 -2. 下载[预测样例](https://paddle-inference-dist.bj.bcebos.com/tensorrt_test/paddle_inference_sample_v1.6.tar.gz)并解压,进入`sample/inference`目录下。 +2. 下载[预测样例](https://paddle-inference-dist.bj.bcebos.com/tensorrt_test/paddle_inference_sample_v1.7.tar.gz)并解压,进入`sample/inference`目录下。 `inference` 文件夹目录结构如下: @@ -205,10 +205,11 @@ float *output_d = output_t->data(PaddlePlace::kGPU, &output_size); WITH_GPU=OFF USE_TENSORRT=OFF - # 按照运行环境设置预测库路径、CUDA库路径、CUDNN库路径、模型路径 + # 按照运行环境设置预测库路径、CUDA库路径、CUDNN库路径、TensorRT路径、模型路径 LIB_DIR=YOUR_LIB_DIR CUDA_LIB_DIR=YOUR_CUDA_LIB_DIR CUDNN_LIB_DIR=YOUR_CUDNN_LIB_DIR + TENSORRT_ROOT_DIR=YOUR_TENSORRT_ROOT_DIR MODEL_DIR=YOUR_MODEL_DIR ``` @@ -232,7 +233,7 @@ float *output_d = output_t->data(PaddlePlace::kGPU, &output_size); ### 多线程预测 Paddle Fluid支持通过在不同线程运行多个AnalysisPredictor的方式来优化预测性能,支持CPU和GPU环境。 -使用多线程预测的样例详见[C++预测样例编译测试](#C++预测样例编译测试)中下载的[预测样例](https://paddle-inference-dist.cdn.bcebos.com/tensorrt_test/paddle_trt_samples_v1.6.tar.gz)中的 +使用多线程预测的样例详见[C++预测样例编译测试](#C++预测样例编译测试)中下载的[预测样例](https://paddle-inference-dist.bj.bcebos.com/tensorrt_test/paddle_inference_sample_v1.7.tar.gz)中的 `thread_mobilenet_test.cc`文件。可以将`run.sh`中`mobilenet_test`替换成`thread_mobilenet_test`再执行 ``` diff --git a/doc/fluid/advanced_guide/inference_deployment/inference/native_infer_en.md b/doc/fluid/advanced_guide/inference_deployment/inference/native_infer_en.md index cea09c23132205e9bc87754734a88250230863da..4a8b012dee0b460248f8f9ebf5dba8b518ca6ce6 100644 --- a/doc/fluid/advanced_guide/inference_deployment/inference/native_infer_en.md +++ b/doc/fluid/advanced_guide/inference_deployment/inference/native_infer_en.md @@ -2,155 +2,226 @@ To make the deployment of inference model more convenient, a set of high-level APIs are provided in Fluid to hide diverse optimization processes in low level. -Inference library contains: - -- header file `paddle_inference_api.h` which defines all interfaces -- library file `libpaddle_fluid.so` or `libpaddle_fluid.a` - Details are as follows: -## PaddleTensor - -PaddleTensor defines basic format of input and output data for inference. Common fields are as follows: +## Use AnalysisPredictor to perform high-performance inference +Paddy fluid uses AnalysisPredictor to perform inference. AnalysisPredictor is a high-performance inference engine. Through the analysis of the calculation graph, the engine completes a series of optimization of the calculation graph (such as the integration of OP, the optimization of memory / graphic memory, the support of MKLDNN, TensorRT and other underlying acceleration libraries), which can greatly improve the inference performance. -- `name` is used to indicate the name of variable in model correspondent with input data. -- `shape` represents the shape of a Tensor. -- `data` is stored in `PaddleBuf` in method of consecutive storage. `PaddleBuf` can receieve outer data or independently `malloc` memory. You can refer to associated definitions in head file. 
-- `dtype` represents data type of Tensor. +In order to show the complete inference process, the following is a complete example of using AnalysisPredictor. The specific concepts and configurations involved will be detailed in the following sections. -## Use Config to create different engines +#### AnalysisPredictor sample -The low level of high-level API contains various optimization methods which are called engines. Switch between different engines is done by transferring different Config. - -- `NativeConfig` native engine, consisting of native forward operators of paddle, can naturally support all models trained by paddle. +``` c++ +#include "paddle_inference_api.h" -- `AnalysisConfig` TensorRT mixed engine. It is used to speed up GPU and supports [TensorRT] with subgraph. Moreover, this engine supports all paddle models and automatically slices part of computing subgraphs to TensorRT to speed up the process (WIP). For specific usage, please refer to [here](http://paddlepaddle.org/documentation/docs/zh/1.1/user_guides/howto/inference/paddle_tensorrt_infer.html). +namespace paddle { +void CreateConfig(AnalysisConfig* config, const std::string& model_dirname) { + // load model from disk + config->SetModel(model_dirname + "/model", + model_dirname + "/params"); + // config->SetModel(model_dirname); + // use SetModelBuffer if load model from memory + // config->SetModelBuffer(prog_buffer, prog_size, params_buffer, params_size); + config->EnableUseGpu(100 /*init graphic memory by 100MB*/, 0 /*set GPUID to 0*/); + + /* for cpu + config->DisableGpu(); + config->EnableMKLDNN(); // enable MKLDNN + config->SetCpuMathLibraryNumThreads(10); + */ + + config->SwitchUseFeedFetchOps(false); + // set to true if there are multiple inputs + config->SwitchSpecifyInputNames(true); + config->SwitchIrDebug(true); // If the visual debugging option is enabled, a dot file will be generated after each graph optimization process + // config->SwitchIrOptim(false); // The default is true. Turn off all optimizations if set to false + // config->EnableMemoryOptim(); // Enable memory / graphic memory reuse +} + +void RunAnalysis(int batch_size, std::string model_dirname) { + // 1. create AnalysisConfig + AnalysisConfig config; + CreateConfig(&config, model_dirname); + + // 2. create predictor based on config, and prepare input data + auto predictor = CreatePaddlePredictor(config); + int channels = 3; + int height = 224; + int width = 224; + float input[batch_size * channels * height * width] = {0}; + + // 3. build inputs + // uses ZeroCopy API here to avoid extra copying from CPU, improving performance + auto input_names = predictor->GetInputNames(); + auto input_t = predictor->GetInputTensor(input_names[0]); + input_t->Reshape({batch_size, channels, height, width}); + input_t->copy_from_cpu(input); + + // 4. run inference + CHECK(predictor->ZeroCopyRun()); + + // 5. 
get outputs + std::vector out_data; + auto output_names = predictor->GetOutputNames(); + auto output_t = predictor->GetOutputTensor(output_names[0]); + std::vector output_shape = output_t->shape(); + int out_num = std::accumulate(output_shape.begin(), output_shape.end(), 1, std::multiplies()); + + out_data.resize(out_num); + output_t->copy_to_cpu(out_data.data()); +} +} // namespace paddle + +int main() { + // the model can be downloaded from http://paddle-inference-dist.cdn.bcebos.com/tensorrt_test/mobilenet.tar.gz + paddle::RunAnalysis(1, "./mobilenet"); + return 0; +} +``` -## Process of Inference Deployment +## Use AnalysisConfig to manage inference configurations -In general, the steps are: +AnalysisConfig manages the inference configuration of AnalysisPredictor, providing model path setting, inference engine running device selection, and a variety of options to optimize the inference process. The configuration method is as follows: -1. Use appropriate configuration to create `PaddlePredictor` -2. Create `PaddleTensor` for input and transfer it into `PaddlePredictor` -3. `PaddleTensor` for fetching output +#### General optimizing configuration +``` c++ +config->SwitchIrOptim(true); // Enable analysis and optimization of calculation graph,including OP fusion, etc +config->EnableMemoryOptim(); // Enable memory / graphic memory reuse +``` +**Note:** Using ZeroCopyTensor requires following setting: +``` c++ +config->SwitchUseFeedFetchOps(false); // disable feed and fetch OP +``` -The complete process of implementing a simple model is shown below with part of details omitted. +#### set model and param path +When loading the model from disk, there are two ways to set the path of AnalysisConfig to load the model and parameters according to the storage mode of the model and parameter file: -```c++ -#include "paddle_inference_api.h" +* Non combined form: when there is a model file and multiple parameter files under the model folder 'model_dir', the path of the model folder is passed in. The default name of the model file is'__model_'. +``` c++ +config->SetModel("./model_dir"); +``` -// create a config and modify associated options -paddle::NativeConfig config; -config.model_dir = "xxx"; -config.use_gpu = false; -// create a native PaddlePredictor -auto predictor = - paddle::CreatePaddlePredictor(config); -// create input tensor -int64_t data[4] = {1, 2, 3, 4}; -paddle::PaddleTensor tensor; -tensor.shape = std::vector({4, 1}); -tensor.data.Reset(data, sizeof(data)); -tensor.dtype = paddle::PaddleDType::INT64; -// create output tensor whose memory is reusable -std::vector outputs; -// run inference -CHECK(predictor->Run(slots, &outputs)); -// fetch outputs ... +* Combined form: when there is only one model file 'model' and one parameter file 'params' under the model folder' model_dir ', the model file and parameter file path are passed in. +``` c++ +config->SetModel("./model_dir/model", "./model_dir/params"); ``` At compile time, it is proper to co-build with `libpaddle_fluid.a/.so` . +#### Configure CPU inference +``` c++ +config->DisableGpu(); // disable GPU +config->EnableMKLDNN(); // enable MKLDNN, accelerating CPU inference +config->SetCpuMathLibraryNumThreads(10); // set number of threads of CPU Math libs, accelerating CPU inference if CPU cores are adequate +``` +#### Configure GPU inference +``` c++ +config->EnableUseGpu(100, 0); // initialize 100M graphic memory, using GPU ID 0 +config->GpuDeviceId(); // Returns the GPU ID being used +// Turn on TRT to improve GPU performance. 
You need to use library with tensorrt +config->EnableTensorRtEngine(1 << 20 /*workspace_size*/, + batch_size /*max_batch_size*/, + 3 /*min_subgraph_size*/, + AnalysisConfig::Precision::kFloat32 /*precision*/, + false /*use_static*/, + false /*use_calib_mode*/); +``` +## Use ZeroCopyTensor to manage I/O -## Adavanced Usage - -### memory management of input and output - `data` field of `PaddleTensor` is a `PaddleBuf`, used to manage a section of memory for copying data. - -There are two modes in term of memory management in `PaddleBuf` : - -1. Automatic allocation and manage memory - - ```c++ - int some_size = 1024; - PaddleTensor tensor; - tensor.data.Resize(some_size); - ``` - -2. Transfer outer memory - - ```c++ - int some_size = 1024; - // You can allocate outside memory and keep it available during the usage of PaddleTensor - void* memory = new char[some_size]; +ZeroCopyTensor is the input / output data structure of AnalysisPredictor. The use of zerocopytensor can avoid redundant data copy when preparing input and obtaining output, and improve inference performance. - tensor.data.Reset(memory, some_size); - // ... +**Note:** Using zerocopytensor, be sure to set `config->SwitchUseFeedFetchOps(false);`. - // You need to release memory manually to avoid memory leak +``` c++ +// get input/output tensor +auto input_names = predictor->GetInputNames(); +auto input_t = predictor->GetInputTensor(input_names[0]); +auto output_names = predictor->GetOutputNames(); +auto output_t = predictor->GetOutputTensor(output_names[0]); + +// reshape tensor +input_t->Reshape({batch_size, channels, height, width}); + +// Through the copy_from_cpu interface, the CPU data is prepared; through the copy_to_cpu interface, the output data is copied to the CPU +input_t->copy_from_cpu(input_data /*data pointer*/); +output_t->copy_to_cpu(out_data /*data pointer*/); + +// set LOD +std::vector> lod_data = {{0}, {0}}; +input_t->SetLoD(lod_data); + +// get Tensor data pointer +float *input_d = input_t->mutable_data(PaddlePlace::kGPU); // use PaddlePlace::kCPU when running inference on CPU +int output_size; +float *output_d = output_t->data(PaddlePlace::kGPU, &output_size); +``` - delete[] memory; +## C++ inference sample +1. Download or compile C++ Inference Library, refer to [Install and Compile C++ Inference Library](./build_and_install_lib_en.html). +2. Download [C++ inference sample](https://paddle-inference-dist.bj.bcebos.com/tensorrt_test/paddle_inference_sample_v1.7.tar.gz) and uncompress it , then enter `sample/inference` directory. + + `inference` directory structure is as following: + + ``` shell + inference + ├── CMakeLists.txt + ├── mobilenet_test.cc + ├── thread_mobilenet_test.cc + ├── mobilenetv1 + │ ├── model + │ └── params + ├── run.sh + └── run_impl.sh + ``` + + - `mobilenet_test.cc` is the source code for single-thread inference. + - `thread_mobilenet_test.cc` is the source code for multi-thread inference. + - `mobilenetv1` is the model directory. + - `run.sh` is the script for running inference. + +3. Configure script: + + Before running, we need to configure script `run.sh` as following: + + ``` shell + # set whether to enable MKL, GPU or TensorRT. 
Enabling TensorRT requires WITH_GPU being ON + WITH_MKL=ON + WITH_GPU=OFF + USE_TENSORRT=OFF + + # set path to CUDA lib dir, CUDNN lib dir, TensorRT root dir and model dir + LIB_DIR=YOUR_LIB_DIR + CUDA_LIB_DIR=YOUR_CUDA_LIB_DIR + CUDNN_LIB_DIR=YOUR_CUDNN_LIB_DIR + TENSORRT_ROOT_DIR=YOUR_TENSORRT_ROOT_DIR + MODEL_DIR=YOUR_MODEL_DIR ``` + + Please configure `run.sh` depending on your environment. -In the two modes, the first is more convenient while the second strictly controls memory management to facilitate integration with `tcmalloc` and other libraries. +4. Build and run the sample. -### Upgrade performance based on contrib::AnalysisConfig + ``` shell + sh run.sh + ``` -AnalyisConfig is at the stage of pre-release and protected by `namespace contrib` , which may be adjusted in the future. +## Performance tuning +### Tuning on CPU +1. If the CPU model allows, try to use the version with AVX and MKL. +2. You can try to use Intel's MKLDNN acceleration. +3. When the number of CPU cores available is enough, you can increase the num value in the setting `config->SetCpuMathLibraryNumThreads(num);`. -Similar to `NativeConfig` , `AnalysisConfig` can create a inference engine with high performance after a series of optimization, including analysis and optimization of computing graph as well as integration and revise for some important Ops, which **largely promotes the peformance of models, such as While, LSTM, GRU** . +### Tuning on GPU +1. You can try to open the TensorRT subgraph acceleration engine. Through the graph analysis, Paddle can automatically fuse certain subgraphs, and call NVIDIA's TensorRT for acceleration. For details, please refer to [Use Paddle-TensorRT Library for inference](./paddle_tensorrt_infer_en.html)。 -The usage of `AnalysisConfig` is similiar with that of `NativeConfig` but the former *only supports CPU at present and is supporting GPU more and more*. +### Tuning with multi-thread +Paddle Fluid supports optimizing prediction performance by running multiple AnalysisPredictors on different threads, and supports CPU and GPU environments. -```c++ -AnalysisConfig config; -config.SetModel(dirname); // set the directory of the model -config.EnableUseGpu(100, 0 /*gpu id*/); // use GPU,or -config.DisableGpu(); // use CPU -config.SwitchSpecifyInputNames(true); // need to appoint the name of your input -config.SwitchIrOptim(); // turn on the optimization switch,and a sequence of optimizations will be executed in operation -``` +sample of using multi-threads is `thread_mobilenet_test.cc` downloaded from [sample](https://paddle-inference-dist.bj.bcebos.com/tensorrt_test/paddle_inference_sample_v1.7.tar.gz). You can change `mobilenet_test` in `run.sh` to `thread_mobilenet_test` to run inference with multi-thread. -Note that input PaddleTensor needs to be allocated. Previous examples need to be revised as follows: - -```c++ -auto predictor = - paddle::CreatePaddlePredictor(config); // it needs AnalysisConfig here -// create input tensor -int64_t data[4] = {1, 2, 3, 4}; -paddle::PaddleTensor tensor; -tensor.shape = std::vector({4, 1}); -tensor.data.Reset(data, sizeof(data)); -tensor.dtype = paddle::PaddleDType::INT64; -tensor.name = "input0"; // name need to be set here ``` - -The subsequent execution process is totally the same with `NativeConfig` . - -### variable-length sequence input -When dealing with variable-length sequence input, you need to set LoD for `PaddleTensor` . - -``` c++ -# Suppose the sequence lengths are [3, 2, 4, 1, 2, 3] in order. 
-tensor.lod = {{0, - /*0 + 3=*/3, - /*3 + 2=*/5, - /*5 + 4=*/9, - /*9 + 1=*/10, - /*10 + 2=*/12, - /*12 + 3=*/15}}; +sh run.sh ``` -For more specific examples, please refer to[LoD-Tensor Instructions](../../../beginners_guide/basic_concept/lod_tensor_en.html) - -### Suggestion for Performance - -1. If the CPU type permits, it's best to use the versions with support for AVX and MKL. -2. Reuse input and output `PaddleTensor` to avoid frequent memory allocation resulting in low performance -3. Try to replace `NativeConfig` with `AnalysisConfig` to perform optimization for CPU or GPU inference - -## Code Demo - -[inference demos](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/inference/api/demo_ci) diff --git a/doc/fluid/advanced_guide/performance_improving/inference_improving/paddle_tensorrt_infer.md b/doc/fluid/advanced_guide/performance_improving/inference_improving/paddle_tensorrt_infer.md index a31ccdd541ea0c4c38cd1d5d436f548d6ef6994a..6ccdedbeae58859528a79f2659575146047b71e3 100644 --- a/doc/fluid/advanced_guide/performance_improving/inference_improving/paddle_tensorrt_infer.md +++ b/doc/fluid/advanced_guide/performance_improving/inference_improving/paddle_tensorrt_infer.md @@ -20,8 +20,8 @@ NVIDIA TensorRT 是一个高性能的深度学习预测库,可为深度学习 1. 从源码编译时,TensorRT预测库目前仅支持使用GPU编译,且需要设置编译选项TENSORRT_ROOT为TensorRT所在的路径。 2. Windows支持需要TensorRT 版本5.0以上。 3. Paddle-TRT目前仅支持固定输入shape。 -4. 若使用用户自行安装的TensorRT,需要手动在`NvInfer.h`文件中为`class IPluginFactory`和`class IGpuAllocator`分别添加虚析构函数: - ``` c++ +4. 下载安装TensorRT后,需要手动在`NvInfer.h`文件中为`class IPluginFactory`和`class IGpuAllocator`分别添加虚析构函数: + ``` c++ virtual ~IPluginFactory() {}; virtual ~IGpuAllocator() {}; ``` @@ -59,8 +59,9 @@ config->EnableTensorRtEngine(1 << 20 /* workspace_size*/, ## Paddle-TRT样例编译测试 -1. 下载或编译带有 TensorRT 的paddle预测库,参考[安装与编译C++预测库](../../inference_deployment/inference/build_and_install_lib_cn.html)。 -2. 下载[预测样例](https://paddle-inference-dist.bj.bcebos.com/tensorrt_test/paddle_inference_sample_v1.6.tar.gz)并解压,进入`sample/paddle-TRT`目录下。 +1. 下载或编译带有 TensorRT 的paddle预测库,参考[安装与编译C++预测库](./build_and_install_lib_cn.html)。 +2. 从[NVIDIA官网](https://developer.nvidia.com/nvidia-tensorrt-download)下载对应本地环境中cuda和cudnn版本的TensorRT,需要登陆NVIDIA开发者账号。 +3. 下载[预测样例](https://paddle-inference-dist.bj.bcebos.com/tensorrt_test/paddle_inference_sample_v1.7.tar.gz)并解压,进入`sample/paddle-TRT`目录下。 `paddle-TRT` 文件夹目录结构如下: @@ -85,8 +86,8 @@ config->EnableTensorRtEngine(1 << 20 /* workspace_size*/, 在这里假设样例所在的目录为 `SAMPLE_BASE_DIR/sample/paddle-TRT` -3. 配置编译与运行脚本 - +4. 配置编译与运行脚本 + 编译运行预测样例之前,需要根据运行环境配置编译与运行脚本`run.sh`。`run.sh`的选项与路径配置的部分如下: ```shell @@ -95,20 +96,17 @@ config->EnableTensorRtEngine(1 << 20 /* workspace_size*/, WITH_GPU=ON USE_TENSORRT=ON - # 按照运行环境设置预测库路径、CUDA库路径、CUDNN库路径、模型路径 + # 按照运行环境设置预测库路径、CUDA库路径、CUDNN库路径、TensorRT路径、模型路径 LIB_DIR=YOUR_LIB_DIR CUDA_LIB_DIR=YOUR_CUDA_LIB_DIR CUDNN_LIB_DIR=YOUR_CUDNN_LIB_DIR + TENSORRT_ROOT_DIR=YOUR_TENSORRT_ROOT_DIR MODEL_DIR=YOUR_MODEL_DIR ``` 按照实际运行环境配置`run.sh`中的选项开关和所需lib路径。 -4. 编译与运行样例 - - ``` shell - sh run.sh - ``` +5. 
编译与运行样例 ## Paddle-TRT INT8使用 diff --git a/doc/fluid/advanced_guide/performance_improving/inference_improving/paddle_tensorrt_infer_en.md b/doc/fluid/advanced_guide/performance_improving/inference_improving/paddle_tensorrt_infer_en.md index 978aef0eb3e2ca0ae0a362a6f77618dd67b6c5b1..83f41bb81c5a2a1caff6b527cbb4794d400b9b01 100644 --- a/doc/fluid/advanced_guide/performance_improving/inference_improving/paddle_tensorrt_infer_en.md +++ b/doc/fluid/advanced_guide/performance_improving/inference_improving/paddle_tensorrt_infer_en.md @@ -1,155 +1,201 @@ -# Use Paddle-TensorRT Library for inference - -NVIDIA TensorRT is a is a platform for high-performance deep learning inference. It delivers low latency and high throughput for deep learning inference application. -Subgraph is used in PaddlePaddle to preliminarily integrate TensorRT, which enables TensorRT module to enhance inference performance of paddle models. The module is still under development. Currently supported models are AlexNet, MobileNet, ResNet50, VGG19, ResNext, Se-ReNext, GoogleNet, DPN, ICNET, Deeplabv3 Mobile, Net-SSD and so on. We will introduce the obtaining, usage and theory of Paddle-TensorRT library in this documentation. - -## Contents - - [compile Paddle-TRT inference libraries](#compile Paddle-TRT inference libraries) - - [Paddle-TRT interface usage](#Paddle-TRT interface usage) - - [Paddle-TRT example compiling test](#Paddle-TRT example compiling test) - - [Paddle-TRT INT8 usage](#Paddle-TRT_INT8 usage) - - [Paddle-TRT subgraph operation principle](#Paddle-TRT subgraph operation principle) - -## compile Paddle-TRT inference libraries - -**Use Docker to build inference libraries** - -TRT inference libraries can only be compiled using GPU. - -1. Download Paddle - - ``` - git clone https://github.com/PaddlePaddle/Paddle.git - ``` - -2. Get docker image - - ``` - nvidia-docker run --name paddle_trt -v $PWD/Paddle:/Paddle -it hub.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash - ``` - -3. Build Paddle TensorRT - - ``` - # perform the following operations in docker container - cd /Paddle - mkdir build - cd build - cmake .. \ - -DWITH_FLUID_ONLY=ON \ - -DWITH_MKL=OFF \ - -DWITH_MKLDNN=OFF \ - -DCMAKE_BUILD_TYPE=Release \ - -DWITH_PYTHON=OFF \ - -DTENSORRT_ROOT=/usr \ - -DON_INFER=ON - - # build - make -j - # generate inference library - make inference_lib_dist -j - ``` - -## Paddle-TRT interface usage - -[`paddle_inference_api.h`]('https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/api/paddle_inference_api.h') defines all APIs of TensorRT. - -General steps are as follows: -1. Create appropriate AnalysisConfig. -2. Create `PaddlePredictor` based on config. -3. Create input tensor. -4. Get output tensor and output result. - -A complete process is shown below: - -```c++ -#include "paddle_inference_api.h" - -namespace paddle { -using paddle::AnalysisConfig; - -void RunTensorRT(int batch_size, std::string model_dirname) { - // 1. Create MixedRTConfig - AnalysisConfig config(model_dirname); - // config->SetModel(model_dirname + "/model", - // model_dirname + "/params"); - - config->EnableUseGpu(100, 0 /*gpu_id*/); - config->EnableTensorRtEngine(1 << 20 /*work_space_size*/, batch_size /*max_batch_size*/); - - // 2. Create predictor based on config - auto predictor = CreatePaddlePredictor(config); - // 3. 
Create input tensor - int height = 224; - int width = 224; - float data[batch_size * 3 * height * width] = {0}; - - PaddleTensor tensor; - tensor.shape = std::vector({batch_size, 3, height, width}); - tensor.data = PaddleBuf(static_cast(data), - sizeof(float) * (batch_size * 3 * height * width)); - tensor.dtype = PaddleDType::FLOAT32; - std::vector paddle_tensor_feeds(1, tensor); - - // 4. Create output tensor - std::vector outputs; - // 5. Inference - predictor->Run(paddle_tensor_feeds, &outputs, batch_size); - - const size_t num_elements = outputs.front().data.length() / sizeof(float); - auto *data = static_cast(outputs.front().data.data()); - for (size_t i = 0; i < num_elements; i++) { - std::cout << "output: " << data[i] << std::endl; - } -} -} // namespace paddle - -int main() { - // Download address of the model http://paddle-inference-dist.cdn.bcebos.com/tensorrt_test/mobilenet.tar.gz - paddle::RunTensorRT(1, "./mobilenet"); - return 0; -} -``` -The compilation process is [here](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/inference/api/demo_ci) - -## Paddle-TRT INT8 usage - - 1. Paddle-TRT INT8 introduction -The parameters of the neural network are redundant to some extent. In many tasks, we can turn the Float32 model into Int8 model on the premise of precision. At present, Paddle-TRT supports to turn the trained Float32 model into Int8 model off line. The specific processes are as follows: 1)**Create the calibration table**. We prepare about 500 real input data, and input the data to the model. Paddle-TRT will count the range information of each op input and output value in the model, and record in the calibration table. The information can reduce the information loss during model transformation. 2)After creating the calibration table, run the model again, **Paddle-TRT will load the calibration table automatically**, and conduct the inference in the INT8 mode. - - 2. compile and test the INT8 example - - ```shell - cd SAMPLE_BASE_DIR/sample - # sh run_impl.sh {the address of inference libraries} {the name of test script} {model directories} - # We generate 500 input data to simulate the process, and it's suggested that you use real example for experiment. - sh run_impl.sh BASE_DIR/fluid_inference_install_dir/ fluid_generate_calib_test SAMPLE_BASE_DIR/sample/mobilenetv1 - - ``` - - After the running period, there will be a new file named trt_calib_* under the `SAMPLE_BASE_DIR/sample/build/mobilenetv1` model directory, which is the calibration table. - - ``` shell - # conduct INT8 inference - # copy the model file with calibration tables to a specific address - cp -rf SAMPLE_BASE_DIR/sample/build/mobilenetv1 SAMPLE_BASE_DIR/sample/mobilenetv1_calib - sh run_impl.sh BASE_DIR/fluid_inference_install_dir/ fluid_int8_test SAMPLE_BASE_DIR/sample/mobilenetv1_calib - ``` - -## Paddle-TRT subgraph operation principle - -Subgraph is used to integrate TensorRT in PaddlePaddle. After model is loaded, neural network can be represented as a computing graph composed of variables and computing nodes. Functions Paddle TensorRT implements are to scan the whole picture, discover subgraphs that can be optimized with TensorRT and replace them with TensorRT nodes. During the inference of model, Paddle will call TensorRT library to optimize TensorRT nodes and call native library of Paddle to optimize other nodes. 
During the inference, TensorRT can integrate Op horizonally and vertically to filter redundant Ops and is able to choose appropriate kernel for specific Op in specific platform to speed up the inference of model. - -A simple model expresses the process : - -**Original Network** -

-
-**Transformed Network**
-
- -We can see in the Original Network that the green nodes represent nodes supported by TensorRT, the red nodes represent variables in network and yellow nodes represent nodes which can only be operated by native functions in Paddle. Green nodes in original network are extracted to compose subgraph which is replaced by a single TensorRT node to be transformed into `block-25` node in network. When such nodes are encountered during the runtime, TensorRT library will be called to execute them. +# Use Paddle-TensorRT Library for inference + +NVIDIA TensorRT is a is a platform for high-performance deep learning inference. It delivers low latency and high throughput for deep learning inference application. +Subgraph is used in PaddlePaddle to preliminarily integrate TensorRT, which enables TensorRT module to enhance inference performance of paddle models. The module is still under development. Currently supported models are as following: + +|classification|detection|segmentation| +|---|---|---| +|mobilenetv1|yolov3|ICNET| +|resnet50|SSD|| +|vgg16|mask-rcnn|| +|resnext|faster-rcnn|| +|AlexNet|cascade-rcnn|| +|Se-ResNext|retinanet|| +|GoogLeNet|mobilenet-SSD|| +|DPN||| + +We will introduce the obtaining, usage and theory of Paddle-TensorRT library in this documentation. + +**Note:** + +1. When compiling from source, TensorRT library currently only supports GPU compilation, and you need to set the compilation option TensorRT_ROOT to the path where tensorrt is located. +2. Windows support requires TensorRT version 5.0 or higher. +3. Paddle-TRT currently only supports fixed input shape. +4. After downloading and installing tensorrt, you need to manually add virtual destructors for `class IPluginFactory` and `class IGpuAllocator` in the `NvInfer.h` file: + ``` c++ + virtual ~IPluginFactory() {}; + virtual ~IGpuAllocator() {}; + ``` + +## Paddle-TRT interface usage + +When using AnalysisPredictor, we enable Paddle-TRT by setting + +``` c++ +config->EnableTensorRtEngine(1 << 20 /* workspace_size*/, + batch_size /* max_batch_size*/, + 3 /* min_subgraph_size*/, + AnalysisConfig::Precision::kFloat32 /* precision*/, + false /* use_static*/, + false /* use_calib_mode*/); +``` +The details of this interface is as following: + +- **`workspace_size`**: type:int, default is 1 << 20. Sets the max workspace size of TRT. TensorRT will choose kernels under this constraint. +- **`max_batch_size`**: type:int, default is 1. Sets the max batch size. Batch sizes during runtime cannot exceed this value. +- **`min_subgraph_size`**: type:int, default is 3. Subgraph is used to integrate TensorRT in PaddlePaddle. To avoid low performance, Paddle-TRT is only enabled when th number of nodes in th subgraph is more than `min_subgraph_size`. +- **`precision`**: type:`enum class Precision {kFloat32 = 0, kHalf, kInt8,};`, default is `AnalysisConfig::Precision::kFloat32`. Sets the precision of TRT, supporting FP32(kFloat32), FP16(kHalf), Int8(kInt8). Using Paddle-TRT int8 calibration requires setting `precision` to `AnalysisConfig::Precision::kInt8`, and `use_calib_mode` to true. +- **`use_static`**: type:bool, default is false. If set to true, Paddle-TRT will serialize optimization information to disk, to deserialize next time without optimizing again. +- **`use_calib_mode`**: type:bool, default is false. Using Paddle-TRT int8 calibration requires setting this option to true. + +**Note:** Paddle-TRT currently only supports fixed input shape. + +## Paddle-TRT example compiling test + +1. 
Download or compile the Paddle inference library with TensorRT support; refer to [Install and Compile C++ Inference Library](./build_and_install_lib_en.html).
+2. Download NVIDIA TensorRT (matching the CUDA and cuDNN versions in your local environment) from the [NVIDIA TensorRT site](https://developer.nvidia.com/nvidia-tensorrt-download); an NVIDIA developer account is required.
+3. Download the [Paddle inference sample](https://paddle-inference-dist.bj.bcebos.com/tensorrt_test/paddle_inference_sample_v1.7.tar.gz), uncompress it, and enter the `sample/paddle-TRT` directory.
+
+   The `paddle-TRT` directory structure is as follows:
+
+   ```
+   paddle-TRT
+   ├── CMakeLists.txt
+   ├── mobilenet_test.cc
+   ├── fluid_generate_calib_test.cc
+   ├── fluid_int8_test.cc
+   ├── mobilenetv1
+   │   ├── model
+   │   └── params
+   ├── run.sh
+   └── run_impl.sh
+   ```
+
+   - `mobilenet_test.cc` is the C++ source code for inference with Paddle-TRT.
+   - `fluid_generate_calib_test.cc` is the C++ source code that runs Paddle-TRT INT8 calibration and generates the calibration table.
+   - `fluid_int8_test.cc` is the C++ source code for Paddle-TRT INT8 inference.
+   - `mobilenetv1` is the model directory.
+   - `run.sh` is the script for running inference.
+
+   Here we assume that the sample directory is `SAMPLE_BASE_DIR/sample/paddle-TRT`. Before building, configure the option switches and paths in `run.sh` according to your environment:
+
+   ``` shell
+   # set whether to enable MKL, GPU or TensorRT. Enabling TensorRT requires WITH_GPU being ON
+   WITH_MKL=ON
+   WITH_GPU=OFF
+   USE_TENSORRT=OFF
+
+   # set path to CUDA lib dir, CUDNN lib dir, TensorRT root dir and model dir
+   LIB_DIR=YOUR_LIB_DIR
+   CUDA_LIB_DIR=YOUR_CUDA_LIB_DIR
+   CUDNN_LIB_DIR=YOUR_CUDNN_LIB_DIR
+   TENSORRT_ROOT_DIR=YOUR_TENSORRT_ROOT_DIR
+   MODEL_DIR=YOUR_MODEL_DIR
+   ```
+
+4. Build and run the sample.
+
+   ``` shell
+   sh run.sh
+   ```
+
+## Paddle-TRT INT8 usage
+
+1. Paddle-TRT INT8 introduction
+   The parameters of a neural network are redundant to some extent. For many tasks, a Float32 model can be turned into an INT8 model without losing accuracy. At present, Paddle-TRT supports turning a trained Float32 model into an INT8 model offline. The specific steps are as follows:
+
+   1) **Create the calibration table**. Prepare about 500 real input samples and feed them to the model. Paddle-TRT collects the range information of every op's input and output values and records it in the calibration table. This information reduces the accuracy loss during model conversion.
+
+   2) After the calibration table has been created, run the model again. **Paddle-TRT loads the calibration table automatically** and performs inference in INT8 mode (a configuration sketch is given at the end of this section).
+
+2. Compile and test the INT8 example
+
+   Change `mobilenet_test` in `run.sh` to `fluid_generate_calib_test` and run
+
+   ``` shell
+   sh run.sh
+   ```
+
+   We generate 500 fake inputs to simulate the calibration process; it is suggested that you use real data for your own experiments. After the run finishes, a new file named trt_calib_* will appear under the `SAMPLE_BASE_DIR/sample/paddle-TRT/build/mobilenetv1/_opt_cache` directory, which is the calibration table.
+
+   Then copy the model directory together with the calibration information to the target path:
+
+   ``` shell
+   cp -rf SAMPLE_BASE_DIR/sample/paddle-TRT/build/mobilenetv1/ SAMPLE_BASE_DIR/sample/paddle-TRT/mobilenetv1_calib
+   ```
+
+   Change `fluid_generate_calib_test` in `run.sh` to `fluid_int8_test`, change the model directory to `SAMPLE_BASE_DIR/sample/paddle-TRT/mobilenetv1_calib`, and run
+
+   ``` shell
+   sh run.sh
+   ```
+
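+The sketch below illustrates, using only the `AnalysisConfig` options described earlier in this document, how the calibration run and the INT8 inference run differ in configuration. It is a minimal illustration rather than the exact code of `fluid_generate_calib_test.cc` or `fluid_int8_test.cc`; the model path, batch size, and helper name are placeholders.
+
+``` c++
+#include <string>
+#include "paddle_inference_api.h"
+
+// Hypothetical helper: build a Paddle-TRT INT8 config.
+// Pass use_calib_mode = true together with Precision::kInt8; on the first
+// run Paddle-TRT generates the calibration table, and on later runs it
+// loads the table automatically and performs inference in INT8 mode.
+paddle::AnalysisConfig MakeTrtInt8Config(const std::string& model_dir) {
+  paddle::AnalysisConfig config;
+  config.SetModel(model_dir);                      // placeholder model path
+  config.EnableUseGpu(100 /*init graphic memory, MB*/, 0 /*GPU id*/);
+  config.SwitchUseFeedFetchOps(false);             // required for ZeroCopyTensor
+  config.EnableTensorRtEngine(1 << 20 /*workspace_size*/,
+                              1 /*max_batch_size, placeholder*/,
+                              3 /*min_subgraph_size*/,
+                              paddle::AnalysisConfig::Precision::kInt8,
+                              false /*use_static*/,
+                              true /*use_calib_mode*/);
+  return config;
+}
+```
+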

+## Paddle-TRT subgraph operation principle
+
+ Subgraph is used to integrate TensorRT in PaddlePaddle. After a model is loaded, the neural network is represented as a computing graph composed of variables and computing nodes. Paddle-TensorRT scans the whole graph, discovers subgraphs that can be optimized with TensorRT, and replaces them with TensorRT nodes. During inference, Paddle calls the TensorRT library for the TensorRT nodes and Paddle's native kernels for the remaining nodes. TensorRT can fuse Ops horizontally and vertically to filter out redundant Ops, and it chooses an appropriate kernel for a specific Op on a specific platform, which speeds up inference.
+
+A simple model illustrates the process:
+
+**Original Network**
+
+**Transformed Network**
+
+ We can see in the Original Network that the green nodes represent nodes supported by TensorRT, the red nodes represent variables in the network, and the yellow nodes represent nodes that can only be executed by Paddle's native kernels. The green nodes in the original network are extracted to compose a subgraph, which is replaced by a single TensorRT node, shown as the `block-25` node in the transformed network. When such a node is encountered at runtime, the TensorRT library is called to execute it.
+
+## Paddle-TRT benchmark
+
+### Test Environment
+- CPU: Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz; GPU: Tesla P4
+- TensorRT 4.0, CUDA 8.0, cuDNN v7
+- Models: ResNet50, MobileNet, ResNet101, Inception V3
+
+### Test set
+**PaddlePaddle, PyTorch, TensorFlow**
+
+- PaddlePaddle integrates TensorRT with subgraph, model [link](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models).
+- PyTorch uses original kernels, model [link1](https://github.com/pytorch/vision/tree/master/torchvision/models), [link2](https://github.com/marvis/pytorch-mobilenet).
+- We tested native TF and TF-TRT (the TF-TRT results did not meet expectations and will be supplemented later), model [link](https://github.com/tensorflow/models).
+
+#### ResNet50
+
+|batch_size|PaddlePaddle (ms)|PyTorch (ms)|TensorFlow (ms)|
+|---|---|---|---|
+|1|4.64117|16.3|10.878|
+|5|6.90622|22.9|20.62|
+|10|7.9758|40.6|34.36|
+
+#### MobileNet
+
+|batch_size|PaddlePaddle (ms)|PyTorch (ms)|TensorFlow (ms)|
+|---|---|---|---|
+|1|1.7541|7.8|2.72|
+|5|3.04666|7.8|3.19|
+|10|4.19478|14.47|4.25|
+
+#### ResNet101
+
+|batch_size|PaddlePaddle (ms)|PyTorch (ms)|TensorFlow (ms)|
+|---|---|---|---|
+|1|8.95767|22.48|18.78|
+|5|12.9811|33.88|34.84|
+|10|14.1463|61.97|57.94|
+
+#### Inception v3
+
+|batch_size|PaddlePaddle (ms)|PyTorch (ms)|TensorFlow (ms)|
+|---|---|---|---|
+|1|15.1613|24.2|19.1|
+|5|18.5373|34.8|27.2|
+|10|19.2781|54.8|36.7|
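+
+As a reference for reproducing this kind of comparison with the API described above, the following is a minimal timing sketch. It is not the harness used to produce the numbers in the tables; the helper name, model path, input shape, warm-up count, and repetition count are placeholders.
+
+``` c++
+#include <chrono>
+#include <functional>
+#include <numeric>
+#include <string>
+#include <vector>
+#include "paddle_inference_api.h"
+
+namespace paddle {
+
+// Hypothetical helper: average latency (ms) of one model, with or without
+// the Paddle-TRT subgraph engine enabled.
+double MeasureLatencyMs(bool use_trt, const std::string& model_dir,
+                        int batch_size, int repeat) {
+  AnalysisConfig config;
+  config.SetModel(model_dir + "/model", model_dir + "/params");
+  config.EnableUseGpu(100 /*init graphic memory, MB*/, 0 /*GPU id*/);
+  config.SwitchUseFeedFetchOps(false);
+  if (use_trt) {
+    config.EnableTensorRtEngine(1 << 20, batch_size, 3,
+                                AnalysisConfig::Precision::kFloat32,
+                                false, false);
+  }
+  auto predictor = CreatePaddlePredictor(config);
+
+  // Dummy input of shape [batch_size, 3, 224, 224] (illustrative only).
+  std::vector<float> input(batch_size * 3 * 224 * 224, 0.f);
+  auto input_names = predictor->GetInputNames();
+  auto input_t = predictor->GetInputTensor(input_names[0]);
+
+  auto run_once = [&]() {
+    input_t->Reshape({batch_size, 3, 224, 224});
+    input_t->copy_from_cpu(input.data());
+    CHECK(predictor->ZeroCopyRun());
+    auto output_names = predictor->GetOutputNames();
+    auto output_t = predictor->GetOutputTensor(output_names[0]);
+    auto shape = output_t->shape();
+    int out_num = std::accumulate(shape.begin(), shape.end(), 1,
+                                  std::multiplies<int>());
+    std::vector<float> out(out_num);
+    output_t->copy_to_cpu(out.data());
+  };
+
+  for (int i = 0; i < 10; ++i) run_once();  // warm up
+  auto start = std::chrono::steady_clock::now();
+  for (int i = 0; i < repeat; ++i) run_once();
+  auto end = std::chrono::steady_clock::now();
+  return std::chrono::duration<double, std::milli>(end - start).count() / repeat;
+}
+
+}  // namespace paddle
+```
+
+Calling the helper twice, once with `use_trt` set to false and once set to true, gives a rough before/after latency for a single model under the stated assumptions.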