To make the deployment of inference models more convenient, Fluid provides a set of high-level APIs that hide the diverse low-level optimization processes.
The inference library contains:
- header file `paddle_inference_api.h` which defines all interfaces
- library file `libpaddle_fluid.so` or `libpaddle_fluid.a`
Details are as follows:
## PaddleTensor
PaddleTensor defines basic format of input and output data for inference. Common fields are as follows:
- `name` is used to indicate the name of the variable in the model corresponding to the input data.
- `shape` represents the shape of the Tensor.
- `data` is stored contiguously in a `PaddleBuf`. `PaddleBuf` can wrap external data or allocate memory by itself; you can refer to the associated definitions in the header file.
- `dtype` represents the data type of the Tensor.

## Use Config to create different engines
Underneath the high-level API are various optimization implementations, called engines. Switching between engines is done by passing different Config objects:
- `NativeConfig`: the native engine, consisting of Paddle's native forward operators; it naturally supports all models trained by Paddle.
- `AnalysisConfig`: the TensorRT mixed engine. It is used for GPU acceleration and supports TensorRT via subgraphs. This engine supports all Paddle models and automatically slices suitable computing subgraphs out to TensorRT to speed them up (WIP). For specific usage, please refer to [here](http://paddlepaddle.org/documentation/docs/zh/1.1/user_guides/howto/inference/paddle_tensorrt_infer.html).

## <a name="Use AnalysisPredictor to perform high-performance inference"> Use AnalysisPredictor to perform high-performance inference</a>
Paddle Fluid uses AnalysisPredictor to perform inference. AnalysisPredictor is a high-performance inference engine. By analyzing the computation graph, the engine performs a series of optimizations on it (such as OP fusion, memory/graphic-memory optimization, and support for MKLDNN, TensorRT and other underlying acceleration libraries), which can greatly improve inference performance.
In order to show the complete inference process, the following is a complete example of using AnalysisPredictor. The specific concepts and configurations involved will be detailed in the following sections.
#### AnalysisPredictor sample
``` c++
#include "paddle_inference_api.h"

int main() {
  // the model can be downloaded from http://paddle-inference-dist.cdn.bcebos.com/tensorrt_test/mobilenet.tar.gz
  // RunAnalysis is a user-defined helper; one possible sketch of it is given below
  paddle::RunAnalysis(1, "./mobilenet");
  return 0;
}
```
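For completeness, here is one possible sketch of the `RunAnalysis` helper used above. It is illustrative rather than definitive: it assumes the mobilenet model linked in the comment (a `[batch, 3, 224, 224]` float input, non-combined format) and uses the AnalysisConfig and ZeroCopyTensor facilities described in the following sections; adapt paths, shapes and device settings to your own model.

```c++
#include <functional>
#include <numeric>
#include <string>
#include <vector>
#include "paddle_inference_api.h"

namespace paddle {

// Build an AnalysisConfig for the downloaded mobilenet model (non-combined format assumed).
void CreateConfig(AnalysisConfig* config, const std::string& model_dirname) {
  config->SetModel(model_dirname);                  // directory containing __model__ and the parameter files
  config->EnableUseGpu(100 /*MB*/, 0 /*gpu id*/);   // or config->DisableGpu(); for CPU
  config->SwitchIrOptim(true);                      // enable graph analysis and optimization
  config->SwitchUseFeedFetchOps(false);             // required when using ZeroCopyTensor for I/O
}

void RunAnalysis(int batch_size, const std::string& model_dirname) {
  // 1. create AnalysisConfig and the predictor
  AnalysisConfig config;
  CreateConfig(&config, model_dirname);
  auto predictor = CreatePaddlePredictor(config);

  // 2. prepare input: mobilenet expects a [batch, 3, 224, 224] float tensor
  std::vector<int> input_shape = {batch_size, 3, 224, 224};
  std::vector<float> input(batch_size * 3 * 224 * 224, 1.0f);  // dummy data
  auto input_names = predictor->GetInputNames();
  auto input_t = predictor->GetInputTensor(input_names[0]);
  input_t->Reshape(input_shape);
  input_t->copy_from_cpu(input.data());

  // 3. run
  predictor->ZeroCopyRun();

  // 4. fetch output
  auto output_names = predictor->GetOutputNames();
  auto output_t = predictor->GetOutputTensor(output_names[0]);
  std::vector<int> output_shape = output_t->shape();
  int out_num = std::accumulate(output_shape.begin(), output_shape.end(), 1,
                                std::multiplies<int>());
  std::vector<float> out_data(out_num);
  output_t->copy_to_cpu(out_data.data());
}

}  // namespace paddle
```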
## Process of Inference Deployment
In general, the steps are:
1. Use an appropriate configuration to create a `PaddlePredictor`
2. Create `PaddleTensor`s for the input and pass them to the `PaddlePredictor`
3. Fetch the output with `PaddleTensor`s

## <a name="Use AnalysisConfig to manage inference configurations"> Use AnalysisConfig to manage inference configurations</a>
AnalysisConfig manages the inference configuration of AnalysisPredictor, providing model path setting, selection of the device the inference engine runs on, and a variety of options to optimize the inference process. The configuration method is as follows:
#### General optimizing configuration
``` c++
config->SwitchIrOptim(true);  // enable analysis and optimization of the computation graph, including OP fusion, etc.
```
**Note:** Using ZeroCopyTensor requires the following setting:
``` c++
config->SwitchUseFeedFetchOps(false);  // disable feed and fetch OPs
```
The complete process of running a simple model is shown in the C++ inference sample section below, with some details omitted.
#### Set model and parameter paths
When loading a model from disk, there are two ways to set the model and parameter paths in AnalysisConfig, depending on how the model and parameter files are stored:
* Non-combined form: the model folder `model_dir` contains one model file and multiple parameter files; pass in the path of the model folder. The default name of the model file is `__model__`.
* Combined form: the model folder `model_dir` contains only one model file `model` and one parameter file `params`; pass in the paths of the model file and the parameter file.
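For illustration, a minimal sketch of the two configurations (the paths are placeholders; depending on your Paddle version the class may also live under `paddle::contrib`):

```c++
#include "paddle_inference_api.h"

void SetModelPath() {
  paddle::AnalysisConfig config;
  // Non-combined form: pass the model directory; the model file is expected to be named __model__
  config.SetModel("./model_dir");
  // Combined form: pass the model file and the parameter file separately
  // config.SetModel("./model_dir/model", "./model_dir/params");
}
```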
## Advanced Usage
### Memory management of input and output
The `data` field of `PaddleTensor` is a `PaddleBuf`, used to manage a section of memory for copying data.
There are two modes of memory management in `PaddleBuf`:
1. Automatic memory allocation and management
```c++
int some_size = 1024;
PaddleTensor tensor;
tensor.data.Resize(some_size);
```
2. Passing in external memory
```c++
int some_size = 1024;
// You can allocate memory outside and keep it available during the usage of PaddleTensor
void* memory = new char[some_size];

PaddleTensor tensor;
tensor.data.Reset(memory, some_size);
// ...

// You need to release the memory manually to avoid a memory leak
delete[] static_cast<char*>(memory);
```
Of the two modes, the first is more convenient, while the second gives strict control over memory management to facilitate integration with `tcmalloc` and other libraries.

## <a name="Use ZeroCopyTensor to manage I/O"> Use ZeroCopyTensor to manage I/O</a>
ZeroCopyTensor is the input/output data structure of AnalysisPredictor. Using ZeroCopyTensor avoids redundant data copies when preparing input and fetching output, which improves inference performance.

**Note:** When using ZeroCopyTensor, be sure to set `config->SwitchUseFeedFetchOps(false);`.
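As an illustration (not taken verbatim from the sample sources), a sketch of the typical ZeroCopyTensor flow, assuming a predictor created from an AnalysisConfig with `SwitchUseFeedFetchOps(false)` and a single float input:

```c++
#include <vector>
#include "paddle_inference_api.h"

// Feed one float input, run, and fetch the first output via ZeroCopyTensor.
std::vector<float> RunWithZeroCopy(paddle::PaddlePredictor* predictor,
                                   std::vector<float>& input,
                                   const std::vector<int>& input_shape) {
  auto input_names = predictor->GetInputNames();
  auto input_t = predictor->GetInputTensor(input_names[0]);
  input_t->Reshape(input_shape);
  input_t->copy_from_cpu(input.data());     // copy input into the predictor

  predictor->ZeroCopyRun();

  auto output_names = predictor->GetOutputNames();
  auto output_t = predictor->GetOutputTensor(output_names[0]);
  std::vector<int> output_shape = output_t->shape();
  int out_num = 1;
  for (int d : output_shape) out_num *= d;  // total number of output elements
  std::vector<float> out_data(out_num);
  output_t->copy_to_cpu(out_data.data());   // copy output back to host memory
  return out_data;
}
```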
## <a name="C++ inference sample"> C++ inference sample</a>
1. Download or compile C++ Inference Library, refer to [Install and Compile C++ Inference Library](./build_and_install_lib_en.html).
2. Download the [C++ inference sample](https://paddle-inference-dist.bj.bcebos.com/tensorrt_test/paddle_inference_sample_v1.7.tar.gz) and uncompress it, then enter the `sample/inference` directory.
The `inference` directory structure is as follows:
``` shell
inference
├── CMakeLists.txt
├── mobilenet_test.cc
├── thread_mobilenet_test.cc
├── mobilenetv1
│ ├── model
│ └── params
├── run.sh
└── run_impl.sh
```
- `mobilenet_test.cc` is the source code for single-thread inference.
- `thread_mobilenet_test.cc` is the source code for multi-thread inference.
- `mobilenetv1` is the model directory.
- `run.sh` is the script for running inference.
3. Configure script:
Before running, we need to configure the script `run.sh` as follows:
``` shell
# set whether to enable MKL, GPU or TensorRT. Enabling TensorRT requires WITH_GPU being ON
WITH_MKL=ON
WITH_GPU=OFF
USE_TENSORRT=OFF
# set path to CUDA lib dir, CUDNN lib dir, TensorRT root dir and model dir
LIB_DIR=YOUR_LIB_DIR
CUDA_LIB_DIR=YOUR_CUDA_LIB_DIR
CUDNN_LIB_DIR=YOUR_CUDNN_LIB_DIR
TENSORRT_ROOT_DIR=YOUR_TENSORRT_ROOT_DIR
MODEL_DIR=YOUR_MODEL_DIR
```
Please configure `run.sh` depending on your environment.
4. Build and run the sample.
``` shell
sh run.sh
```

### Tuning on CPU
1. If the CPU model allows, try to use the version with AVX and MKL.
2. You can try to use Intel's MKLDNN acceleration.
3. When enough CPU cores are available, you can increase the `num` value in the setting `config->SetCpuMathLibraryNumThreads(num);`.
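A sketch of these CPU settings on an AnalysisConfig (whether MKLDNN is available depends on how the library was built):

```c++
#include "paddle_inference_api.h"

void TuneForCpu(paddle::AnalysisConfig* config) {
  config->DisableGpu();
  config->EnableMKLDNN();                   // try Intel MKLDNN acceleration
  config->SetCpuMathLibraryNumThreads(4);   // increase when enough CPU cores are available
}
```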
### Tuning on GPU
1. You can try to turn on the TensorRT subgraph acceleration engine. Through graph analysis, Paddle can automatically fuse certain subgraphs and call NVIDIA TensorRT for acceleration. For details, please refer to [Use Paddle-TensorRT Library for inference](./paddle_tensorrt_infer_en.html).

### Tuning with multi-thread
Paddle Fluid supports optimizing inference performance by running multiple AnalysisPredictors on different threads, in both CPU and GPU environments.
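A structural sketch of one way to do this (the per-thread I/O is omitted; `Clone()` is assumed to give each thread its own predictor sharing the loaded weights):

```c++
#include <memory>
#include <thread>
#include <vector>
#include "paddle_inference_api.h"

void RunOnThread(paddle::PaddlePredictor* predictor) {
  // prepare inputs and run inference here, e.g. with ZeroCopyTensor as shown earlier
}

int main() {
  paddle::AnalysisConfig config;
  config.SetModel("./mobilenetv1/model", "./mobilenetv1/params");  // combined model files from the sample
  config.DisableGpu();  // or config.EnableUseGpu(100, 0);

  auto main_predictor = paddle::CreatePaddlePredictor(config);

  std::vector<std::unique_ptr<paddle::PaddlePredictor>> predictors;
  std::vector<std::thread> threads;
  for (int i = 0; i < 4; ++i) {
    predictors.emplace_back(main_predictor->Clone());  // one predictor per thread
  }
  for (int i = 0; i < 4; ++i) {
    threads.emplace_back(RunOnThread, predictors[i].get());
  }
  for (auto& t : threads) t.join();
  return 0;
}
```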
The sample for multi-threaded inference is `thread_mobilenet_test.cc` in the [sample](https://paddle-inference-dist.bj.bcebos.com/tensorrt_test/paddle_inference_sample_v1.7.tar.gz) downloaded above. You can change `mobilenet_test` in `run.sh` to `thread_mobilenet_test` to run multi-threaded inference.

### Upgrade performance based on contrib::AnalysisConfig
AnalysisConfig is at the pre-release stage and protected by `namespace contrib`, so it may be adjusted in the future.
Similar to `NativeConfig`, `AnalysisConfig` creates a high-performance inference engine after a series of optimizations, including analysis and optimization of the computation graph as well as fusion and tuning of some important Ops, which **greatly improves the performance of models containing Ops such as While, LSTM and GRU**.
The usage of `AnalysisConfig` is similar to that of `NativeConfig`, but the former *only supports CPU at present, with GPU support being improved*.
```c++
AnalysisConfig config;
config.SetModel(dirname);                // set the directory of the model
config.EnableUseGpu(100, 0 /*gpu id*/);  // use GPU, or
config.DisableGpu();                     // use CPU
config.SwitchSpecifyInputNames(true);    // the names of your inputs need to be specified
config.SwitchIrOptim();                  // turn on the optimization switch; a sequence of optimizations will be executed during inference
```
Note that input PaddleTensor needs to be allocated. Previous examples need to be revised as follows:
```c++
auto predictor =
    paddle::CreatePaddlePredictor<paddle::contrib::AnalysisConfig>(config);  // it needs AnalysisConfig here
// create input tensor
int64_t data[4] = {1, 2, 3, 4};
paddle::PaddleTensor tensor;
tensor.shape = std::vector<int>({4, 1});
tensor.data.Reset(data, sizeof(data));
tensor.dtype = paddle::PaddleDType::INT64;
tensor.name = "input0";  // the name needs to be set here
```
The subsequent execution process is exactly the same as with `NativeConfig`.
### variable-length sequence input
When dealing with variable-length sequence input, you need to set LoD for `PaddleTensor` .
``` c++
// Suppose the sequence lengths are [3, 2, 4, 1, 2, 3] in order.
tensor.lod = {{0,
               /*0 + 3=*/3,
               /*3 + 2=*/5,
               /*5 + 4=*/9,
               /*9 + 1=*/10,
               /*10 + 2=*/12,
               /*12 + 3=*/15}};
```
For more specific examples, please refer to [LoD-Tensor Instructions](../../../beginners_guide/basic_concept/lod_tensor_en.html).
### Suggestion for Performance
1. If the CPU type permits, it's best to use the versions with support for AVX and MKL.
2. Reuse input and output `PaddleTensor` to avoid frequent memory allocation resulting in low performance
3. Try to replace `NativeConfig` with `AnalysisConfig` to perform optimization for CPU or GPU inference
# Use Paddle-TensorRT Library for inference
NVIDIA TensorRT is a platform for high-performance deep learning inference. It delivers low latency and high throughput for deep learning inference applications.
Subgraphs are used in PaddlePaddle to preliminarily integrate TensorRT, which enables TensorRT to enhance the inference performance of Paddle models. The module is still under development. Currently supported models are as follows:
|classification|detection|segmentation|
|---|---|---|
|mobilenetv1|yolov3|ICNET|
|resnet50|SSD||
|vgg16|mask-rcnn||
|resnext|faster-rcnn||
|AlexNet|cascade-rcnn||
|Se-ResNext|retinanet||
|GoogLeNet|mobilenet-SSD||
|DPN|||
We will introduce the acquisition, usage and theory of the Paddle-TensorRT library in this documentation. The interfaces involved are defined in [`paddle_inference_api.h`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/api/paddle_inference_api.h).
**Note:**
1. When compiling from source, TensorRT library currently only supports GPU compilation, and you need to set the compilation option TensorRT_ROOT to the path where tensorrt is located.
2. Windows support requires TensorRT version 5.0 or higher.
3. Paddle-TRT currently only supports fixed input shape.
4. After downloading and installing TensorRT, you need to manually add virtual destructors for `class IPluginFactory` and `class IGpuAllocator` in the `NvInfer.h` file:
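That is, roughly the following additions (the surrounding class definitions come from your local TensorRT headers):

```c++
// inside class IPluginFactory in NvInfer.h, add:
virtual ~IPluginFactory() {}
// inside class IGpuAllocator in NvInfer.h, add:
virtual ~IGpuAllocator() {}
```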
2. Download NVIDIA TensorRT (with a version of CUDA and cuDNN consistent with your local environment) from [NVIDIA TensorRT](https://developer.nvidia.com/nvidia-tensorrt-download) with an NVIDIA developer account.
3. Download the [Paddle inference sample](https://paddle-inference-dist.bj.bcebos.com/tensorrt_test/paddle_inference_sample_v1.7.tar.gz) and uncompress it, then enter the `sample/paddle-TRT` directory.
`paddle-TRT` directory structure is as follows:
```
paddle-TRT
├── CMakeLists.txt
├── mobilenet_test.cc
├── fluid_generate_calib_test.cc
├── fluid_int8_test.cc
├── mobilenetv1
│ ├── model
│ └── params
├── run.sh
└── run_impl.sh
```
- `mobilenet_test.cc` is the C++ source code for inference using Paddle-TRT
- `fluid_generate_calib_test.cc` is the C++ source code for generating the calibration table using Paddle-TRT INT8 calibration
- `fluid_int8_test.cc` is the C++ source code for inference using Paddle-TRT INT8
- `mobilenetv1` is the model dir
- `run.sh` is the script for running inference
Here we assume that the current directory is `SAMPLE_BASE_DIR/sample/paddle-TRT`.
``` shell
# set whether to enable MKL, GPU or TensorRT. Enabling TensorRT requires WITH_GPU being ON
WITH_MKL=ON
WITH_GPU=OFF
USE_TENSORRT=OFF
# set path to CUDA lib dir, CUDNN lib dir, TensorRT root dir and model dir
LIB_DIR=YOUR_LIB_DIR
CUDA_LIB_DIR=YOUR_CUDA_LIB_DIR
CUDNN_LIB_DIR=YOUR_CUDNN_LIB_DIR
TENSORRT_ROOT_DIR=YOUR_TENSORRT_ROOT_DIR
MODEL_DIR=YOUR_MODEL_DIR
```
Please configure `run.sh` depending on your environment.
The parameters of a neural network are redundant to some extent. In many tasks, we can convert a trained Float32 model into an Int8 model with little loss of precision. At present, Paddle-TRT supports converting a trained Float32 model into an Int8 model offline. The specific process is as follows:
1) **Create the calibration table**. We prepare about 500 real input samples and feed them to the model. Paddle-TRT collects the range information of the input and output values of each op in the model and records it in the calibration table. This information reduces the information loss during model conversion.
2) After creating the calibration table, run the model again. **Paddle-TRT will load the calibration table automatically** and conduct inference in INT8 mode.
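For reference, a sketch of how Paddle-TRT INT8 with calibration can be enabled through AnalysisConfig in your own code, roughly what the sample sources above configure; the exact `EnableTensorRtEngine` argument list varies across Paddle versions, so check it against your installed headers:

```c++
#include "paddle_inference_api.h"

void EnableTrtInt8(paddle::AnalysisConfig* config) {
  config->SetModel("./mobilenetv1/model", "./mobilenetv1/params");  // combined model files from the sample
  config->EnableUseGpu(100, 0);
  // With precision kInt8 and use_calib_mode=true, the first run collects calibration
  // data and later runs load the calibration table and execute in INT8 mode.
  config->EnableTensorRtEngine(1 << 20 /* workspace_size */,
                               1       /* max_batch_size */,
                               3       /* min_subgraph_size */,
                               paddle::AnalysisConfig::Precision::kInt8,
                               false   /* use_static */,
                               true    /* use_calib_mode */);
}
```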
2. Compile and test the INT8 example
Change `mobilenet_test` in `run.sh` to `fluid_generate_calib_test` and run
``` shell
sh run.sh
```
We generate 500 input samples to simulate the calibration process; it is suggested that you use real data for the experiment. After the run finishes, a new file named trt_calib_* will appear under the `SAMPLE_BASE_DIR/sample/paddle-TRT/build/mobilenetv1/_opt_cache` directory; this is the calibration table.
Then copy the model directory together with the calibration table (the `_opt_cache` directory) to a new path such as `SAMPLE_BASE_DIR/sample/paddle-TRT/mobilenetv1_calib`.
Change `fluid_generate_calib_test` in `run.sh` to `fluid_int8_test`, change the model dir path to `SAMPLE_BASE_DIR/sample/paddle-TRT/mobilenetv1_calib`, and run `sh run.sh` again.
Subgraphs are used to integrate TensorRT in PaddlePaddle. After the model is loaded, the neural network is represented as a computation graph composed of variables and computing nodes. Paddle TensorRT scans the whole graph, discovers subgraphs that can be optimized with TensorRT, and replaces them with TensorRT nodes. During inference, Paddle calls the TensorRT library to execute the TensorRT nodes and calls its native implementation for the other nodes. TensorRT can fuse Ops horizontally and vertically to filter out redundant Ops, and is able to choose appropriate kernels for specific Ops on specific platforms to speed up inference.
We can see in the Original Network that the green nodes represent nodes supported by TensorRT, the red nodes represent variables in the network, and the yellow nodes represent nodes that can only be executed by Paddle's native implementation. The green nodes in the original network are extracted to compose a subgraph, which is replaced by a single TensorRT node, shown as the `block-25` node in the transformed network. When such a node is encountered at runtime, the TensorRT library is called to execute it.