Commit 30e510a1 authored by TeslaZhao

Merge branch 'develop' of https://github.com/TeslaZhao/Serving into develop

......@@ -79,6 +79,7 @@ The first step is to call the model save interface to generate a model parameter
- [Encryption](doc/C++_Serving/Encryption_EN.md)
- [Analyze and optimize performance(Chinese)](doc/C++_Serving/Performance_Tuning_CN.md)
- [Benchmark(Chinese)](doc/C++_Serving/Benchmark_CN.md)
- [Multiple models in series(Chinese)](doc/C++_Serving/2+_model.md)
- [Python Pipeline](doc/Python_Pipeline/Pipeline_Design_EN.md)
- [Analyze and optimize performance](doc/Python_Pipeline/Performance_Tuning_EN.md)
- [Benchmark(Chinese)](doc/Python_Pipeline/Benchmark_CN.md)
......
......@@ -74,6 +74,7 @@ Relying on the deep learning framework PaddlePaddle, Paddle Serving aims to help deep learning developers
- [Encrypted Model Inference Service](doc/C++_Serving/Encryption_CN.md)
- [Performance Tuning Guide](doc/C++_Serving/Performance_Tuning_CN.md)
- [Benchmarks](doc/C++_Serving/Benchmark_CN.md)
- [Multiple Models in Series](doc/C++_Serving/2+_model.md)
- [Python Pipeline Design](doc/Python_Pipeline/Pipeline_Design_CN.md)
- [Performance Tuning Guide](doc/Python_Pipeline/Performance_Tuning_CN.md)
- [Benchmarks](doc/Python_Pipeline/Benchmark_CN.md)
......
......@@ -10,115 +10,214 @@
- Disadvantage: requires code changes and a rebuild.
This document describes the second, more efficient approach. Its basic steps are as follows:
1. Define a custom OP
2. Define a custom DAG
3. Compile
4. Start and call the service
1. Define a custom OP (i.e., the preprocessing, model inference, and postprocessing for a single model)
2. Compile
3. Start and call the service
# 1. Define a Custom OP
An OP is the basic building block of the Paddle Serving server-side processing flow (the DAG). See [Define a Custom OP from Scratch](./OP_CN.md); that document only covers how to define an OP node that calls prediction, on top of which you can add preprocessing and postprocessing.
First, take the output of the preceding OP as the input of this OP; you can then preprocess the data as needed by modifying the memory pointed to by `TensorVector* in`.
``` c++
const GeneralBlob *input_blob = get_depend_argument<GeneralBlob>(pre_name());
const TensorVector *in = &input_blob->tensor_vector;
```
An OP defines the preprocessing, model inference, and postprocessing for a single model. Defining an OP takes the following two steps:
1. Define the C++ .h header file
2. Define the C++ .cpp source file
## 1.1 Define the C++ .h header file
Copy the code below and replace `/*自定义Class名称*/` with your own class name, e.g. `GeneralDetectionOp`.
Place the file under `core/general-server/op/`; the file name is up to you, e.g. `general_detection_op.h`.
``` C++
#pragma once
#include <string>
#include <vector>
#include "core/general-server/general_model_service.pb.h"
#include "core/general-server/op/general_infer_helper.h"
#include "paddle_inference_api.h" // NOLINT
namespace baidu {
namespace paddle_serving {
namespace serving {
class /*自定义Class名称*/
: public baidu::paddle_serving::predictor::OpWithChannel<GeneralBlob> {
public:
typedef std::vector<paddle::PaddleTensor> TensorVector;
DECLARE_OP(/*自定义Class名称*/);
int inference();
};
} // namespace serving
} // namespace paddle_serving
} // namespace baidu
```
## 1.2 Define the C++ .cpp source file
Copy the code below and replace `/*自定义Class名称*/` with your own class name, e.g. `GeneralDetectionOp`.
Add your preprocessing and postprocessing code at the positions marked by the preprocessing and postprocessing comments in the code below.
Place the file under `core/general-server/op/`; the file name is up to you, e.g. `general_detection_op.cpp`.
``` C++
#include "core/general-server/op/自定义的头文件名"
#include <algorithm>
#include <iostream>
#include <memory>
#include <sstream>
#include "core/predictor/framework/infer.h"
#include "core/predictor/framework/memory.h"
#include "core/predictor/framework/resource.h"
#include "core/util/include/timer.h"
namespace baidu {
namespace paddle_serving {
namespace serving {
using baidu::paddle_serving::Timer;
using baidu::paddle_serving::predictor::MempoolWrapper;
using baidu::paddle_serving::predictor::general_model::Tensor;
using baidu::paddle_serving::predictor::general_model::Response;
using baidu::paddle_serving::predictor::general_model::Request;
using baidu::paddle_serving::predictor::InferManager;
using baidu::paddle_serving::predictor::PaddleGeneralModelConfig;
int /*自定义Class名称*/::inference() {
//Get the preceding OP node(s).
const std::vector<std::string> pre_node_names = pre_names();
if (pre_node_names.size() != 1) {
LOG(ERROR) << "This op(" << op_name()
<< ") can only have one predecessor op, but received "
<< pre_node_names.size();
return -1;
}
const std::string pre_name = pre_node_names[0];
//Take the output of the preceding OP as the input of this OP.
GeneralBlob *input_blob = mutable_depend_argument<GeneralBlob>(pre_name);
if (!input_blob) {
LOG(ERROR) << "input_blob is nullptr,error";
return -1;
}
TensorVector *in = &input_blob->tensor_vector;
uint64_t log_id = input_blob->GetLogId();
int batch_size = input_blob->_batch_size;
```
Declare the output of this OP:
``` c++
//Initialize the output of this OP.
GeneralBlob *output_blob = mutable_data<GeneralBlob>();
output_blob->SetLogId(log_id);
output_blob->_batch_size = batch_size;
VLOG(2) << "(logid=" << log_id << ") infer batch size: " << batch_size;
TensorVector *out = &output_blob->tensor_vector;
int batch_size = input_blob->GetBatchSize();
output_blob->SetBatchSize(batch_size);
```
After finishing preprocessing and declaring the output variables, the core call into the inference engine is the single statement below:
``` c++
if (InferManager::instance().infer(engine_name().c_str(), in, out, batch_size)) {
LOG(ERROR) << "Failed do infer in fluid model: " << engine_name().c_str();
//Add your preprocessing code here; preprocess by directly modifying the TensorVector* in above.
//Note that the data in `in` is the postprocessed `out` data of the preceding node.
Timer timeline;
int64_t start = timeline.TimeStampUS();
timeline.Start();
// Pass the preprocessed `in` and the initialized `out` to run model inference; the inference output is written directly into the memory pointed to by `out`.
// If you want an OP that only processes data without calling a model, simply delete the block of code below.
if (InferManager::instance().infer(
engine_name().c_str(), in, out, batch_size)) {
LOG(ERROR) << "(logid=" << log_id
<< ") Failed do infer in fluid model: " << engine_name().c_str();
return -1;
}
//Add your postprocessing code here; postprocess by directly modifying the TensorVector* out above.
//The postprocessed `out` is passed on to the downstream node.
int64_t end = timeline.TimeStampUS();
CopyBlobInfo(input_blob, output_blob);
AddBlobInfo(output_blob, start);
AddBlobInfo(output_blob, end);
return 0;
}
DEFINE_OP(/*自定义Class名称*/);

}  // namespace serving
}  // namespace paddle_serving
}  // namespace baidu
```
At this point, the model's prediction output has been written into the memory pointed to by the `TensorVector* out` pointer bound to this OP. You can now postprocess the data by modifying the memory that `TensorVector* out` points to; the next downstream OP will take this OP's output as its input.
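For illustration only, here is a hedged fragment of what such a postprocessing step might look like at the marked position inside `inference()`. It assumes the first output tensor holds FLOAT32 data, and the scaling is a placeholder for your own logic, not anything from the original code.
``` c++
// Hedged postprocessing sketch: modify the first output tensor of `out` in place.
paddle::PaddleTensor& out_tensor = out->at(0);
float* out_data = static_cast<float*>(out_tensor.data.data());
size_t num_elements = out_tensor.data.length() / sizeof(float);
for (size_t i = 0; i < num_elements; ++i) {
  out_data[i] *= 0.5f;  // placeholder transformation; replace with real postprocessing
}
```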
Finally, if you start the server side via the Python API: after defining the C++ operator for Paddle Serving on the server side, the last step is to register it with the Paddle Serving server Python API. The registration code lives in `python/paddle_serving_server/dag.py`, as shown below:
``` python
self.op_dict = {
"general_infer": "GeneralInferOp",
"general_reader": "GeneralReaderOp",
"general_response": "GeneralResponseOp",
"general_text_reader": "GeneralTextReaderOp",
"general_text_response": "GeneralTextResponseOp",
"general_single_kv": "GeneralSingleKVOp",
"general_dist_kv_infer": "GeneralDistKVInferOp",
"general_dist_kv": "GeneralDistKVOp",
"general_copy": "GeneralCopyOp",
"general_detection":"GeneralDetectionOp",
}
```
Here the name on the left, e.g. `"general_infer"`, is user-defined (it is used below), and the value on the right, e.g. `"GeneralInferOp"`, is the class name of your custom C++ OP.
In `python/paddle_serving_server/server.py`, add only the class names of custom C++ OPs that load a model and run inference. For example, `general_reader` only does some simple data processing and does not load a model or call prediction, so it is added to the code above but not to the code below.
``` python
default_engine_types = [
'GeneralInferOp',
'GeneralDistKVInferOp',
'GeneralDistKVQuantInferOp',
'GeneralDetectionOp',
]
```
### The TensorVector data structure
Both `TensorVector* in` and `out` are pointers of type TensorVector, and they are used almost exactly like the Tensor in the Paddle C++ API. The relevant data structures are shown below:
``` C++
//TensorVector
typedef std::vector<paddle::PaddleTensor> TensorVector;
//paddle::PaddleTensor
struct PD_INFER_DECL PaddleTensor {
PaddleTensor() = default;
std::string name; ///< variable name.
std::vector<int> shape;
PaddleBuf data; ///< blob of data.
PaddleDType dtype;
std::vector<std::vector<size_t>> lod; ///< Tensor+LoD equals LoDTensor
};
//PaddleBuf
class PD_INFER_DECL PaddleBuf {
public:
explicit PaddleBuf(size_t length)
: data_(new char[length]), length_(length), memory_owned_(true) {}
PaddleBuf(void* data, size_t length)
: data_(data), length_(length), memory_owned_{false} {}
explicit PaddleBuf(const PaddleBuf& other);
void Resize(size_t length);
void Reset(void* data, size_t length);
bool empty() const { return length_ == 0; }
void* data() const { return data_; }
size_t length() const { return length_; }
~PaddleBuf() { Free(); }
PaddleBuf& operator=(const PaddleBuf&);
PaddleBuf& operator=(PaddleBuf&&);
PaddleBuf() = default;
PaddleBuf(PaddleBuf&& other);
private:
void Free();
void* data_{nullptr}; ///< pointer to the data memory.
size_t length_{0}; ///< number of memory bytes.
bool memory_owned_{true};
};
```
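As a side note, the sketch below illustrates the two PaddleBuf ownership modes defined above. It is only a hedged, self-contained illustration; the buffer sizes and values are arbitrary and not taken from Paddle Serving code.
``` C++
#include <vector>
#include "paddle_inference_api.h"  // NOLINT

int main() {
  // Owning buffer: PaddleBuf allocates the memory and frees it in its destructor.
  paddle::PaddleBuf owned(16 * sizeof(float));
  static_cast<float*>(owned.data())[0] = 1.0f;

  // Non-owning buffer: wraps caller-managed memory; PaddleBuf will not free it.
  std::vector<float> external(16, 0.0f);
  paddle::PaddleBuf borrowed(external.data(), external.size() * sizeof(float));

  owned.Resize(32 * sizeof(float));                    // grow the owned buffer
  borrowed.Reset(external.data(), 8 * sizeof(float));  // rebind to a smaller view
  return 0;
}
```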
# 2. Define a Custom DAG
The DAG is the basic definition of the server-side processing flow. With the OPs defined above, refer to [Define a Custom DAG](./DAG_CN.md) to build the processing relations among the multiple models (i.e., multiple OPs) on the server side.
The framework generally requires a `general_reader` at the beginning and a `general_response` at the end, with the custom OPs that actually call prediction in between. For example, `general_infer` is a default OP defined by the framework that only calls prediction, with no pre/postprocessing.
For example, the OCR model actually chains the det and rec models. We can build the DAG with a custom `"general_detection"` and `"general_infer"` (note: these names must correspond exactly to the Python API registration above). The code (`python/paddle_serving_server/serve.py`) works as shown below.
``` python
import paddle_serving_server as serving
from paddle_serving_server import OpMaker
from paddle_serving_server import OpSeqMaker
op_maker = serving.OpMaker()
read_op = op_maker.create('general_reader')
general_detection_op = op_maker.create('general_detection')
general_infer_op = op_maker.create('general_infer')
general_response_op = op_maker.create('general_response')
op_seq_maker = serving.OpSeqMaker()
op_seq_maker.add_op(read_op)
op_seq_maker.add_op(general_detection_op)
op_seq_maker.add_op(general_infer_op)
op_seq_maker.add_op(general_response_op)
```
### TensorVector code example
```C++
/* For example, to access the first Tensor in the input data */
paddle::PaddleTensor& tensor_1 = in->at(0);
/* For example, to modify the name of the first Tensor in the input data */
tensor_1.name = "new name";
/* For example, to get the shape of the first Tensor in the input data */
std::vector<int> tensor_1_shape = tensor_1.shape;
/* For example, to modify the data of the first Tensor in the input data */
void* data_1 = tensor_1.data.data();
// Then simply modify the memory pointed to by data_1.
// For example, if your data is of type int, cast the void* to int* and process it.
```
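Conversely, when postprocessing needs to build a fresh output tensor rather than modify one in place, a hedged sketch of one possible way to do so is shown below. The helper name `FillOutputTensor`, the tensor name `"postprocess_result"`, and the float result buffer are assumptions for illustration, not part of the original document.
``` C++
#include <cstring>
#include <vector>
#include "paddle_inference_api.h"  // NOLINT

// Hedged sketch: write a postprocessed float buffer into a single FLOAT32
// output tensor of `out` (the TensorVector* bound to this OP).
void FillOutputTensor(std::vector<paddle::PaddleTensor>* out,
                      const std::vector<float>& result) {
  out->resize(1);                                   // one output tensor
  paddle::PaddleTensor& t = out->at(0);
  t.name = "postprocess_result";                    // hypothetical name
  t.shape = {1, static_cast<int>(result.size())};   // [batch, n]
  t.dtype = paddle::PaddleDType::FLOAT32;
  t.data.Resize(result.size() * sizeof(float));     // allocate the PaddleBuf
  std::memcpy(t.data.data(), result.data(), result.size() * sizeof(float));
}
```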
# 3. Compile
# 2. Compile
At this point you need to rebuild the serving binary, set the `SERVING_BIN` environment variable (via `export SERVING_BIN`) to point to the binary you built, and install the related Python packages with `pip3 install`. For details, see [How to compile Serving](../Compile_CN.md).
# 4. Start and Call the Service
## 4.1 Starting the server
Still taking the OCR model as an example, the scripts for starting the det model and the rec model as separate single-model services are as follows:
```python
# Start the models separately
python3 -m paddle_serving_server.serve --model ocr_det_model --port 9293  # det model
python3 -m paddle_serving_server.serve --model ocr_rec_model --port 9294  # rec model
```
With the previous three sections done, to start one service that chains the two models, simply pass the relative paths of the model folders, in order, after `--model`. The script is as follows:
# 3. Start and Call the Service
## 3.1 Starting the server
With the previous two sections done, to start one service that chains the two models, pass the relative paths of the model folders, in order, after `--model`, and the custom C++ OP class names, in order, after `--op`; the models after `--model` must correspond one-to-one, in order, with the class names after `--op`. Assuming we have already defined two OPs, GeneralDetectionOp and GeneralRecOp, the script is as follows:
```python
# One service serving multiple chained models
python3 -m paddle_serving_server.serve --model ocr_det_model ocr_rec_model --port 9295  # chained models
python3 -m paddle_serving_server.serve --model ocr_det_model ocr_rec_model --op GeneralDetectionOp GeneralRecOp --port 9292
# Chained models: ocr_det_model corresponds to GeneralDetectionOp, ocr_rec_model to GeneralRecOp
```
## 4.2 Calling from the client
At this point the client call also needs to pass in the two clients' [proto definitions](./Serving_Configure_CN.md). The Python script is as follows:
## 3.2 Calling from the client
At this point the client call also needs the paths of the two clients' proto files or folders. Taking OCR as an example, you can refer to [ocr_cpp_client.py](../../examples/C++/PaddleOCR/ocr/ocr_cpp_client.py) to write your own script; the client call then looks like this:
```python
# One service serving multiple chained models
python3 ocr_cpp_client.py ocr_det_client ocr_rec_client
# ocr_det_client is the relative path of the first model's client-side proto folder
# ocr_rec_client is the relative path of the second model's client-side proto folder
python3 your_custom_script.py ocr_det_client ocr_rec_client
# ocr_det_client is the relative path of the first model's client-side proto folder
# ocr_rec_client is the relative path of the second model's client-side proto folder
```
At this point, on the server side, `'general_reader'` checks that the format of the input data matches the first model's client-side proto definition, and `'general_response'` ensures that the output data format matches the second model's client-side proto file.
At this point, on the server side, the input data format follows the first model's client-side proto definition, and the output data format follows the last model's client-side proto file. Normally you do not need to worry about this; for the detailed [proto definition, see here](./Serving_Configure_CN.md).
......@@ -30,9 +30,9 @@ from paddle_serving_server import OpMaker
from paddle_serving_server import OpSeqMaker
op_maker = serving.OpMaker()
read_op = op_maker.create('general_reader')
general_infer_op = op_maker.create('general_infer')
general_response_op = op_maker.create('general_response')
read_op = op_maker.create('GeneralReaderOp')
general_infer_op = op_maker.create('GeneralInferOp')
general_response_op = op_maker.create('GeneralResponseOp')
op_seq_maker = serving.OpSeqMaker()
op_seq_maker.add_op(read_op)
......@@ -65,13 +65,13 @@ from paddle_serving_server import OpGraphMaker
from paddle_serving_server import Server
op_maker = OpMaker()
read_op = op_maker.create('general_reader')
read_op = op_maker.create('GeneralReaderOp')
cnn_infer_op = op_maker.create(
'general_infer', engine_name='cnn', inputs=[read_op])
'GeneralInferOp', engine_name='cnn', inputs=[read_op])
bow_infer_op = op_maker.create(
'general_infer', engine_name='bow', inputs=[read_op])
'GeneralInferOp', engine_name='bow', inputs=[read_op])
response_op = op_maker.create(
'general_response', inputs=[cnn_infer_op, bow_infer_op])
'GeneralResponseOp', inputs=[cnn_infer_op, bow_infer_op])
op_graph_maker = OpGraphMaker()
op_graph_maker.add_op(read_op)
......@@ -92,10 +92,10 @@ from paddle_serving_server import OpMaker
from paddle_serving_server import OpSeqMaker
op_maker = serving.OpMaker()
read_op = op_maker.create('general_reader')
dist_kv_op = op_maker.create('general_dist_kv')
general_infer_op = op_maker.create('general_infer')
general_response_op = op_maker.create('general_response')
read_op = op_maker.create('GeneralReaderOp')
dist_kv_op = op_maker.create('GeneralDistKVInferOp')
general_infer_op = op_maker.create('GeneralInferOp')
general_response_op = op_maker.create('GeneralResponseOp')
op_seq_maker = serving.OpSeqMaker()
op_seq_maker.add_op(read_op)
......
......@@ -29,9 +29,9 @@ from paddle_serving_server import OpMaker
from paddle_serving_server import OpSeqMaker
op_maker = serving.OpMaker()
read_op = op_maker.create('general_reader')
general_infer_op = op_maker.create('general_infer')
general_response_op = op_maker.create('general_response')
read_op = op_maker.create('GeneralReaderOp')
general_infer_op = op_maker.create('GeneralInferOp')
general_response_op = op_maker.create('GeneralResponseOp')
op_seq_maker = serving.OpSeqMaker()
op_seq_maker.add_op(read_op)
......@@ -63,13 +63,13 @@ from paddle_serving_server import OpGraphMaker
from paddle_serving_server import Server
op_maker = OpMaker()
read_op = op_maker.create('general_reader')
read_op = op_maker.create('GeneralReaderOp')
cnn_infer_op = op_maker.create(
'general_infer', engine_name='cnn', inputs=[read_op])
'GeneralInferOp', engine_name='cnn', inputs=[read_op])
bow_infer_op = op_maker.create(
'general_infer', engine_name='bow', inputs=[read_op])
'GeneralInferOp', engine_name='bow', inputs=[read_op])
response_op = op_maker.create(
'general_response', inputs=[cnn_infer_op, bow_infer_op])
'GeneralResponseOp', inputs=[cnn_infer_op, bow_infer_op])
op_graph_maker = OpGraphMaker()
op_graph_maker.add_op(read_op)
......@@ -90,10 +90,10 @@ from paddle_serving_server import OpMaker
from paddle_serving_server import OpSeqMaker
op_maker = serving.OpMaker()
read_op = op_maker.create('general_reader')
dist_kv_op = op_maker.create('general_dist_kv')
general_infer_op = op_maker.create('general_infer')
general_response_op = op_maker.create('general_response')
read_op = op_maker.create('GeneralReaderOp')
dist_kv_op = op_maker.create('GeneralDistKVInferOp')
general_infer_op = op_maker.create('GeneralInferOp')
general_response_op = op_maker.create('GeneralResponseOp')
op_seq_maker = serving.OpSeqMaker()
op_seq_maker.add_op(read_op)
......
......@@ -85,7 +85,7 @@ print(fetch_map)
Users can also put the data-format processing logic on the server side, so the service can be accessed directly with curl. See the following example in the directory `Serving/examples/C++/fit_a_line`.
```
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --name uci
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292
```
Client input:
```
......
......@@ -57,7 +57,7 @@ Here, `client.predict` function has two arguments. `feed` is a `python dict` wit
Users can also put the data format processing logic on the server side, so that they can directly use curl to access the service, refer to the following case whose path is `Serving/examples/C++/fit_a_line`
```
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --name uci
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292
```
for client side,
```
......
# How to Configure TensorRT Dynamic Shape
(简体中文|[English](./TensorRT_Dynamic_Shape_EN.md))
## Introduction
After enabling TensorRT with `--use_trt` in Pipeline/C++ Serving, dynamic shapes can be configured as follows; examples are given for Pipeline Serving and C++ Serving respectively.
The dynamic shape API is:
```
void SetTRTDynamicShapeInfo(
std::map<std::string, std::vector<int>> min_input_shape,
std::map<std::string, std::vector<int>> max_input_shape,
std::map<std::string, std::vector<int>> optim_input_shape,
bool disable_trt_plugin_fp16 = false);
```
For a detailed API description, refer to [C++](https://paddleinference.paddlepaddle.org.cn/api_reference/cxx_api_doc/Config/GPUConfig.html#tensorrt)/[Python](https://paddleinference.paddlepaddle.org.cn/api_reference/python_api_doc/Config/GPUConfig.html#tensorrt).
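Before the full examples below, here is a minimal hedged sketch of how the API is typically called for a single input. The input name `"input_ids"`, the shapes, and the helper function are illustrative assumptions; `config` is the Paddle Inference `AnalysisConfig` being set up, as in the C++ Serving example that follows.
```
#include <map>
#include <string>
#include <vector>
#include "paddle_inference_api.h"  // NOLINT

// Hedged sketch: configure TensorRT dynamic shape for one hypothetical
// input "input_ids" whose layout is [batch_size, seq_len].
void SetDynamicShapeForSingleInput(paddle::AnalysisConfig* config) {
  std::map<std::string, std::vector<int>> min_input_shape{{"input_ids", {1, 1}}};
  std::map<std::string, std::vector<int>> max_input_shape{{"input_ids", {32, 128}}};
  std::map<std::string, std::vector<int>> optim_input_shape{{"input_ids", {8, 64}}};
  // The fourth argument, disable_trt_plugin_fp16, defaults to false.
  config->SetTRTDynamicShapeInfo(
      min_input_shape, max_input_shape, optim_input_shape);
}
```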
### C++ Serving
Modify the following code in `**/paddle_inference/paddle/include/paddle_engine.h`:
```
if (engine_conf.has_use_trt() && engine_conf.use_trt()) {
config.SwitchIrOptim(true);
if (!engine_conf.has_use_gpu() || !engine_conf.use_gpu()) {
config.EnableUseGpu(50, gpu_id);
if (engine_conf.has_gpu_multi_stream() &&
engine_conf.gpu_multi_stream()) {
config.EnableGpuMultiStream();
}
}
config.EnableTensorRtEngine((1 << 30) + (1 << 29),
max_batch,
min_subgraph_size,
precision_type,
true,
FLAGS_use_calib);
// set trt dynamic shape
{
int bsz = 1;
int max_seq_len = 512;
std::map<std::string, std::vector<int>> min_input_shape;
std::map<std::string, std::vector<int>> max_input_shape;
std::map<std::string, std::vector<int>> optim_input_shape;
int hidden_size = 0;
min_input_shape["stack_0.tmp_0"] = {1, 16, 1, 1};
min_input_shape["stack_1.tmp_0"] = {1, 2, 1, 1};
min_input_shape["input_mask"] = {1, 1, 1};
min_input_shape["_generated_var_64"] = {1, 1, 768};
min_input_shape["fc_0.tmp_0"] = {1, 1, 768};
min_input_shape["_generated_var_87"] = {1, 1, 768};
min_input_shape["tmp_175"] = {1, 1, 768};
min_input_shape["c_allreduce_sum_0.tmp_0"] = {1,1, 12288};
min_input_shape["embedding_1.tmp_0"] = {1, 1, 12288};
max_input_shape["stack_0.tmp_0"] = {bsz, 16, max_seq_len, max_seq_len};
max_input_shape["stack_1.tmp_0"] = {bsz, 2, max_seq_len, max_seq_len};
max_input_shape["input_mask"] = {bsz, max_seq_len, max_seq_len};
max_input_shape["_generated_var_64"] = {bsz, max_seq_len, 768};
max_input_shape["fc_0.tmp_0"] = {bsz, max_seq_len, 768};
max_input_shape["_generated_var_87"] = {bsz, max_seq_len, 768};
max_input_shape["tmp_175"] = {bsz, max_seq_len, 768};
max_input_shape["c_allreduce_sum_0.tmp_0"] = {bsz,max_seq_len, 12288};
max_input_shape["embedding_1.tmp_0"] = {bsz, max_seq_len, 12288};
int g1 = 0;
int g2 = 0;
int t1 = 0;
int t2 = 0;
std::string var_name = "_generated_var_";
std::string tmp_name = "tmp_";
for (int i = 0; i < 44; ++i) {
if (i > 32) {
hidden_size = 768;
g1 = 2*i-1;
g2 = 2*i;
t1 = 4*i-1;
t2 = 4*i;
min_input_shape[var_name+std::to_string(g1)] = {1, 1, hidden_size};
min_input_shape[var_name+std::to_string(g2)] = {1, 1, hidden_size};
min_input_shape[tmp_name+std::to_string(t1)] = {1, 1, hidden_size};
min_input_shape[tmp_name+std::to_string(t2)] = {1, 1, hidden_size};
max_input_shape[var_name+std::to_string(g1)] = {bsz, max_seq_len, hidden_size};
max_input_shape[var_name+std::to_string(g2)] = {bsz, max_seq_len, hidden_size};
max_input_shape[tmp_name+std::to_string(t1)] = {bsz, max_seq_len, hidden_size};
max_input_shape[tmp_name+std::to_string(t2)] = {bsz, max_seq_len, hidden_size};
}
if (i <32) {
hidden_size = 12288;
g1 = 2*i;
g2 = 2*i+1;
t1 = 4*i;
t2 = 4*i+3;
min_input_shape[var_name+std::to_string(g1)] = {1, 1, hidden_size};
min_input_shape[var_name+std::to_string(g2)] = {1, 1, hidden_size};
min_input_shape[tmp_name+std::to_string(t1)] = {1, 1, hidden_size};
min_input_shape[tmp_name+std::to_string(t2)] = {1, 1, hidden_size};
max_input_shape[var_name+std::to_string(g1)] = {bsz, max_seq_len, hidden_size};
max_input_shape[var_name+std::to_string(g2)] = {bsz, max_seq_len, hidden_size};
max_input_shape[tmp_name+std::to_string(t1)] = {bsz, max_seq_len, hidden_size};
max_input_shape[tmp_name+std::to_string(t2)] = {bsz, max_seq_len, hidden_size};
}
}
optim_input_shape = max_input_shape;
config.SetTRTDynamicShapeInfo(
min_input_shape, max_input_shape, optim_input_shape);
}
LOG(INFO) << "create TensorRT predictor";
}
```
### Pipeline Serving
Modify the following code in `**/python/paddle_serving_app/local_predict.py`:
```
if use_trt:
config.enable_tensorrt_engine(
precision_mode=precision_type,
workspace_size=1 << 20,
max_batch_size=32,
min_subgraph_size=3,
use_static=False,
use_calib_mode=False)
head_number = 12
names = [
"placeholder_0", "placeholder_1", "placeholder_2", "stack_0.tmp_0"
]
min_input_shape = [1, 1, 1]
max_input_shape = [100, 128, 1]
opt_input_shape = [10, 60, 1]
config.set_trt_dynamic_shape_info(
{
names[0]: min_input_shape,
names[1]: min_input_shape,
names[2]: min_input_shape,
names[3]: [1, head_number, 1, 1]
}, {
names[0]: max_input_shape,
names[1]: max_input_shape,
names[2]: max_input_shape,
names[3]: [100, head_number, 128, 128]
}, {
names[0]: opt_input_shape,
names[1]: opt_input_shape,
names[2]: opt_input_shape,
names[3]: [10, head_number, 60, 60]
})
```
# How to configure TensorRT dynamic shape
([简体中文](./TensorRT_Dynamic_Shape_CN.md)|English)
## Overview
After enabling TensorRT with `--use_trt` in Pipeline/C++ Serving, dynamic shapes can be configured as follows; examples are given for Pipeline Serving and C++ Serving respectively.
The dynamic shape API is:
```
void SetTRTDynamicShapeInfo(
std::map<std::string, std::vector<int>> min_input_shape,
std::map<std::string, std::vector<int>> max_input_shape,
std::map<std::string, std::vector<int>> optim_input_shape,
bool disable_trt_plugin_fp16 = false);
```
For details, please refer to the API docs: [C++](https://paddleinference.paddlepaddle.org.cn/api_reference/cxx_api_doc/Config/GPUConfig.html#tensorrt)/[Python](https://paddleinference.paddlepaddle.org.cn/api_reference/python_api_doc/Config/GPUConfig.html#tensorrt)
### C++ Serving
Modify the following code in `**/paddle_inference/paddle/include/paddle_engine.h`
```
if (engine_conf.has_use_trt() && engine_conf.use_trt()) {
config.SwitchIrOptim(true);
if (!engine_conf.has_use_gpu() || !engine_conf.use_gpu()) {
config.EnableUseGpu(50, gpu_id);
if (engine_conf.has_gpu_multi_stream() &&
engine_conf.gpu_multi_stream()) {
config.EnableGpuMultiStream();
}
}
config.EnableTensorRtEngine((1 << 30) + (1 << 29),
max_batch,
min_subgraph_size,
precision_type,
true,
FLAGS_use_calib);
// set trt dynamic shape
{
int bsz = 1;
int max_seq_len = 512;
std::map<std::string, std::vector<int>> min_input_shape;
std::map<std::string, std::vector<int>> max_input_shape;
std::map<std::string, std::vector<int>> optim_input_shape;
int hidden_size = 0;
min_input_shape["stack_0.tmp_0"] = {1, 16, 1, 1};
min_input_shape["stack_1.tmp_0"] = {1, 2, 1, 1};
min_input_shape["input_mask"] = {1, 1, 1};
min_input_shape["_generated_var_64"] = {1, 1, 768};
min_input_shape["fc_0.tmp_0"] = {1, 1, 768};
min_input_shape["_generated_var_87"] = {1, 1, 768};
min_input_shape["tmp_175"] = {1, 1, 768};
min_input_shape["c_allreduce_sum_0.tmp_0"] = {1,1, 12288};
min_input_shape["embedding_1.tmp_0"] = {1, 1, 12288};
max_input_shape["stack_0.tmp_0"] = {bsz, 16, max_seq_len, max_seq_len};
max_input_shape["stack_1.tmp_0"] = {bsz, 2, max_seq_len, max_seq_len};
max_input_shape["input_mask"] = {bsz, max_seq_len, max_seq_len};
max_input_shape["_generated_var_64"] = {bsz, max_seq_len, 768};
max_input_shape["fc_0.tmp_0"] = {bsz, max_seq_len, 768};
max_input_shape["_generated_var_87"] = {bsz, max_seq_len, 768};
max_input_shape["tmp_175"] = {bsz, max_seq_len, 768};
max_input_shape["c_allreduce_sum_0.tmp_0"] = {bsz,max_seq_len, 12288};
max_input_shape["embedding_1.tmp_0"] = {bsz, max_seq_len, 12288};
int g1 = 0;
int g2 = 0;
int t1 = 0;
int t2 = 0;
std::string var_name = "_generated_var_";
std::string tmp_name = "tmp_";
for (int i = 0; i < 44; ++i) {
if (i > 32) {
hidden_size = 768;
g1 = 2*i-1;
g2 = 2*i;
t1 = 4*i-1;
t2 = 4*i;
min_input_shape[var_name+std::to_string(g1)] = {1, 1, hidden_size};
min_input_shape[var_name+std::to_string(g2)] = {1, 1, hidden_size};
min_input_shape[tmp_name+std::to_string(t1)] = {1, 1, hidden_size};
min_input_shape[tmp_name+std::to_string(t2)] = {1, 1, hidden_size};
max_input_shape[var_name+std::to_string(g1)] = {bsz, max_seq_len, hidden_size};
max_input_shape[var_name+std::to_string(g2)] = {bsz, max_seq_len, hidden_size};
max_input_shape[tmp_name+std::to_string(t1)] = {bsz, max_seq_len, hidden_size};
max_input_shape[tmp_name+std::to_string(t2)] = {bsz, max_seq_len, hidden_size};
}
if (i <32) {
hidden_size = 12288;
g1 = 2*i;
g2 = 2*i+1;
t1 = 4*i;
t2 = 4*i+3;
min_input_shape[var_name+std::to_string(g1)] = {1, 1, hidden_size};
min_input_shape[var_name+std::to_string(g2)] = {1, 1, hidden_size};
min_input_shape[tmp_name+std::to_string(t1)] = {1, 1, hidden_size};
min_input_shape[tmp_name+std::to_string(t2)] = {1, 1, hidden_size};
max_input_shape[var_name+std::to_string(g1)] = {bsz, max_seq_len, hidden_size};
max_input_shape[var_name+std::to_string(g2)] = {bsz, max_seq_len, hidden_size};
max_input_shape[tmp_name+std::to_string(t1)] = {bsz, max_seq_len, hidden_size};
max_input_shape[tmp_name+std::to_string(t2)] = {bsz, max_seq_len, hidden_size};
}
}
optim_input_shape = max_input_shape;
config.SetTRTDynamicShapeInfo(
min_input_shape, max_input_shape, optim_input_shape);
}
LOG(INFO) << "create TensorRT predictor";
}
```
### Pipeline Serving
Modify the following code in `**/python/paddle_serving_app/local_predict.py`
```
if use_trt:
config.enable_tensorrt_engine(
precision_mode=precision_type,
workspace_size=1 << 20,
max_batch_size=32,
min_subgraph_size=3,
use_static=False,
use_calib_mode=False)
head_number = 12
names = [
"placeholder_0", "placeholder_1", "placeholder_2", "stack_0.tmp_0"
]
min_input_shape = [1, 1, 1]
max_input_shape = [100, 128, 1]
opt_input_shape = [10, 60, 1]
config.set_trt_dynamic_shape_info(
{
names[0]: min_input_shape,
names[1]: min_input_shape,
names[2]: min_input_shape,
names[3]: [1, head_number, 1, 1]
}, {
names[0]: max_input_shape,
names[1]: max_input_shape,
names[2]: max_input_shape,
names[3]: [100, head_number, 128, 128]
}, {
names[0]: opt_input_shape,
names[1]: opt_input_shape,
names[2]: opt_input_shape,
names[3]: [10, head_number, 60, 60]
})
```
\ No newline at end of file
......@@ -20,9 +20,9 @@ from paddle_serving_server import OpSeqMaker
from paddle_serving_server import Server
op_maker = OpMaker()
read_op = op_maker.create('general_reader')
general_dist_kv_infer_op = op_maker.create('general_dist_kv_infer')
response_op = op_maker.create('general_response')
read_op = op_maker.create('GeneralReaderOp')
general_dist_kv_infer_op = op_maker.create('GeneralDistKVInferOp')
response_op = op_maker.create('GeneralResponseOp')
op_seq_maker = OpSeqMaker()
op_seq_maker.add_op(read_op)
......
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .proto import server_configure_pb2 as server_sdk
import google.protobuf.text_format
import collections
class OpMaker(object):
def __init__(self):
self.op_dict = {
"general_infer": "GeneralInferOp",
"general_reader": "GeneralReaderOp",
"general_response": "GeneralResponseOp",
"general_text_reader": "GeneralTextReaderOp",
"general_text_response": "GeneralTextResponseOp",
"general_single_kv": "GeneralSingleKVOp",
"general_dist_kv_infer": "GeneralDistKVInferOp",
"general_dist_kv": "GeneralDistKVOp",
"general_copy": "GeneralCopyOp",
"general_detection":"GeneralDetectionOp",
}
self.op_list = [
"GeneralInferOp",
"GeneralReaderOp",
"GeneralResponseOp",
"GeneralTextReaderOp",
"GeneralTextResponseOp",
"GeneralSingleKVOp",
"GeneralDistKVInferOp",
"GeneralDistKVOp",
"GeneralCopyOp",
"GeneralDetectionOp",
]
self.node_name_suffix_ = collections.defaultdict(int)
def create(self, node_type, engine_name=None, inputs=[], outputs=[]):
if node_type not in self.op_dict:
if node_type not in self.op_list:
raise Exception("Op type {} is not supported right now".format(
node_type))
node = server_sdk.DAGNode()
......@@ -32,7 +46,7 @@ class OpMaker(object):
self.node_name_suffix_[node_type])
self.node_name_suffix_[node_type] += 1
node.type = self.op_dict[node_type]
node.type = node_type
if inputs:
for dep_node_str in inputs:
dep_node = server_sdk.DAGNode()
......@@ -47,6 +61,7 @@ class OpMaker(object):
# overall efficiency.
return google.protobuf.text_format.MessageToString(node)
class OpSeqMaker(object):
def __init__(self):
self.workflow = server_sdk.Workflow()
......@@ -79,6 +94,7 @@ class OpSeqMaker(object):
workflow_conf.workflows.extend([self.workflow])
return workflow_conf
# TODO:Currently, SDK only supports "Sequence".OpGraphMaker is not useful.
# Config should be changed to adapt command-line for list[dict] or list[list[] ]
class OpGraphMaker(object):
......
......@@ -142,6 +142,8 @@ def serve_args():
help="Max batch of each op")
parser.add_argument(
"--model", type=str, default="", nargs="+", help="Model for serving")
parser.add_argument(
"--op", type=str, default="", nargs="+", help="Model for serving")
parser.add_argument(
"--workdir",
type=str,
......@@ -183,7 +185,10 @@ def serve_args():
parser.add_argument(
"--use_xpu", default=False, action="store_true", help="Use XPU")
parser.add_argument(
"--use_ascend_cl", default=False, action="store_true", help="Use Ascend CL")
"--use_ascend_cl",
default=False,
action="store_true",
help="Use Ascend CL")
parser.add_argument(
"--product_name",
type=str,
......@@ -208,6 +213,11 @@ def start_gpu_card_model(gpu_mode, port, args): # pylint: disable=doc-string-mi
if gpu_mode == True:
device = "gpu"
import paddle_serving_server as serving
op_maker = serving.OpMaker()
op_seq_maker = serving.OpSeqMaker()
server = serving.Server()
thread_num = args.thread
model = args.model
mem_optim = args.mem_optim_off is False
......@@ -215,38 +225,58 @@ def start_gpu_card_model(gpu_mode, port, args): # pylint: disable=doc-string-mi
use_mkl = args.use_mkl
max_body_size = args.max_body_size
workdir = "{}_{}".format(args.workdir, port)
dag_list_op = []
if model == "":
print("You must specify your serving model")
exit(-1)
for single_model_config in args.model:
if os.path.isdir(single_model_config):
pass
elif os.path.isfile(single_model_config):
raise ValueError("The input of --model should be a dir not file.")
import paddle_serving_server as serving
op_maker = serving.OpMaker()
op_seq_maker = serving.OpSeqMaker()
read_op = op_maker.create('general_reader')
# If custom OPs are passed via --op, e.g. --op GeneralDetectionOp GeneralRecOp,
# add the previously unknown custom OPs to the DAG op list and the engine list,
# and record the order in which they were passed in dag_list_op.
if args.op != "":
for single_op in args.op:
temp_str_list = single_op.split(':')
if len(temp_str_list) >= 1 and temp_str_list[0] != '':
if temp_str_list[0] not in op_maker.op_list:
op_maker.op_list.append(temp_str_list[0])
if len(temp_str_list) >= 2 and temp_str_list[1] == '0':
pass
else:
server.default_engine_types.append(temp_str_list[0])
dag_list_op.append(temp_str_list[0])
read_op = op_maker.create('GeneralReaderOp')
op_seq_maker.add_op(read_op)
for idx, single_model in enumerate(model):
infer_op_name = "general_infer"
# Currently the OCR Det node depends on the third-party OpenCV library.
# OpenCV is only linked in, and GeneralDetectionOp compiled, when OCR is used,
# so it is special-cased here; when the condition below is not met, the added op defaults to GeneralInferOp.
# In the future we may stop generating the config from a Python script.
if len(model) == 2 and idx == 0 and single_model == "ocr_det_model":
infer_op_name = "general_detection"
else:
infer_op_name = "general_infer"
general_infer_op = op_maker.create(infer_op_name)
op_seq_maker.add_op(general_infer_op)
#If dag_list_op is not empty, custom OPs or a custom DAG chain were passed via --op.
#In that case, build the DAG chain in the order given by --op.
if len(dag_list_op) > 0:
for single_op in dag_list_op:
op_seq_maker.add_op(op_maker.create(single_op))
#Otherwise, chain the OPs based on --model as before.
else:
for idx, single_model in enumerate(model):
infer_op_name = "GeneralInferOp"
# Currently the OCR Det node depends on the third-party OpenCV library.
# OpenCV is only linked in, and GeneralDetectionOp compiled, when OCR is used,
# so it is special-cased here; when the condition below is not met, the added op defaults to GeneralInferOp.
# In the future we may stop generating the config from a Python script.
if len(model) == 2 and idx == 0 and single_model == "ocr_det_model":
infer_op_name = "GeneralDetectionOp"
else:
infer_op_name = "GeneralInferOp"
general_infer_op = op_maker.create(infer_op_name)
op_seq_maker.add_op(general_infer_op)
general_response_op = op_maker.create('general_response')
general_response_op = op_maker.create('GeneralResponseOp')
op_seq_maker.add_op(general_response_op)
server = serving.Server()
server.set_op_sequence(op_seq_maker.get_op_sequence())
server.set_num_threads(thread_num)
server.use_mkl(use_mkl)
......
......@@ -49,8 +49,8 @@ class Server(object):
self.workflow_fn:'str'="workflow.prototxt" # Only one for one Service/Workflow
self.resource_fn:'str'="resource.prototxt" # Only one for one Service,model_toolkit_fn and general_model_config_fn is recorded in this file
self.infer_service_fn:'str'="infer_service.prototxt" # Only one for one Service,Service--Workflow
self.model_toolkit_fn:'list'=[] # ["general_infer_0/model_toolkit.prototxt"]The quantity is equal to the InferOp quantity,Engine--OP
self.general_model_config_fn:'list'=[] # ["general_infer_0/general_model.prototxt"]The quantity is equal to the InferOp quantity,Feed and Fetch --OP
self.model_toolkit_fn:'list'=[] # ["GeneralInferOp_0/model_toolkit.prototxt"]The quantity is equal to the InferOp quantity,Engine--OP
self.general_model_config_fn:'list'=[] # ["GeneralInferOp_0/general_model.prototxt"]The quantity is equal to the InferOp quantity,Feed and Fetch --OP
self.subdirectory:'list'=[] # The quantity is equal to the InferOp quantity, and name = node.name = engine.name
self.model_config_paths:'collections.OrderedDict()' # Save the serving_server_conf.prototxt path (feed and fetch information) this is a map for multi-model in a workflow
"""
......@@ -92,6 +92,12 @@ class Server(object):
self.model_config_paths = collections.OrderedDict()
self.product_name = None
self.container_id = None
self.default_engine_types = [
'GeneralInferOp',
'GeneralDistKVInferOp',
'GeneralDistKVQuantInferOp',
'GeneralDetectionOp',
]
def get_fetch_list(self, infer_node_idx=-1):
fetch_names = [
......@@ -298,7 +304,7 @@ class Server(object):
fout.write(str(list(self.model_conf.values())[idx]))
for workflow in self.workflow_conf.workflows:
for node in workflow.nodes:
if "dist_kv" in node.name:
if "distkv" in node.name.lower():
self.resource_conf.cube_config_path = workdir
self.resource_conf.cube_config_file = self.cube_config_fn
if cube_conf == None:
......@@ -306,7 +312,7 @@ class Server(object):
"Please set the path of cube.conf while use dist_kv op."
)
shutil.copy(cube_conf, workdir)
if "quant" in node.name:
if "quant" in node.name.lower():
self.resource_conf.cube_quant_bits = 8
self.resource_conf.model_toolkit_path.extend([workdir])
self.resource_conf.model_toolkit_file.extend(
......@@ -343,17 +349,12 @@ class Server(object):
# If there is only one model path, use the default infer_op.
# Because there are several infer_op type, we need to find
# it from workflow_conf.
default_engine_types = [
'GeneralInferOp',
'GeneralDistKVInferOp',
'GeneralDistKVQuantInferOp',
'GeneralDetectionOp',
]
# now only support single-workflow.
# TODO:support multi-workflow
model_config_paths_list_idx = 0
for node in self.workflow_conf.workflows[0].nodes:
if node.type in default_engine_types:
if node.type in self.default_engine_types:
if node.name is None:
raise Exception(
"You have set the engine_name of Op. Please use the form {op: model_path} to configure model path"
......
......@@ -157,19 +157,19 @@ class WebService(object):
op_maker = OpMaker()
op_seq_maker = OpSeqMaker()
read_op = op_maker.create('general_reader')
read_op = op_maker.create('GeneralReaderOp')
op_seq_maker.add_op(read_op)
for idx, single_model in enumerate(self.server_config_dir_paths):
infer_op_name = "general_infer"
infer_op_name = "GeneralInferOp"
if len(self.server_config_dir_paths) == 2 and idx == 0:
infer_op_name = "general_detection"
infer_op_name = "GeneralDetectionOp"
else:
infer_op_name = "general_infer"
infer_op_name = "GeneralInferOp"
general_infer_op = op_maker.create(infer_op_name)
op_seq_maker.add_op(general_infer_op)
general_response_op = op_maker.create('general_response')
general_response_op = op_maker.create('GeneralResponseOp')
op_seq_maker.add_op(general_response_op)
server.set_op_sequence(op_seq_maker.get_op_sequence())
......